I was working on manual textcat (using textcat.manual) yesterday and had to stop, so I saved my work, assuming I could exclude an existing dataset later, which according to the documentation is possible. However, when I run the following command, it does not exclude the data in my original dataset "cats_m-1" and still asks me the questions I already answered in "cats_m-1":
prodigy textcat.manual cats_m-f2 data\all_data.jsonl --label <Long list of labels here> -E -e cats_m-1
I could just write a script that automatically removes the data in my original dataset from my jsonl file, but this seems unnecessary when there is a dedicated option to exclude an existing dataset.
Hi! That workflow sounds reasonable and your command looks correct. By default, the textcat.manual recipe with options excludes based on the input hash (representing the text), so if a question with the same input hash already exists in the excluded dataset, it should be skipped.
If you look at the _input_hash values in the data, are they identical? And what does it say about excluding in the logs when you set PRODIGY_LOGGING=basic?
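In case it helps with the comparison, here is a minimal sketch for checking which _input_hash values two sets of annotations have in common. It assumes both datasets have been exported to JSONL (for example with prodigy db-out) and that each record carries an _input_hash field. The function names and the sample records are made up for illustration and are not part of Prodigy.

```python
import json


def input_hashes(lines):
    """Collect the _input_hash values from an iterable of JSONL lines."""
    hashes = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        if "_input_hash" in task:
            hashes.add(task["_input_hash"])
    return hashes


def shared_input_hashes(old_lines, new_lines):
    """Return the input hashes that appear in both exports."""
    return input_hashes(old_lines) & input_hashes(new_lines)


# Made-up records standing in for two exported datasets.
old = [
    '{"text": "first text", "_input_hash": 111, "answer": "accept"}',
    '{"text": "second text", "_input_hash": 222, "answer": "accept"}',
]
new = [
    '{"text": "second text", "_input_hash": 222, "answer": "accept"}',
    '{"text": "third text", "_input_hash": 333, "answer": "accept"}',
]

overlap = shared_input_hashes(old, new)
print(sorted(overlap))
```

Open file objects iterate line by line, so the same function works if you pass it two files opened with open(). Any hash that shows up in the overlap is a question that was asked again even though it was already answered in the excluded dataset.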
If you look at the _input_hash values in the data, are they identical?
They indeed are.
And what does it say about excluding in the logs when you set PRODIGY_LOGGING=basic?
I've not been able to figure out how to enable this. I looked in the documentation, and it simply tells me to set an environment variable. I am unable to find where this is set.
Ah, so this is just a regular environment variable that you can set however you'd normally set an environment variable on your platform (differs by platform, dev environment etc.).
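If you'd rather not depend on the shell syntax of a particular platform, one option is to set the variable from a small Python launcher script. This is only a sketch: run_with_env is a hypothetical helper, and the command below is a stand-in that prints the variable back so you can see it arrive in the child process.

```python
import os
import subprocess
import sys


def run_with_env(cmd, extra_env, capture=False):
    """Run cmd with extra_env merged into a copy of the current environment."""
    env = os.environ.copy()
    env.update(extra_env)
    return subprocess.run(cmd, env=env, capture_output=capture, text=True)


# Stand-in command: a child Python process that echoes the variable.
check_cmd = [sys.executable, "-c", "import os; print(os.environ['PRODIGY_LOGGING'])"]
result = run_with_env(check_cmd, {"PRODIGY_LOGGING": "basic"}, capture=True)
print(result.stdout.strip())
```

To launch the actual recipe this way, you would pass your prodigy command as a list of arguments instead of check_cmd and leave capture at its default, so the log messages go straight to your terminal.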
Cool, glad it worked! This looks like it's just the log on startup – the filtering happens when the stream is loaded and processed, so you might just have to open the app in the browser so it starts queuing up some examples.
Glad to hear it's working! Still a little strange, though, because the logging setting just toggles Python's logging and I don't see how there could be any interaction there. Anyway, if it happens again, let me know. The only theory I can think of is that for some reason, the -e cats_m-1 at the end of your command may not have been interpreted correctly... but then again, this would likely have caused an error.