Textcat not excluding dataset.

I was working on manual textcat (using textcat.manual) yesterday and had to stop, so I saved my work assuming I could exclude an existing dataset, which according to the documentation is possible. However, when I run the following command, it does not exclude the data in my original dataset "cats_m-1" and still asks me the questions I already answered in "cats_m-1".

prodigy textcat.manual cats_m-f2 data\all_data.jsonl --label <Long list of labels here> -E -e cats_m-1

I could just write a script that automatically removes the data in my original dataset from my jsonl file, but this seems unnecessary when there is a special function to exclude an existing dataset.

Hi! That workflow sounds reasonable and your command looks correct. By default, the textcat.manual recipe with options will exclude based on the input hash (representing the text), so if a question with the same input hash already exists in the excluded dataset, it should be skipped.

If you look at the _input_hash values in the data, are they identical? And what does it say about excluding in the logs when you set PRODIGY_LOGGING=basic?
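One way to check this is to compare the _input_hash values in two exported files. Below is a minimal sketch, assuming both datasets have been exported to JSONL (for example with prodigy db-out cats_m-1 > cats_m-1.jsonl). The file names and the helper functions are hypothetical, not part of Prodigy.

```python
import json


def load_input_hashes(path):
    """Collect the _input_hash values from a JSONL file (one JSON object per line)."""
    hashes = set()
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            if "_input_hash" in record:
                hashes.add(record["_input_hash"])
    return hashes


def shared_input_hashes(old_path, new_path):
    """Return the input hashes that occur in both files."""
    return load_input_hashes(old_path) & load_input_hashes(new_path)


# Hypothetical usage, assuming both exports exist on disk:
# overlap = shared_input_hashes("cats_m-1.jsonl", "cats_m-f2.jsonl")
# print(len(overlap), "examples appear in both datasets")
```

If the overlap is non-empty after annotating with the exclude option, the same questions were shown twice.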

Hey.

If you look at the _input_hash values in the data, are they identical?

They indeed are.

And what does it say about excluding in the logs when you set PRODIGY_LOGGING=basic?

I've not been able to figure out how to enable this. I looked in the documentation, and it simply tells me to set an environment variable. I am unable to find where this is located.

Ah, so this is just a regular environment variable that you can set however you'd normally set an environment variable on your platform (differs by platform, dev environment etc.).

If you're on a Mac or on Linux, you can do the following. (Also see the "Installation & Setup" page of the Prodigy docs for an example.)

PRODIGY_LOGGING=basic prodigy textcat.manual cats_m-f2 data\all_data.jsonl --label <Long list of labels here> -E -e cats_m-1
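The inline VAR=value prefix above is shell syntax and won't work in the Windows command prompt. There you would run set PRODIGY_LOGGING=basic first and then the prodigy command. A platform-independent alternative is to set the variable from Python and launch the command as a subprocess. This is only a sketch: env_with_logging and run_with_logging are hypothetical helper names, and the command list has to be filled in with your own arguments.

```python
import os
import subprocess


def env_with_logging(level="basic", base_env=None):
    """Return a copy of the environment with PRODIGY_LOGGING set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PRODIGY_LOGGING"] = level
    return env


def run_with_logging(command, level="basic"):
    """Run a command (list of arguments) with PRODIGY_LOGGING set for that process only."""
    return subprocess.run(command, env=env_with_logging(level))


# Hypothetical usage; replace the label placeholder with your own labels:
# run_with_logging([
#     "prodigy", "textcat.manual", "cats_m-f2", r"data\all_data.jsonl",
#     "--label", "<Long list of labels here>", "-E", "-e", "cats_m-1",
# ])
```

Because the variable is only set on the copied environment, it doesn't leak into the rest of your session.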

Got it, thank you.
Log below:

18:47:04: RECIPE: Starting recipe textcat.manual
18:47:04: RECIPE: Annotating with 9 labels
18:47:04: LOADER: Using file extension 'jsonl' to find loader
18:47:04: LOADER: Loading stream from jsonl
18:47:04: LOADER: Rehashing stream
18:47:04: VALIDATE: Validating components returned by recipe
18:47:04: CONTROLLER: Initialising from recipe
18:47:04: VALIDATE: Creating validator for view ID 'choice'
18:47:04: VALIDATE: Validating Prodigy and recipe config
18:47:04: DB: Initializing database SQLite
18:47:04: DB: Connecting to database SQLite
18:47:04: DB: Creating dataset '2020-06-29_18-47-04'
18:47:04: CONTROLLER: Initialising from recipe
18:47:04: CONTROLLER: Validating the first batch for session: None
18:47:04: PREPROCESS: Add multiple choice options for 9 labels
18:47:04: FILTER: Filtering duplicates from stream
18:47:04: FILTER: Filtering out empty examples for key 'text'
18:47:04: CORS: initialized with wildcard "*" CORS origins

Personally can't see anything related to excluding a dataset.

Cool, glad it worked! This looks like it's just the log on startup – the filtering happens when the stream is loaded and processed, so you might just have to open the app in the browser so it starts queuing up some examples.

Ok, I've done that. It's now showing the following.

INFO: ::1:52278 - "GET / HTTP/1.1" 200 OK
INFO: ::1:52278 - "GET /bundle.js HTTP/1.1" 200 OK
18:56:46: GET: /project
INFO: ::1:52278 - "GET /project HTTP/1.1" 200 OK
18:56:46: POST: /get_session_questions
18:56:46: FEED: Finding next batch of questions in stream
18:56:46: RESPONSE: /get_session_questions (10 examples)
INFO: ::1:52278 - "POST /get_session_questions HTTP/1.1" 200 OK
18:57:17: POST: /get_session_questions
18:57:17: FEED: Finding next batch of questions in stream
18:57:17: FEED: skipped: 1884547963 (text)
18:57:17: FEED: skipped: -980977358 (text)
18:57:17: FEED: skipped: 1450570456 (text)
18:57:17: FEED: skipped: 1842367882 (text)
18:57:17: FEED: skipped: 981677283 (text)
18:57:17: FEED: skipped: -2003029328 (text)
18:57:17: FEED: skipped: -1195566099 (text)
18:57:17: FEED: skipped: 966435862 (text)
18:57:17: RESPONSE: /get_session_questions (10 examples)
INFO: ::1:52279 - "POST /get_session_questions HTTP/1.1" 200 OK

For some reason, I think it's working right now. Which is weird because my command is exactly the same, the only difference is that I enabled logging.
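To double-check which examples were skipped, the numbers in the "FEED: skipped" lines can be pulled out of the log and compared with the hashes stored in the excluded dataset. A rough sketch, assuming the log format matches the lines shown above. The regular expression and the function name are my own, and I haven't verified which hash the number refers to.

```python
import re

# Matches lines such as "18:57:17: FEED: skipped: -980977358 (text)"
SKIPPED_LINE = re.compile(r"FEED: skipped: (-?\d+) \((\w+)\)")


def skipped_hashes(log_text):
    """Return the hash values from all 'FEED: skipped' lines, in log order."""
    return [int(match.group(1)) for match in SKIPPED_LINE.finditer(log_text)]


# Hypothetical usage with a saved log file:
# with open("prodigy.log", encoding="utf8") as f:
#     print(skipped_hashes(f.read()))
```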

Glad to hear it's working! Still a little strange, though, because the logging setting just toggles Python's logging and I don't see how there could be any interaction there. Anyway, if it happens again let me know. The only theory I can think of is that for some reason, the -e cats_m-1 at the end of your command may not have been interpreted correctly... but then again, this would likely have caused an error.