Textcat not excluding dataset.

Tim · June 28, 2020, 7:09pm

I was working on manual textcat (using textcat.manual) yesterday and had to stop, so I saved my work, assuming I could exclude an existing dataset, which according to the documentation is possible. However, when I run the following command, it does not exclude the data in my original dataset "cats_m-1" and still asks me the questions I already answered in "cats_m-1".

prodigy textcat.manual cats_m-f2 data\all_data.jsonl --label <Long list of labels here> -E -e cats_m-1

I could just write a script that automatically removes the data in my original dataset from my jsonl file, but this seems unnecessary when there is a dedicated option to exclude an existing dataset.
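For reference, the fallback script mentioned above could look roughly like the sketch below. It assumes the annotated dataset has been exported to a JSONL file first and that every example carries a "text" key; the function names and file paths are hypothetical. Matching on the exact text is only an approximation of Prodigy's input hash, not the same mechanism.

```python
import json


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def remove_annotated(source_path, annotated_path, output_path):
    """Write the source examples whose text is not in the annotated export.

    Returns the number of examples kept.
    """
    # Texts that were already annotated (assumption: exact string match is enough)
    seen = {ex["text"] for ex in read_jsonl(annotated_path) if "text" in ex}
    kept = [ex for ex in read_jsonl(source_path) if ex.get("text") not in seen]
    with open(output_path, "w", encoding="utf-8") as f:
        for ex in kept:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
    return len(kept)
```

The resulting file could then be passed to the recipe in place of the original source file.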

ines · June 29, 2020, 8:22am

Hi! That workflow sounds reasonable and your command looks correct. By default, the textcat.manual recipe with options will exclude based on the input hash (representing the text), so if a question with the same input hash already exists, it should skip it.

If you look at the _input_hash values in the data, are they identical? And what does it say about excluding in the logs when you set PRODIGY_LOGGING=basic?
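One way to check this is to export both datasets and compare their _input_hash values. The sketch below assumes each dataset was exported to a JSONL file beforehand (for example with prodigy db-out); the file names and helper names are hypothetical.

```python
import json
from pathlib import Path


def load_input_hashes(path):
    """Collect the `_input_hash` values from a JSONL export (one JSON object per line)."""
    hashes = set()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "_input_hash" in record:
                hashes.add(record["_input_hash"])
    return hashes


def shared_input_hashes(old_path, new_path):
    """Return the input hashes that appear in both exports."""
    return load_input_hashes(old_path) & load_input_hashes(new_path)
```

Calling shared_input_hashes("cats_m-1.jsonl", "cats_m-f2.jsonl") returns the hashes present in both files. A non-empty result would mean the same inputs were annotated in both datasets, i.e. they were not excluded.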

Tim · June 29, 2020, 11:34am

Hey.

"If you look at the _input_hash values in the data, are they identical?"

They indeed are.

"And what does it say about excluding in the logs when you set PRODIGY_LOGGING=basic?"

I've not been able to figure out how to enable this. I looked in the documentation, and it simply tells me to set an environment variable. I am unable to find where this is located.

ines · June 29, 2020, 12:58pm

Ah, so this is just a regular environment variable that you can set however you'd normally set an environment variable on your platform (differs by platform, dev environment etc.).

If you're on a Mac or on Linux, you can do the following. (Also see the "Installation & Setup" page of the Prodigy documentation for an example.)

PRODIGY_LOGGING=basic prodigy textcat.manual cats_m-f2 data\all_data.jsonl --label <Long list of labels here> -E -e cats_m-1
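The inline VAR=value form above is POSIX shell syntax, which the Windows cmd prompt does not support. A platform-independent alternative is to set the variable from Python and start the command as a child process. This is only a sketch: run_with_logging is a hypothetical helper, not part of Prodigy.

```python
import os
import subprocess


def run_with_logging(command, level="basic", **kwargs):
    """Run `command` (a list of arguments) with PRODIGY_LOGGING set for the child only.

    Extra keyword arguments are passed through to subprocess.run.
    """
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["PRODIGY_LOGGING"] = level
    return subprocess.run(command, env=env, **kwargs)
```

To use it, pass the same arguments as in the command above as a list of strings, starting with "prodigy" and "textcat.manual". The call blocks until the server is stopped.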

Tim · June 29, 2020, 4:49pm

Got it, thank you.
Log below:

18:47:04: RECIPE: Starting recipe textcat.manual
18:47:04: RECIPE: Annotating with 9 labels
18:47:04: LOADER: Using file extension 'jsonl' to find loader
18:47:04: LOADER: Loading stream from jsonl
18:47:04: LOADER: Rehashing stream
18:47:04: VALIDATE: Validating components returned by recipe
18:47:04: CONTROLLER: Initialising from recipe
18:47:04: VALIDATE: Creating validator for view ID 'choice'
18:47:04: VALIDATE: Validating Prodigy and recipe config
18:47:04: DB: Initializing database SQLite
18:47:04: DB: Connecting to database SQLite
18:47:04: DB: Creating dataset '2020-06-29_18-47-04'
18:47:04: CONTROLLER: Initialising from recipe
18:47:04: CONTROLLER: Validating the first batch for session: None
18:47:04: PREPROCESS: Add multiple choice options for 9 labels
18:47:04: FILTER: Filtering duplicates from stream
18:47:04: FILTER: Filtering out empty examples for key 'text'
18:47:04: CORS: initialized with wildcard "*" CORS origins

Personally, I can't see anything related to excluding a dataset.

ines · June 29, 2020, 4:56pm

Cool, glad it worked! This looks like it's just the log on startup – the filtering happens when the stream is loaded and processed, so you might just have to open the app in the browser so it starts queuing up some examples.

Tim · June 29, 2020, 5:00pm

Ok, I've done that. It's now showing the following.

INFO: ::1:52278 - "GET / HTTP/1.1" 200 OK
INFO: ::1:52278 - "GET /bundle.js HTTP/1.1" 200 OK
18:56:46: GET: /project
INFO: ::1:52278 - "GET /project HTTP/1.1" 200 OK
18:56:46: POST: /get_session_questions
18:56:46: FEED: Finding next batch of questions in stream
18:56:46: RESPONSE: /get_session_questions (10 examples)
INFO: ::1:52278 - "POST /get_session_questions HTTP/1.1" 200 OK
18:57:17: POST: /get_session_questions
18:57:17: FEED: Finding next batch of questions in stream
18:57:17: FEED: skipped: 1884547963 (text)
18:57:17: FEED: skipped: -980977358 (text)
18:57:17: FEED: skipped: 1450570456 (text)
18:57:17: FEED: skipped: 1842367882 (text)
18:57:17: FEED: skipped: 981677283 (text)
18:57:17: FEED: skipped: -2003029328 (text)
18:57:17: FEED: skipped: -1195566099 (text)
18:57:17: FEED: skipped: 966435862 (text)
18:57:17: RESPONSE: /get_session_questions (10 examples)
INFO: ::1:52279 - "POST /get_session_questions HTTP/1.1" 200 OK

For some reason, it seems to be working now, which is weird because my command is exactly the same; the only difference is that I enabled logging.
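To confirm that the skipped examples really come from the excluded dataset, the numbers in the "FEED: skipped" lines can be collected and compared against the _input_hash values in an export of "cats_m-1". The sketch below relies only on the log format shown above; that these numbers are input hashes is an assumption, and skipped_hashes is a hypothetical helper.

```python
import re

# Matches lines such as "18:57:17: FEED: skipped: -980977358 (text)"
SKIPPED = re.compile(r"FEED: skipped: (-?\d+) \((\w+)\)")


def skipped_hashes(log_text):
    """Return the integer hashes from all `FEED: skipped` lines, in log order."""
    return [int(match.group(1)) for match in SKIPPED.finditer(log_text)]
```

Every returned hash should also appear among the _input_hash values of the excluded dataset if the exclusion is working as intended.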

ines · June 30, 2020, 11:43am

Glad to hear it's working! Still a little strange, though, because the logging just toggles Python's logging and I don't see how there could be any interaction there. Anyway, if it happens again, let me know. The only possible theory I can think of is that for some reason, the -e cats_m-1 at the end of your command may not have been interpreted correctly... but then again, this would likely have caused an error.
