textcat.teach examples from source or from dataset

Hi!

I have text from research projects in different fields, and I want to classify the projects by research field. I have around 50k projects and I have defined a label for 8k of them. For the rest, I would like to apply the model in order to assign them a label.

First of all, as I have limited training data, I used spacy pretrain to initialize the model with transfer learning on text from the research projects.
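The pretraining step looked roughly like this (the name of the raw-texts file is just a placeholder, and the JSONL contains one {"text": ...} entry per project):

python -m spacy pretrain raw_project_texts.jsonl en_vectors_web_lg ./pretrained-model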

After that, I added the labelled data to a dataset in Prodigy, with the same number of accept and reject examples (the 8k accepted, plus the same 8k with a different label and rejected).
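For reference, the pre-labelled examples follow roughly this format (texts and label names are simplified placeholders here), imported into the dataset with something like db-in:

{"text": "Project abstract about solar energy storage.", "label": "ENERGY", "answer": "accept"}
{"text": "Project abstract about solar energy storage.", "label": "HEALTH", "answer": "reject"}

prodigy db-in textcat_test_reject prelabelled.jsonl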

Then I used textcat.batch-train to train the model that will classify the research projects:

prodigy textcat.batch-train textcat_test_reject en_vectors_web_lg -t2v "./pretrained-model/model22.bin" --eval-split 0.2 --output /tmp/model
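Afterwards, the plan is to load the exported model with spaCy and read the predicted categories for the remaining unlabelled projects, roughly like this (a sketch; the example text is a placeholder):

import spacy

# load the model written by textcat.batch-train via --output
nlp = spacy.load("/tmp/model")

doc = nlp("Project abstract about renewable energy storage.")
# doc.cats maps each label to a score between 0 and 1
best_label = max(doc.cats, key=doc.cats.get)
print(best_label, doc.cats[best_label])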

The problem is that I don't think I understood the textcat recipes properly: I thought that if I used textcat.teach with all 50k projects as the source, I would be able to accept or reject labels assigned to all of the projects, not just to those that already have a label (the ones in the dataset).

prodigy textcat.teach textcat_test_reject en_vectors_web_lg textcat_all_projects.jsonl --label ENERGY,HEALTH,...

Why can I only assign a label to the projects that are in the dataset, and not to all the projects in the source? I am confused, and I hope the question is not as confused as I am :slight_smile:

Thanks!

I'm not sure I understand your question correctly – but the dataset you pass into the recipes is the name of the dataset to save annotations to, not the source of the data. That's what the source argument is for. So there's usually no need to import data before annotating – unless you have pre-labelled data that you want to combine with new annotations.

When using recipes like textcat.teach, also keep in mind that they won't show you all examples. The main point of the active learning is to help select the most relevant examples for annotation – so Prodigy may skip very confident scores in favour of uncertain scores. If you just want to label all your examples as they come in, it probably makes more sense to use a manual recipe like textcat.manual instead.
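For example, something along these lines (the dataset name here is just a placeholder, and you'd pass your own labels):

prodigy textcat.manual textcat_manual_projects textcat_all_projects.jsonl --label ENERGY,HEALTH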

Thanks for your explanation, Ines.

Indeed, I assumed that the file with all the examples goes in the source argument, and that the examples imported into the dataset were the pre-labelled data that would be combined with the new annotations.

What I didn't understand was that it only shows the most relevant examples. I also have the problem that, after accepting or rejecting a few examples (between 4 and 8) that textcat.teach gives me, I receive the message "No tasks available". Why is this happening? Could there be an error in the dataset I am providing?

Thanks again!

Just to give more details: the source has 50 thousand examples, and the format is:

{"text": "This is a sentence."}
{"text": "This is another sentence."}

And I have imported pre-labelled data into the dataset:

  • 8 thousand accepted examples.

  • The same 8 thousand with a different label and rejected.

Could having duplicates in the dataset be a problem?

Thanks!

If you have duplicates, then yes, that could explain a lot. By default, Prodigy will skip examples that are already present in the dataset, so you're not annotating the same text plus label twice. This, in combination with the example selection and model picking what to ask about, can mean that there's not much left of your 50k examples.
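If you want to double-check what's in the dataset, you could connect to it from Python and look at the hashes Prodigy uses for that filtering, e.g. a rough sketch using the database API, with the dataset name from your command:

from prodigy.components.db import connect

db = connect()  # uses your Prodigy database settings
examples = db.get_dataset("textcat_test_reject")

# _input_hash identifies the input text, _task_hash the text plus the label/question
input_hashes = {eg.get("_input_hash") for eg in examples}
task_hashes = {eg.get("_task_hash") for eg in examples}
print(len(examples), "examples,", len(input_hashes), "unique texts,", len(task_hashes), "unique tasks")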

But there are still 42k examples that are not in the dataset, and it is only showing 8 of those 42k.

What I mean by duplicates is that I saw in the docs that both accepted and rejected examples have to be provided, so I provided the same 8k examples twice: once with the correct labels and accepted, and once with the wrong labels and rejected. Those are the duplicated ones, but there are still 42k examples that are not duplicated.

If you run the recipe with PRODIGY_LOGGING=basic, is there anything in the logs that looks suspicious? Like skipping a bunch of examples?

PRODIGY_LOGGING=basic prodigy textcat.teach textcat_test model_v2 textcat_all_projects.jsonl --label NABS132,NABS07,NABS11,NABS05,NABS08,NABS02,NABS04,NABS131,NABS06,NABS03,NABS14

This is the log:

13:29:00 - APP: Using Hug endpoints (deprecated)
13:29:01 - RECIPE: Calling recipe 'textcat.teach'
Using 11 labels: NABS132, NABS07, NABS11, NABS05, NABS08, NABS02, NABS04, NABS131, NABS06, NABS03, NABS14
13:29:01 - RECIPE: Starting recipe textcat.teach
13:29:43 - RECIPE: Creating TextClassifier with model model_v2
13:29:43 - LOADER: Using file extension 'jsonl' to find loader
13:29:43 - LOADER: Loading stream from jsonl
13:29:43 - LOADER: Rehashing stream
13:29:43 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
13:29:43 - CONTROLLER: Initialising from recipe
13:29:43 - VALIDATE: Creating validator for view ID 'classification'
13:29:43 - DB: Initialising database SQLite
13:29:43 - DB: Connecting to database SQLite
13:29:44 - DB: Loading dataset 'textcat_test' (17396 examples)
13:29:45 - DB: Creating dataset '2019-08-15_13-29-43'
13:29:45 - DatasetFilter: Getting hashes for excluded examples
13:29:45 - DatasetFilter: Excluding 17396 tasks from datasets: textcat_test
13:29:45 - CONTROLLER: Initialising from recipe
13:29:45 - CORS: initialize wildcard "*" CORS origins
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
13:30:05 - GET: /project
Task queue depth is 1
13:30:05 - POST: /get_session_questions
13:30:05 - FEED: Finding next batch of questions in stream
13:30:05 - CONTROLLER: Validating the first batch for session: textcat_test-default
13:30:05 - FILTER: Filtering duplicates from stream
13:30:05 - FILTER: Filtering out empty examples for key 'text'
13:30:06 - RESPONSE: /get_session_questions (10 examples)
13:32:23 - POST: /get_session_questions
13:32:23 - FEED: Finding next batch of questions in stream
13:32:23 - RESPONSE: /get_session_questions (10 examples)
13:32:49 - POST: /give_answers (received 8, session ID 'textcat_test-default')
13:32:49 - CONTROLLER: Receiving 8 answers
13:32:49 - PROGRESS: Estimating progress of 0.0909
13:32:49 - DB: Creating dataset 'textcat_test-default'
13:32:49 - DB: Getting dataset 'textcat_test'
13:32:49 - DB: Getting dataset 'textcat_test-default'
13:32:49 - DB: Added 8 examples to 2 datasets
13:32:49 - CONTROLLER: Added 8 answers to dataset 'textcat_test' in database SQLite
13:32:49 - RESPONSE: /give_answers

Is it normal to see get_session_questions (10 examples) in the log? Do you see anything else that could explain why I was only asked about 8 examples?

Thanks!

I have also just realised that if I save the 8 examples after it shows "No tasks available" and reload the browser, it gives me more examples (6 different examples the second time before showing "No tasks available" again). Then I save those and reload the browser again, and the third time I was able to answer 22 examples without it saying "No tasks available" so far.

Ahh, interesting – thanks for checking! Your model_v2 is based on the large word vectors, right? Maybe that just makes it slightly too slow for the current batch size, and the update is blocking it for long enough that it's not able to send out the next batch in time. Maybe try using a batch_size higher than 10? This will send out more questions at once so you have more to annotate, and will also update the model in larger batches at once.
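You can set that in your prodigy.json, e.g. something like this (30 is just an example value):

{
  "batch_size": 30
}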

Indeed, I guess the resulting model is based on the large word vectors.

I am going to try using a higher batch_size.

Thanks!