textcat.teach examples from source or from dataset

Hi!

I have text from research projects in different fields, and I want to classify the projects by research field. I have around 50k projects and I have defined a label for 8k of them. For the rest, I would like to apply the model in order to assign them a label.

First of all, as I have limited training data, I used spacy pretrain to initialize the model with transfer learning on text from the research projects.
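The pretraining step looked roughly like this (the name of the raw-texts file is just a placeholder, and the JSONL contains one {"text": ...} entry per project):

python -m spacy pretrain raw_project_texts.jsonl en_vectors_web_lg ./pretrained-model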

After that, I added the labelled data to a dataset in Prodigy, with the same number of accept and reject examples (the 8k accepted, plus the same 8k with a different label and rejected).
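For reference, the pre-labelled examples follow roughly this format (texts and label names are simplified placeholders here), imported into the dataset with something like db-in:

{"text": "Project abstract about solar energy storage.", "label": "ENERGY", "answer": "accept"}
{"text": "Project abstract about solar energy storage.", "label": "HEALTH", "answer": "reject"}

prodigy db-in textcat_test_reject prelabelled.jsonl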

Then I used textcat.batch-train to train the model that will classify the research projects:

prodigy textcat.batch-train textcat_test_reject en_vectors_web_lg -t2v "./pretrained-model/model22.bin" --eval-split 0.2 --output /tmp/model
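Afterwards, the plan is to load the exported model with spaCy and read the predicted categories for the remaining unlabelled projects, roughly like this (a sketch; the example text is a placeholder):

import spacy

# load the model written by textcat.batch-train via --output
nlp = spacy.load("/tmp/model")

doc = nlp("Project abstract about renewable energy storage.")
# doc.cats maps each label to a score between 0 and 1
best_label = max(doc.cats, key=doc.cats.get)
print(best_label, doc.cats[best_label])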

The problem is that I don't think I understood the textcat recipes properly: I thought that if I used textcat.teach with all 50k projects as the source, I would be able to accept or reject labels assigned to all of the projects, not just to those that already have a label (the ones in the dataset).

prodigy textcat.teach textcat_test_reject en_vectors_web_lg textcat_all_projects.jsonl --label ENERGY,HEALTH,...

Why can I only assign a label to the projects that are in the dataset, and not to all the projects in the source? I am confused, and I hope the question is not as confused as I am :slight_smile:

Thanks!

I'm not sure I understand your question correctly – but the dataset you pass into the recipes is the name of the dataset to save annotations to, not the source of the data. That's what the source argument is for. So there's usually no need to import data before annotating – unless you have pre-labelled data that you want to combine with new annotations.

When using recipes like textcat.teach, also keep in mind that they won't show you all examples. The main point of the active learning is to help select the most relevant examples for annotation – so Prodigy may skip very confident scores in favour of uncertain scores. If you just want to label all your examples as they come in, it probably makes more sense to use a manual recipe like textcat.manual instead.
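For example, something along these lines (the dataset name here is just a placeholder, and you'd pass your own labels):

prodigy textcat.manual textcat_manual_projects textcat_all_projects.jsonl --label ENERGY,HEALTH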

Thanks for your explanation, Ines.

Indeed, I assumed that the file with all the examples goes in the source argument, and that the examples imported into the dataset were the pre-labelled data that would be combined with the new annotations.

What I didn't understand was that it only shows the most relevant examples. I also have the problem that, after accepting or rejecting a few examples (between 4 and 8) that textcat.teach gives me, I receive the message "No tasks available". Why is this happening? Could there be an error in the dataset I am providing?

Thanks again!

Just to give more details: the source has 50 thousand examples, and the format is:

{"text": "This is a sentence."}
{"text": "This is another sentence."}

And I have imported pre-labelled data into the dataset:

  • 8 thousand accepted examples.

  • The same 8 thousand with a different label and rejected.

Could having duplicates in the dataset be a problem?

Thanks!

If you have duplicates, then yes, that could explain a lot. By default, Prodigy will skip examples that are already present in the dataset, so you're not annotating the same text plus label twice. This, in combination with the example selection and model picking what to ask about, can mean that there's not much left of your 50k examples.
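If you want to double-check what's in the dataset, you could connect to it from Python and look at the hashes Prodigy uses for that filtering, e.g. a rough sketch using the database API, with the dataset name from your command:

from prodigy.components.db import connect

db = connect()  # uses your Prodigy database settings
examples = db.get_dataset("textcat_test_reject")

# _input_hash identifies the input text, _task_hash the text plus the label/question
input_hashes = {eg.get("_input_hash") for eg in examples}
task_hashes = {eg.get("_task_hash") for eg in examples}
print(len(examples), "examples,", len(input_hashes), "unique texts,", len(task_hashes), "unique tasks")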

But there are still 42k examples that are not in the dataset, and it is only showing 8 of those 42k.

What I mean by duplicates is that I saw in the docs that both accepted and rejected examples have to be provided, so I provided the same 8k examples twice: once with the correct labels and accepted, and once with the wrong labels and rejected. Those are the duplicated ones, but there are still 42k examples that are not duplicated.

If you run the recipe with PRODIGY_LOGGING=basic, is there anything in the logs that looks suspicious? Like skipping a bunch of examples?

PRODIGY_LOGGING=basic prodigy textcat.teach textcat_test model_v2 textcat_all_projects.jsonl --label NABS132,NABS07,NABS11,NABS05,NABS08,NABS02,NABS04,NABS131,NABS06,NABS03,NABS14

This is the log:

13:29:00 - APP: Using Hug endpoints (deprecated)
13:29:01 - RECIPE: Calling recipe 'textcat.teach'
Using 11 labels: NABS132, NABS07, NABS11, NABS05, NABS08, NABS02, NABS04, NABS131, NABS06, NABS03, NABS14
13:29:01 - RECIPE: Starting recipe textcat.teach
13:29:43 - RECIPE: Creating TextClassifier with model model_v2
13:29:43 - LOADER: Using file extension 'jsonl' to find loader
13:29:43 - LOADER: Loading stream from jsonl
13:29:43 - LOADER: Rehashing stream
13:29:43 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
13:29:43 - CONTROLLER: Initialising from recipe
13:29:43 - VALIDATE: Creating validator for view ID 'classification'
13:29:43 - DB: Initialising database SQLite
13:29:43 - DB: Connecting to database SQLite
13:29:44 - DB: Loading dataset 'textcat_test' (17396 examples)
13:29:45 - DB: Creating dataset '2019-08-15_13-29-43'
13:29:45 - DatasetFilter: Getting hashes for excluded examples
13:29:45 - DatasetFilter: Excluding 17396 tasks from datasets: textcat_test
13:29:45 - CONTROLLER: Initialising from recipe
13:29:45 - CORS: initialize wildcard "*" CORS origins
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
13:30:05 - GET: /project
Task queue depth is 1
13:30:05 - POST: /get_session_questions
13:30:05 - FEED: Finding next batch of questions in stream
13:30:05 - CONTROLLER: Validating the first batch for session: textcat_test-default
13:30:05 - FILTER: Filtering duplicates from stream
13:30:05 - FILTER: Filtering out empty examples for key 'text'
13:30:06 - RESPONSE: /get_session_questions (10 examples)
13:32:23 - POST: /get_session_questions
13:32:23 - FEED: Finding next batch of questions in stream
13:32:23 - RESPONSE: /get_session_questions (10 examples)
13:32:49 - POST: /give_answers (received 8, session ID 'textcat_test-default')
13:32:49 - CONTROLLER: Receiving 8 answers
13:32:49 - PROGRESS: Estimating progress of 0.0909
13:32:49 - DB: Creating dataset 'textcat_test-default'
13:32:49 - DB: Getting dataset 'textcat_test'
13:32:49 - DB: Getting dataset 'textcat_test-default'
13:32:49 - DB: Added 8 examples to 2 datasets
13:32:49 - CONTROLLER: Added 8 answers to dataset 'textcat_test' in database SQLite
13:32:49 - RESPONSE: /give_answers

Is it normal to see get_session_questions (10 examples) in the log? Do you see anything else that could explain why I was only asked about 8 examples?

Thanks!

I have also just realised that if I save the 8 examples after it shows "No tasks available" and reload the browser, it gives me more examples (6 different examples the second time before showing "No tasks available" again). Then I save those and reload the browser again, and the third time I was able to answer 22 examples without it saying "No tasks available" so far.

Ahh, interesting – thanks for checking! Your model_v2 is based on the large word vectors, right? Maybe that just makes it slightly too slow for the current batch size, and the update is blocking it for long enough that it's not able to send out the next batch in time. Maybe try using a batch_size higher than 10? This will send out more questions at once so you have more to annotate, and will also update the model in larger batches at once.
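You can set that in your prodigy.json, e.g. something like this (30 is just an example value):

{
  "batch_size": 30
}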

Indeed, I guess the resulting model is based on the large word vectors.

I am going to try using a higher batch_size.

Thanks!