textcat.teach showing same text twice (and not using active learning?)

We are trying to use textcat.teach / active learning for text classification with multiple labels, e.g. wonen (housing), economie (economy), and criminaliteit (crime).

It seems to work and the coding interface is really nice and efficient, but two things are strange:

  1. The texts presented do not seem to be very relevant to the label. I’d expect an accept/reject ratio of close to 0.5, but it’s closer to accepting 10% of cases. Also, the texts for one label seem the same as for another label. If I look at db-out they also get the same score/priority regardless of label:

    $ ./prodigy db-out topics_train | grep "Wietkwekerij opgerold"
    {"text":"Wietkwekerij opgerold in Kruidenwijk […]","meta":{"id":188205522,"headline":"Wietkwekerij opgerold in Kruidenwijk","score":0.0406740457},"_input_hash":1753281594,"_task_hash":-258418788,"label":"economie","score":0.0406740457,"priority":0.0406740457,"spans":[],"answer":"reject"}
    {"text":"Wietkwekerij opgerold in Kruidenwijk […]","meta":{"id":188205522,"headline":"Wietkwekerij opgerold in Kruidenwijk","score":0.0406740457},"_input_hash":1753281594,"_task_hash":-258418788,"label":"wonen","score":0.0406740457,"priority":0.0406740457,"spans":[],"answer":"reject"}
    {"text":"Wietkwekerij opgerold in Kruidenwijk […]","meta":{"id":188205522,"headline":"Wietkwekerij opgerold in Kruidenwijk","score":0.0406740457},"_input_hash":1753281594,"_task_hash":-258418788,"label":"economie","score":0.0406740457,"priority":0.0406740457,"spans":[],"answer":"reject"}
    {"text":"Wietkwekerij opgerold in Kruidenwijk […]","meta":{"id":188205522,"headline":"Wietkwekerij opgerold in Kruidenwijk","score":0.0406740457},"_input_hash":1753281594,"_task_hash":-258418788,"label":"criminaliteit","score":0.0406740457,"priority":0.0406740457,"spans":[],"answer":"accept"}

  2. When I rerun the textcat.teach, I would expect to get different examples than in the first session. However, I get the same examples again, even though they are stored in the dataset. See above for the db-out, see below for the screenshot taken after starting a new session after the db-out.

Are we using textcat.teach correctly? Is it possible that our initial model is not trained correctly? It did seem to have around 60% accuracy on a test set (which is not horrible for 12 possible labels).

Thanks again!

– Wouter

By default, Prodigy makes no assumptions about what the current, existing dataset “means” – it’s only used to store the annotations. However, you can tell Prodigy to explicitly exclude examples you’ve already annotated in one or more datasets using the --exclude flag. For example --exclude topics_train will exclude all examples in the current dataset that have the same _task_hash as the incoming tasks. This means you won’t get asked the same question twice – but you’ll still be able to annotate the same text with a different label.
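To illustrate the mechanism, here's a minimal sketch of hash-based exclusion (the function and variable names are illustrative, not Prodigy's internal API):

```python
def exclude_seen(stream, seen_hashes):
    """Skip incoming tasks whose _task_hash was already annotated."""
    for task in stream:
        if task["_task_hash"] not in seen_hashes:
            yield task

# The same text with a different label gets a different _task_hash,
# so it can still be asked as a new question.
seen = {-258418788}
stream = [
    {"text": "Wietkwekerij opgerold", "_task_hash": -258418788, "label": "economie"},
    {"text": "Wietkwekerij opgerold", "_task_hash": 111, "label": "wonen"},
]
print([t["label"] for t in exclude_seen(stream, seen)])  # prints ['wonen']
```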

(You can also use the same mechanism when you create your evaluation set btw, to make sure no training examples end up in your evaluation data, and vice versa.)

This definitely looks suspicious. In general, you can expect to see some variance in the scores. By default, the active learning algorithm uses an exponential moving average of the scores to determine what to show you and which examples to skip. So depending on the order of the data that comes in, it can sometimes take a few batches for it to adjust. But if the model is pre-trained, this shouldn’t be very significant.
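For intuition, the selection idea can be sketched roughly like this (a heavy simplification, not Prodigy's actual sorter implementation):

```python
def prefer_uncertain(scored_stream, alpha=0.1):
    """Yield examples whose uncertainty beats an exponential moving
    average of the uncertainty seen so far."""
    avg = 0.5  # running average of recent uncertainty
    for score, example in scored_stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2  # 1.0 at 0.5, 0.0 at 0 or 1
        if uncertainty >= avg:
            yield example
        avg = (1 - alpha) * avg + alpha * uncertainty

scored = [(0.5, "a"), (0.04, "b"), (0.6, "c")]
print(list(prefer_uncertain(scored)))  # the confident 0.04 example is skipped
```

Because the threshold is a moving average, the order of the incoming scores matters, which is why the first few batches can look off before the sorter adjusts.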

Could you post the exact textcat.teach command you ran? And what does your model produce when you just load it into spaCy directly, for example:

nlp = spacy.load('/path/to/your-model')
doc = nlp(u"Wietkwekerij opgerold in Kruidenwijk...")  # etc.
print(doc.cats)

You can also set the environment variable PRODIGY_LOGGING=basic when you run textcat.teach to see what’s going on behind the scenes and what Prodigy is doing.

Hey Ines,

Thanks again for the super quick reply!

$ cat test_model.py
import spacy
nlp = spacy.load('/home/nel/Dropbox/svdj_instrument/initial_model_topics')
doc = nlp("Wietkwekerij opgerold in Kruidenwijk\n\nOp de Melkdistelstraat in de Almeerse Kruidenwijk is woensdagochtend een wietplantage opgerold. Dat schrijft een wijkagent op Facebook.\u00a0\u00a0 In totaal stonden er 168 planten in het huis. Alle spullen voor de plantage worden geruimd en de planten worden vernietigd.\u00a0 Er zijn twee mensen aangehouden.\u00a0")
print(doc.cats)
$ ~/prodigy_env/bin/python test_model.py 
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
[.. 3 more similar warnings ..] 
{'wonen': 0.04067404568195343, 'economie': 0.004906974732875824, 'onderwijs': 0.04995676875114441, 'criminaliteit': 0.051752906292676926, 'uitgaan': 0.0033709031995385885, 'milieu': 0.057528816163539886, 'campagne': 0.05357237905263901, 'democratie': 0.008149554952979088, 'cultuur': 0.001362256589345634, 'verkeer': 0.05843031033873558, 'zorg': 0.22781996428966522, 'integratie': 0.012870367616415024}

So that seems ok-ish, right? Although all scores are pretty low, maybe that’s a problem? Should these scores be the same as the ones in the db-out?

Command used (modified with the logging as requested)

$ PRODIGY_LOGGING=basic ./prodigy textcat.teach topics_train initial_model_topics data/topics_train.jsonl --label criminaliteit
[.. snipped binary incompatibility warnings ..]
16:44:42 - RECIPE: Calling recipe 'textcat.teach'
Using 1 labels: criminaliteit
16:44:42 - RECIPE: Starting recipe textcat.teach
16:44:42 - DB: Initialising database SQLite
16:44:42 - DB: Connecting to database SQLite
16:44:43 - RECIPE: Creating TextClassifier with model initial_model_topics
16:44:43 - LOADER: Using file extension 'jsonl' to find loader
16:44:43 - LOADER: Loading stream from jsonl
16:44:43 - LOADER: Rehashing stream
16:44:43 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
16:44:43 - CONTROLLER: Initialising from recipe
16:44:43 - VALIDATE: Creating validator for view ID 'classification'
16:44:43 - DB: Loading dataset 'topics_train' (1013 examples)
16:44:43 - DB: Creating dataset '2018-08-11_16-44-43'
16:44:43 - CONTROLLER: Validating the first batch
16:44:43 - CONTROLLER: Iterating over stream
16:44:43 - FILTER: Filtering duplicates from stream
16:44:43 - FILTER: Filtering out empty examples for key 'text'

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

16:45:02 - GET: /project
16:45:02 - GET: /get_questions
16:45:02 - CONTROLLER: Returning a batch of tasks from the queue
16:45:02 - RESPONSE: /get_questions (10 examples)

So if you ‘teach’ in multiple sessions, you should exclude the dataset itself to prevent being asked the same question twice? No problem, but it feels strange that this is not the default behaviour (but there’s probably a good reason…)

Thanks again!

– Wouter

Edit:

Okay, so this confirms that what you’re seeing in Prodigy is consistent with the model. When you load in the data, Prodigy will use your model to score the examples – for NER, this is a little more complex, since there are so many options. But for text classification, all we need to do is check the doc.cats for the respective label. That score is the same value displayed with the annotation task.
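Schematically, the textcat scoring step boils down to something like this (`predict_cats` stands in for `nlp(text).cats`; the helper itself is hypothetical):

```python
def score_stream(texts, label, predict_cats):
    """Attach the model's score for one label to each task (simplified)."""
    for text in texts:
        score = predict_cats(text).get(label, 0.0)
        yield {"text": text, "label": label, "score": score, "priority": score}

# Stub model: always predicts the same category distribution.
stub = lambda text: {"wonen": 0.04, "zorg": 0.22}
tasks = list(score_stream(["Wietkwekerij opgerold"], "zorg", stub))
print(tasks[0]["score"])  # prints 0.22
```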

So in this example, the model predicts 0.22 for zorg and around 0.06 and lower for everything else… which seems a bit strange? But it really depends on the training data. Maybe your model just hasn’t seen many similar texts? Or maybe you had examples of medical cannabis (zorg = care as in health care, right?) but none of illegal weed plantations? Or maybe something did go wrong, and the predictions make no sense at all.

But this definitely explains what you’re experiencing in textcat.teach: all predictions are low, so Prodigy starts by suggesting something, to see where it leads. Have you tried annotating a few batches (like 20-30 examples)? Do you notice any changes in the scores? Maybe it was just a few examples, and the scores will adjust after a few more updates. Maybe not, and in that case, the solution might be in the model training and architecture (as discussed in the other thread).

Yeah, we went back and forth on that decision and it wasn’t an easy one to make. I definitely see your point. In the end, we went with the more conceptual view that it’d be dangerous for Prodigy to make those kinds of assumptions quietly and behind the scenes. Even at the moment, a duplicate question is actually kind of difficult to define outside of the active learning-powered recipes with binary feedback. There’s a related discussion in this thread where I talk about some of the problems and potential solutions for manual annotation recipes like ner.manual.

Hey Ines,

[sorry about the thread confusion, I thought the other thread was dead, but then Matthew replied]

Yes, zorg is health care, and I guess models can sometimes mess up :). It did have decent accuracy when taking the largest class for each document, but I didn’t record what the winning score for each doc was.

We actually coded 500 articles each for economy and crime, but if I try to batch-train (so I could see if the scores are better for an example text) it gives an error, presumably because the model has 12 classes but we only coded 2:

$ ./prodigy textcat.batch-train topics_train initial_model_topics
Loaded model initial_model_topics
Using 20% of examples (202) for evaluation
Using 100% of remaining examples (811) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  

#          LOSS       F-SCORE    ACCURACY  
Traceback (most recent call last):                                                                    
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nel/prodigy_env/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/nel/prodigy_env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/nel/prodigy_env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/nel/prodigy_env/lib/python3.6/site-packages/prodigy/recipes/textcat.py", line 142, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 174, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 192, in prodigy.models.textcat.TextClassifier._update
  File "pipeline.pyx", line 877, in spacy.pipeline.TextCategorizer.update
  File "pipeline.pyx", line 894, in spacy.pipeline.TextCategorizer.get_loss
ValueError: operands could not be broadcast together with shapes (10,12) (10,3) 

Should we code a small amount of texts (e.g. 50) for each class, run batch-train, and have a look at the resulting numbers?

Edit 2: We started coding for some other labels, but the documents we get are identical for each label we tried.

I also find it really suspicious that in the db-out given above the score/priority is 0.0406740457 for every label. That is the spaCy model’s score for ‘wonen’ on the “wietkwekerij” example, but the model gives different scores for the other topics. It should be .22 for ‘zorg’, for example, yet the db-out also reports .04:

[screenshot]

Can there be some kind of mismatch between the spacy model labels and the prodigy labels?

Please let me know if you need any more information to look at this. I can post a link to the initial_model and email a link to the db-out output, if desired.

Regarding that error: It seems to me you’re starting with a pre-trained text classification model there and then trying to continue training, but with a different number of labels. I think that’s why it’s not working. You shouldn’t really need to resume training the text-classification model like that. It’s better to start from random initialisation each time, as long as you have all the training data available. The only time you really want to think about resuming training is when you’ve got a pre-trained model but not the source data, as happens with NER, the tagger, the dependency parser, etc.
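A quick way to catch this kind of mismatch up front might be a sanity check like the following (a hypothetical helper, not part of Prodigy; the label lists are shortened for the example):

```python
def check_label_compatibility(model_labels, annotation_labels):
    """Resuming training only makes sense if the annotation labels
    are drawn from the label set the model was trained with."""
    missing = set(annotation_labels) - set(model_labels)
    if missing:
        raise ValueError(
            "Annotations use labels unknown to the model: %s" % sorted(missing))
    if set(annotation_labels) != set(model_labels):
        print("Warning: annotations only cover a subset of the model's labels")

model_labels = ["wonen", "economie", "criminaliteit"]  # 12 in the real model
check_label_compatibility(model_labels, ["economie", "criminaliteit"])  # warns
```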

I don’t understand this. Why should the score be 0.22?

I think that seems unlikely – otherwise everything would be much more broken!

One quick addition to @honnibal’s comment above: How many examples did you annotate before you got to the zorg example? If you annotated more than one batch, it’s theoretically possible that the score changed as the model in the loop is updated.

[reply deleted - I think I’m starting to understand my confusion, let me figure some things out and I’ll get back]

I still can’t get it to work :frowning:

First, quick answer to Matthew’s questions:

  1. I am continuing training of the model with the same categories, but we only coded 2 labels so far because we wanted to test improvement first. So, the extra annotations used a subset of the original annotations. We did it from a pre-trained spacy model because that’s what Ines suggested earlier [Help needed to get started with text classification, answer to question #2]

  2. I expected zorg to get a score of .22 because that’s what the spacy model predicts for that text (see above, my reply to Ines’ first question)

Please let me ask a clarification question:

I expected/assumed textcat.teach to work this way:

  1. Load the specified model
  2. Load all existing annotations in the dataset
  3. Update the model with existing annotations
  4. Select N most uncertain [preferably unseen] examples to annotate from training data
  5. User annotates examples
  6. Update model and repeat from step 4

However, I don’t see steps 2 and 3 happening in the recipe, and it also seems too quick: it takes a couple of minutes to train a model on 500 examples with textcat.batch-train (10 iterations), but textcat.teach starts within a minute. That’s presumably enough time to apply the model to the unannotated training data, but not to actually update/train the model.

So, it seems that textcat.teach actually only does this:

  1. Load the specified model
  2. Select N most uncertain [preferably unseen] examples to annotate from training data
  3. User annotates examples
  4. Update model and repeat from step 2

Q1) Is this correct?

Q2) If this is correct, that means that to continue training after interrupting a session, I would need to batch-train the model on the earlier annotations and use the result to continue, right? Otherwise, my earlier annotations will not be incorporated in the active learning.

Edit: When I first read Matthew’s response, I thought his advice was to copy the annotations into the dataset before starting textcat.teach, which would suggest that it does take pre-existing annotations into account. When I do this, I do indeed get different examples to code for different labels. However, they don’t seem very relevant: I would expect about a 50% acceptance rate for a decent model, but the actual acceptance rate even for an “easy” topic like crime (with very characteristic words) seems close to random. So, another possibility is that it does train a model, but only with a single iteration, which in batch-train produced a really low F-score [probably because 90% of decisions are reject, so it mostly learns reject as a baseline answer?]. That would explain both the speed and the seeming randomness. So:

Q3) If Q1 is incorrect, is there a way to specify training parameters (at least #iterations) for the initial model training based on existing annotations in the dataset (step 3)?

And finally, I still think something fishy is going on with the active learning.

I created a new dataset and used db-in to load all existing (pre-Prodigy) annotations (each coding yielding one positive example and N-1 negative examples). I used batch-train to create a model, which reports an F-score around .6 and accuracy above .9, which I guess is somewhat decent for 12 possible labels, some of which have really low counts.
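For reference, the conversion I used can be sketched like this (label list shortened; the real set has 12 labels):

```python
ALL_LABELS = ["wonen", "economie", "criminaliteit", "zorg"]  # shortened from 12

def coding_to_examples(text, true_label, labels=ALL_LABELS):
    """One human coding becomes one accept and N-1 reject annotations."""
    return [
        {"text": text, "label": label,
         "answer": "accept" if label == true_label else "reject"}
        for label in labels
    ]

examples = coding_to_examples("Wietkwekerij opgerold", "criminaliteit")
print(sum(e["answer"] == "accept" for e in examples))  # prints 1
```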

If I run the created model on the example above, it correctly predicts crime as most likely class:

$ cat test_model2.py
import spacy
nlp = spacy.load('/tmp/model_met')
doc = nlp("Wietkwekerij opgerold in Kruidenwijk\n\nOp de Melkdistelstraat in de Almeerse Kruidenwijk is woensdagochtend een wietplantage opgerold. Dat schrijft een wijkagent op Facebook.\u00a0\u00a0 In totaal stonden er 168 planten in het huis. Alle spullen voor de plantage worden geruimd en de planten worden vernietigd.\u00a0 Er zijn twee mensen aangehouden.\u00a0")
print(doc.cats)
$ ~/prodigy_env/bin/python test_model2.py 
{'campagne': 0.5013799667358398, 'criminaliteit': 0.8852517008781433, 'cultuur': 0.05474354326725006, 'democratie': 0.05210911110043526, 'economie': 0.044103484600782394, 'integratie': 0.08611290901899338, 'milieu': 0.21252955496311188, 'onderwijs': 0.06845356523990631, 'uitgaan': 0.03143471106886864, 'verkeer': 0.03409619256854057, 'wonen': 0.16229450702667236, 'zorg': 0.09277740120887756}

Then, I started a new dataset again and did textcat.teach with two different labels:

./prodigy dataset empty1
./prodigy textcat.teach empty1 /tmp/model_met data/topics_train.jsonl --label criminaliteit 
./prodigy dataset empty2
./prodigy textcat.teach empty2 /tmp/model_met data/topics_train.jsonl --label economie

In both cases, the same example article is given as the first choice, with a score of 0.5.

So, regardless of the predictions of the model used, it serves the same first article with the same score.

This leaves me confused: if the active learning doesn’t start from the annotations in the current dataset, and I can’t continue training from a previous model, how can I either use pre-existing annotations or continue the active learning process after quitting?

I first need to catch up on the replies in this thread, but a quick answer to this one: In the active-learning powered recipes, the model you train in the loop is usually discarded – it’s always better to train a new model from scratch with multiple iterations etc. So you can always train a temporary model from the existing annotations and update it in the loop to find better examples to annotate than you would if you just labelled everything in your set. When you quit annotating, you throw away the temporary model and train a new one from scratch, using all annotations you collected.

The main advantage of training with a model in the loop is collecting better training data quicker. It’s totally possible that you’re working on a problem where this is less relevant and you want to label everything from scratch as it comes in.

So if I want to do textcat.teach on a dataset with existing annotations, what I should really do is first do a textcat.batch-train to create a new model, and then use that new model as the model parameter for textcat.teach (and presumably also --exclude the existing annotations)?

Yes, that sounds like a good strategy. And it might help to think of the model you’re using in textcat.teach as a temporary, throwaway model that mostly has one purpose: helping you collect better annotations by selecting better questions.

Right. But that brings me to my last reply above (the “something fishy” one): a model trained from annotations doesn’t actually seem to be used in the active learning process (regardless of --label, the same document is given with the same .5 score, while the model’s predictions for that document are different…)