Train one label on a model that has two entities

Hi,

I have trained the de_core_news_sm model with two entity labels (ORG + MONEY), and I am storing all my annotations for that in one dataset (“ds_1”).

What is the best way to improve only one label? And does it matter if I do it manually (i.e. with ner.manual)?

Assuming I have a sentence with an ORG and a MONEY entity like “CompanyA donates 10.000 Euro”: if I now train only the MONEY entity, how can I ensure that the model knows that this sentence also contains an ORG?

My idea was:

  1. ner.teach ds_1 --label MONEY OR ner.manual ds_1 --label MONEY
  2. ner.teach ds_1 --label ORG
  3. ner.batch-train ds_1 --label ORG,MONEY
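Written out as full commands, I assume these steps would look roughly like this (news.jsonl is just a placeholder for my input file and /tmp/model for the output directory; step 1 with ner.manual would take the same arguments):

prodigy ner.teach ds_1 de_core_news_sm news.jsonl --label MONEY
prodigy ner.teach ds_1 de_core_news_sm news.jsonl --label ORG
prodigy ner.batch-train ds_1 de_core_news_sm --output /tmp/model --label ORG,MONEY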

What would be an alternative / better approach?

Thanks!

When you run ner.batch-train with the default configuration, all unannotated tokens will be considered missing values (instead of non-entities). So when the model is updated with your sentence, you'll only give definitive feedback about the tokens "10.000 Euro" (MONEY) – but not about any of the other tokens we don't have information about. If you've been using a binary accept/reject workflow, maybe your data also contains a "reject" annotation for the token "CompanyA". During training, all annotations on the same text will be combined and the model will be updated with the best analysis that's compatible with the annotations. So in that case, the feedback would be: "10.000 Euro is definitely MONEY, CompanyA is definitely not MONEY, but could be something else."
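As a simplified sketch (not actual Prodigy output – the field names follow Prodigy's JSONL format, and the character offsets are just for illustration), the two binary decisions for that sentence might end up stored as records like these, which only carry information about the MONEY label:

{"text": "CompanyA donates 10.000 Euro", "spans": [{"start": 17, "end": 28, "label": "MONEY"}], "answer": "accept"}
{"text": "CompanyA donates 10.000 Euro", "spans": [{"start": 0, "end": 8, "label": "MONEY"}], "answer": "reject"}

Everything outside those spans stays a missing value when the model is updated.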

If you're interested in how updating with incomplete information works, you can see some more examples in my slides here.

If you don't want unannotated tokens to be interpreted as missing values and instead want to treat them as non-entities, you can set the --no-missing flag during training. But this really only makes sense if your annotations are "gold-standard", i.e. if everything in that sentence is labelled.
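For example, assuming you're training from your existing dataset and writing the updated model to a directory of your choice, that could look like this:

prodigy ner.batch-train ds_1 de_core_news_sm --output /tmp/model --no-missing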

Not really, no. The only potential problems are that a) you might end up with only positive examples (accepts) and no negative examples (rejects), which might impact your evaluation and doesn't combine very well with other binary accept/reject annotations, and b) you might make your life harder than it should be, because manual labelling is more work overall.

If you're labelling manually, you might as well consider creating gold-standard data that contains all entities that are present in the data. This will let you update with --no-missing and potentially get better results. You could use the ner.make-gold recipe with your pre-trained model and only correct its predictions, which might be faster than doing everything from scratch. You could also use a workflow like this silver-to-gold recipe to convert your binary annotations to gold-standard annotations.
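For example, a correction session could look something like this – news.jsonl stands in for your own source file, ds_gold for a new dataset to hold the gold-standard annotations, and /tmp/model for the model you've already trained:

prodigy ner.make-gold ds_gold /tmp/model news.jsonl --label ORG,MONEY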

Thanks for your reply!

However, I am not sure I understood fully.

Regarding this quote: did I understand the process correctly?

a. When I run ner.teach for the MONEY entity, the model knows that "10.000 Euro" is a MONEY entity.
b. When I then run a second ner.teach session, this time for the ORG entity, the model will/might ask me about that sentence again, and I would be able to tag "CompanyA" as ORG.

And regarding this:

c. If I label two entities for the same sentence as in this example, would I have this sentence twice in the DB, or will the annotations get automatically combined (like below) when they are processed during ner.batch-train?

For instance, when I look through my dataset (using ), I have found this:

Would the output of the steps a + b from above look something like this?

Thanks a lot for your help!

Yes, Prodigy will use the hashes it sets on the examples to determine whether two annotations were made on the same input (e.g. the text). When you run ner.batch-train, those examples will then be combined. So for example, let's say you've annotated both ORG and MONEY separately, and you then train. The following sentence might be in there twice, and it will be combined into an NER representation that looks something like this:

["CompanyA", "donates", "10000", "Euro"]
["U-ORG", "?", "B-MONEY", "L-MONEY"]

(U = unit, i.e. single-token entity, B = beginning, L = last, ? = unknown). With more complex labels and accept/reject decisions, the whole set of options could look something like this. As you annotate, you're building up a set of constraints for the individual tokens. For instance, Prodigy might ask you if "donates" is an ORG, and you'd say no. We still don't know whether "donates" is something else, but we now know that it isn't an ORG and can still update the model accordingly.
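For example, if Prodigy asked about "donates" as an ORG and you rejected it, the picture for that token would tighten from "completely unknown" to "not ORG, but possibly something else" – roughly:

["CompanyA", "donates", "10000", "Euro"]
["U-ORG", "? (not ORG)", "B-MONEY", "L-MONEY"]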

Yes, pretty much. Although I wouldn't say that the model "knows" it – rather, you're updating the model with that information, and when you train, you're trying to teach it to generalise based on that example. ner.teach will try to ask you about the entities the model is most uncertain about – so you're not necessarily seeing the examples with the highest confidence, but the examples where your decision will likely make the biggest difference.

If you're still at the beginning and you feel like it's important to get enough positive examples in, you might try starting with a manual annotation session and actually label the correct examples first. If your model already predicts something, you can also use ner.make-gold, which lets you correct the model's most confident predictions by hand.
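For example, a manual session to collect complete examples up front might look like this, again with news.jsonl standing in for your own source file:

prodigy ner.manual ds_1 de_core_news_sm news.jsonl --label ORG,MONEY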

Thank you very much - that clarifies a lot for me!