Multi-phrased labels for ner.teach

Hi,

I am trying to teach the model to pick out locations from data. For example,

“Oil discovery at the Njord field in the Norwegian Sea on 2nd March,1995.”

The location that I want the model to learn is “Njord field in the Norwegian Sea”. spaCy has an excellent LOC entity recognition model that gives me the “Norwegian Sea” part of the location.

So I came up with a JSONL file with such locations. For example,
{"label": "LOCATION", "pattern": [{"LOWER": "Njord"}, {"LOWER": "field"}, {"LOWER": "in"}, {"LOWER": "the"}, {"LOWER": "Norwegian"}, {"LOWER": "Sea"}]}

The problem right now is that Prodigy is not able to pick out the entire phrase while using ner.teach. Is the JSONL representation incomplete or incorrect?

Can you advise on how to proceed in such a situation?

Thank you so much.

Jashmi

Hi Jashmi,

If you write {"LOWER": "Njord"}, the matcher is going to check token.lower_ == "Njord" – which will never be true! Try changing it to {"LOWER": "njord"}.
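Matthew's point can be checked without spaCy at all: token.lower_ is always lowercase, so comparing it to a value containing an uppercase letter can never succeed. A minimal plain-Python sketch of the Matcher's LOWER comparison (the helper name is made up for illustration):

```python
def lower_attr_matches(token_text, pattern_value):
    # Mirrors what the Matcher does for {"LOWER": ...}: it compares
    # the token's lowercased text against the pattern value verbatim.
    return token_text.lower() == pattern_value

print(lower_attr_matches("Njord", "Njord"))  # False: "njord" != "Njord"
print(lower_attr_matches("Njord", "njord"))  # True
```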

Btw, you might want to think carefully about your entity definitions. Even if the phrase you want to recover is "Njord field in the Norwegian Sea", it may or may not be a good idea to set that as your annotation (and model) objective.

If you label the whole “X in Y” phrase as an entity, you’re going to create a lot of difficult ambiguities for the model. Remember that the model is mostly thinking word-by-word, with a smallish window (3 or 4 words) around it as context. At each word, it has to decide on a tag.

The problem with a structure like “X in Y” is that it’s nested, not flat — so there’s not really any good limit on the length. The disambiguating words towards the end of the phrase might occur very far from the decisive words at the start.

So, it’s possible you’ll be best off tagging this as actually two locations, and then having rules which extract LOC in LOC as larger entities if that’s what your later processing expects. The rule of thumb is that the dependency parser is specialised for building trees, while the entity recogniser is specialised for tagging flat phrases.
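The “LOC in LOC” rule described above can be sketched in plain Python. This is a toy version working on token offsets rather than spaCy objects; the function name and the (start, end, label) tuple format are assumptions for illustration:

```python
def merge_loc_in_loc(tokens, ents):
    """Merge adjacent LOC entities separated by "in" (or "in the")
    into one larger LOC span. ents is a sorted list of
    (start, end, label) token-offset tuples."""
    merged = []
    i = 0
    while i < len(ents):
        s1, e1, l1 = ents[i]
        if l1 == "LOC" and i + 1 < len(ents):
            s2, e2, l2 = ents[i + 1]
            gap = [t.lower() for t in tokens[e1:s2]]
            if l2 == "LOC" and gap in (["in"], ["in", "the"]):
                merged.append((s1, e2, "LOC"))
                i += 2
                continue
        merged.append((s1, e1, l1))
        i += 1
    return merged

tokens = "Oil discovery at the Njord field in the Norwegian Sea".split()
# Suppose the model tagged "Njord field" and "Norwegian Sea" as two LOCs:
ents = [(4, 6, "LOC"), (8, 10, "LOC")]
print(merge_loc_in_loc(tokens, ents))  # [(4, 10, 'LOC')]
```

In a real pipeline, the same logic would run over doc.ents and the merged spans would replace the originals.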

All that said, the best approach is ultimately empirical — it might be better to tag the whole entity as a LOC. It does depend on the data.

Thank you, Matthew! I used your LOC in LOC idea to extract larger location entities and it works like a charm!

I was wondering whether this is possible when the phrase doesn’t have a nested structure. What about cases where the entities are simply long? For example, I am trying to extract operator names from a document, and each operator name has four words on average, such as: Talisman Energy Norge AS, RWE Dea Norge AS, etc.

I used the following patterns in the JSONL file:
{"label": "OPERATOR", "pattern": [{"LOWER": "talisman"}, {"LOWER": "energy"}, {"LOWER": "norge"}, {"LOWER": "as"}]}
{"label": "OPERATOR", "pattern": [{"LOWER": "rwe"}, {"LOWER": "dea"}, {"LOWER": "norge"}, {"LOWER": "as"}]}
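As a side note, per-name patterns like these can be generated from a list of operator names instead of being written by hand, which also avoids case typos. A small sketch (the helper name is hypothetical):

```python
import json

def name_to_pattern(name, label="OPERATOR"):
    # Lowercase every token so the pattern matches regardless of
    # how the name is capitalised in the text.
    return {"label": label,
            "pattern": [{"LOWER": tok.lower()} for tok in name.split()]}

operators = ["Talisman Energy Norge AS", "RWE Dea Norge AS"]
for name in operators:
    print(json.dumps(name_to_pattern(name)))
```

Writing one JSON object per line produces exactly the JSONL format shown above.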

Is there a way to avoid ambiguity while training a model using such long phrases?

Thank you so much! :slight_smile:

Jashmi

Hi Matthew and Jashmi,

When I run ner.teach, Prodigy attempts to serve examples which the model is most uncertain about. This is excellent when it’s applied to just a few labels.

However, my model currently has a total of 12 labels. In the scenario that the model labels ALL entities in an example correctly, I’d click accept. If all of the labels are wrong, I’d click reject.

Now, if my model’s labels are only partially correct, should I accept or reject the example? And if I do reject it, I’m concerned that the model won’t be able to pinpoint exactly which label is wrong, since I’m using multiple labels. I don’t want my model’s accuracy to suffer because of this.

Just to give some context on what I’ve done thus far. I’ve been following your suggestion on this link: Understanding ner.batch-train stats and best practices

I’ve gathered roughly 10,000 examples from ner.manual and ner.make-gold. Almost all of these examples were ‘accepted’. I’d like to proceed with ner.teach soon to introduce ‘rejected’ examples into my model, but due to the problem mentioned above, I am unsure if ner.teach will be beneficial for my model.

Could you kindly suggest what would be the best way forward? Would love to hear from you soon. Thanks!