annotating special terms

Hi,
I have special scientific data to annotate using Prodigy. Unlike other data, the terms to annotate are a mix of letters and numbers and can appear in different written formats.

e.g., word word "Ex0091" word word, or word word "Ex 0091" word word, or word word "Ex0091", "Ex018" word word, or word word "Ex 0091 and 018" word word, etc.

What is the right way to annotate so the model can be trained effectively?

hi @fsa!

Thanks for your question! This is an excellent question so my colleague @Jette16 and I came up with a few best practices to help you and others who may have similar questions.

How to define entity boundaries

We're going to borrow this post that provides great guidelines on defining entity boundaries for a similar number/text mix (dates/times): Ambiguous NER annotation decisions - #2 by ines

it ultimately comes down to this: How do you want your application to perform? If you need to extract times and dates in a lot of different formats and then analyse and parse them, you probably want the model to only learn the exact spans. “hours” on its own is pretty useless. But if you mostly care about whether a text is about hours as opposed to minutes or seconds, regardless of the exact time span, teaching your model to ignore the numbers could also make sense.

We also like this advice:

So when you come across an ambiguous example like this, a better way to think about it would be to ask yourself: “If my model produced this result, would I be happy about it and would it benefit the rest of my application?”

Choosing NER vs spancat

Since your entities have clearly defined boundaries, you'd likely want to use the NER recipes rather than span categorization. This documentation outlines the differences between ner and spancat.

Which Recipes to Use

The easiest way to annotate your special terms would be to use the ner.manual recipe. It also seems that your terms could be found easily using patterns. You may find it helpful to create a patterns file so that Prodigy can pre-annotate (highlight) matched spans in advance. Your task then becomes accepting or correcting the rule-based spans, which is easier than labeling everything from scratch.

You can store these patterns in a JSONL file. Based on the example you gave, the patterns file could look like this:

{"label": "LABEL", "pattern": [{"TEXT": {"REGEX": "Ex[0-9]+"}}]}
{"label": "LABEL", "pattern": [{"LOWER": "ex"}, {"TEXT": {"REGEX": "[0-9]+"}}, {"LOWER": "and", "OP": "?"}, {"TEXT": {"REGEX": "[0-9]+"}, "OP": "?"}]}
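If you'd rather generate the file from a script than write it by hand, here's a minimal sketch using only Python's standard library. The file name patterns.jsonl and the label LABEL are the placeholders from the example above, and write_patterns is a hypothetical helper name, not part of Prodigy:

```python
import json
from pathlib import Path

# The same two patterns as in the example above, expressed as Python dicts.
PATTERNS = [
    {"label": "LABEL", "pattern": [{"TEXT": {"REGEX": "Ex[0-9]+"}}]},
    {"label": "LABEL", "pattern": [
        {"LOWER": "ex"},
        {"TEXT": {"REGEX": "[0-9]+"}},
        {"LOWER": "and", "OP": "?"},
        {"TEXT": {"REGEX": "[0-9]+"}, "OP": "?"},
    ]},
]


def write_patterns(path, patterns=PATTERNS):
    """Write one JSON object per line (JSONL) and return the path."""
    path = Path(path)
    lines = [json.dumps(pattern) for pattern in patterns]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    write_patterns("patterns.jsonl")
```

Keeping the patterns in a script makes it easier to add new written variants later without worrying about JSON quoting.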

Using the following command, you can then start annotating your data:

prodigy ner.manual dataset-name spacy-model ./text-file --label LABEL --patterns ./patterns.jsonl

Here, spacy-model is the name of the spaCy model you want to use (e.g. blank:en), and ./text-file and ./patterns.jsonl are the paths to your text file and to the JSONL file that contains the patterns.
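Before you start annotating, you may want to sanity-check the patterns on a few example sentences. Here's a rough sketch that uses spaCy's Matcher directly with a blank English pipeline. It assumes spaCy v3 is installed, find_terms is a hypothetical helper name, and the result only approximates what Prodigy will highlight in the UI:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank English pipeline, i.e. the same tokenizer as blank:en.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("LABEL", [
    # One token such as "Ex0091"
    [{"TEXT": {"REGEX": "Ex[0-9]+"}}],
    # "Ex" + number, optionally followed by "and" + another number
    [
        {"LOWER": "ex"},
        {"TEXT": {"REGEX": "[0-9]+"}},
        {"LOWER": "and", "OP": "?"},
        {"TEXT": {"REGEX": "[0-9]+"}, "OP": "?"},
    ],
])


def find_terms(text):
    """Return the matched spans as strings, keeping the longest of any overlapping matches."""
    doc = nlp(text)
    spans = matcher(doc, as_spans=True)
    return [span.text for span in filter_spans(spans)]


if __name__ == "__main__":
    for example in [
        'word word "Ex0091" word word',
        'word word "Ex 0091 and 018" word word',
    ]:
        print(example, "->", find_terms(example))
```

Because the last two tokens of the second pattern are optional, a match can also end on a trailing "and" with no number after it, so the pre-highlighted spans still need a manual check while annotating.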

If you plan to write your own recipe, you could also use the PatternMatcher with the same patterns as above.

Accept or Reject Partial Suggestions

If you follow a typical NER workflow as outlined in the documentation, after labeling with ner.manual and training an initial model you may then want to use either ner.correct or ner.teach (active learning). If you use these "model-in-the-loop" recipes, make sure to only accept entities that cover the entire entity span (e.g., for the span "Ex 0091 and 018", don't accept a suggestion that only highlights "Ex 0091" and misses "and 018").

There are more details in the NER documentation, but here's the relevant FAQ entry:

Should I accept or reject partially correct suggestions?

If you come across a partially correct suggestion – for instance the entity “Facebook Inc” with only “Facebook” highlighted as a suggested ORG – you should always reject them. The active learning-powered recipes will look at all possible analyses for the parse, so the correct boundaries are likely in there – it might just not be the suggestion you see first. By rejecting incorrect boundaries, you’re essentially telling the model to try again, moving it towards the correct boundaries. Each token can only be part of one entity, so if you accepted a partial match like “Facebook” in “Facebook Inc”, the feedback the model would get from this is “Yes, in contexts like this, ‘Facebook’ is a single-token ORG entity and wins over all other possible analyses containing this token.” That’s obviously not what you want.

We hope this helps you and let us know if you have further questions!

Hi Ryanwesslen,
Thanks a lot for the detailed, helpful explanations!