training of annotated dataset with ner.make-gold

Hello,

After using ner.make-gold, my annotations are as follows:
Dataset sport_terms
Created 2019-08-05 14:11:37
Description Seed terms for SPORT label
Author None
Annotations 5048
Accept 4977
Reject 71
Ignore 0

While annotating manually, I created some "reject" annotations to increase the accuracy of the model.

This might be a really basic question, but I could not understand why "prodigy ner.gold-to-spacy" saved only the "ACCEPT" annotations and not the "REJECT" ones:
:sparkles: Exported 4977 examples

So, does this mean that the REJECT examples make no difference for model training?

Many thanks.

Rejected examples make a difference for the model if you're updating with incomplete annotations, e.g. binary yes/no decisions. They'll help the model "narrow in" on the correct analysis, even if some values are missing. I've shared an illustration of this in my comment here:

If you're using a workflow like ner.make-gold and/or exporting the examples for spaCy later on, the general assumption is that your data is "gold standard". So basically, the annotated entities are all entities there are and all tokens that are not labelled are not part of an entity. That's why the recipe will only use the accepted answers: there's nothing to gain from the rejects, because it assumes that the accepted answers are complete.
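To make the accept/reject distinction concrete, here is a minimal sketch of separating records by their "answer" value. It assumes the annotations were exported as JSONL (one JSON object per line, each with an "answer" key); the sample records and the helper name `split_by_answer` are made up for illustration and are not part of Prodigy.

```python
import json


def split_by_answer(jsonl_lines):
    """Group annotation records by their "answer" value.

    Hypothetical helper: assumes each non-empty line is a JSON object
    with an "answer" key set to "accept", "reject" or "ignore".
    """
    groups = {"accept": [], "reject": [], "ignore": []}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups.setdefault(record.get("answer", "ignore"), []).append(record)
    return groups


# Made-up sample records, shaped like exported annotations.
sample = [
    '{"text": "My favourite sport is aikido", '
    '"spans": [{"start": 22, "end": 28, "label": "SPORT"}], "answer": "accept"}',
    '{"text": "I enjoy sport", '
    '"spans": [{"start": 8, "end": 13, "label": "SPORT"}], "answer": "reject"}',
]

groups = split_by_answer(sample)
gold_examples = groups["accept"]  # what a gold-standard export would keep
print(len(groups["accept"]), "accepted,", len(groups["reject"]), "rejected")
```

Under the gold-standard assumption only `gold_examples` would be used for training; the rejected records carry no extra information, because every unlabelled token in an accepted example already counts as "not an entity".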

Thank you for your prompt reply Ines,

My rejects are basically for the general terms like "sport" or "recreational activity".

That's because I'm only looking for the specific names of sports, like "aikido" or "skiing".

So, do I have to write a recipe for gathering the rejected annotations from the ner.make-gold then?

Thanks

If your plan is to train with spaCy and no missing values, the accepted gold-standard examples should already cover what you're looking for. For example, take the following tokens and imagine you've labelled "aikido" as SPORT. The token-based entity tags would look like this:

["My", "favourite", "sport", "is", "aikido"]
["O", "O", "O", "O", "U-SPORT"]

"aikido" is a unit (single token) entity SPORT. All other tokens, including "sport" are outside an entity (O).
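The tagging scheme above can be sketched in a few lines of plain Python. This is only an illustration: it splits on whitespace and assumes the entity offsets line up with those token boundaries, whereas spaCy uses its own tokenizer and ships its own conversion helper. The function name `biluo_tags` is made up for this example.

```python
def biluo_tags(text, entities):
    """Return (tokens, tags) using the B/I/L/U/O scheme.

    Illustrative only: tokens are whitespace-separated, and `entities`
    is a list of (start_char, end_char, label) tuples that are assumed
    to align with token boundaries.
    """
    tokens, offsets = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end

    tags = ["O"] * len(tokens)  # everything is outside an entity by default
    for ent_start, ent_end, label in entities:
        inside = [
            i for i, (s, e) in enumerate(offsets)
            if s >= ent_start and e <= ent_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label      # unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label      # beginning
            for i in inside[1:-1]:
                tags[i] = "I-" + label          # inside
            tags[inside[-1]] = "L-" + label     # last
    return tokens, tags


tokens, tags = biluo_tags("My favourite sport is aikido", [(22, 28, "SPORT")])
multi_tokens, multi_tags = biluo_tags("I love scuba diving", [(7, 19, "SPORT")])
print(tags)
print(multi_tags)
```

The second call shows how a multi-word entity is covered: the first token gets a B- tag and the last one an L- tag, so no separate "reject" annotation is needed for the surrounding words.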

Yes my plan is to train with spaCy.

What I've done so far is create a gold dataset to use with ner.teach first, and then I plan to run ner.batch-train.

However, some sport terms can be more than one word, e.g. "scuba diving", so I annotated them manually.

For training I used the command below but got the "ValueError: Invalid pattern: ['Adapted Home', {'entities': []}]"
prodigy ner.teach sport_ner en_core_web_lg ../raw_data.jsonl --label SPORT --patterns gold_sportdataset.jsonl

@ines Couldn't I use the annotations from ner.make-gold directly?
I found your comment below. Should I convert the data into that format in order to use it as input for ner.teach? If the answer is yes, how should I define the span for examples with no entities?

Many thanks

Hi,
I ran the command "prodigy ner.teach sport_ner en_core_web_lg ../raw_data.jsonl --label SPORT --patterns gold_spanonlySPORT.jsonl" but got the error below. This time I used the annotations from ner.make-gold that contain only the spans, but I got the invalid pattern error again.

Using 1 labels: SPORT

Traceback (most recent call last):
File "/Users/../lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/../lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/../lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/Users/../lib/python3.7/site-packages/plac_core.py", line 328, in __call__
cmd, result = parser.consume(arglist)
File "/Users/../lib/python3.7/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/Users/../lib/python3.7/site-packages/prodigy/recipes/ner.py", line 143, in teach
matcher = PatternMatcher(model.nlp).from_disk(patterns)
File "cython_src/prodigy/models/matcher.pyx", line 209, in prodigy.models.matcher.PatternMatcher.from_disk
File "cython_src/prodigy/models/matcher.pyx", line 136, in prodigy.models.matcher.PatternMatcher.add_patterns
File "cython_src/prodigy/models/matcher.pyx", line 60, in prodigy.models.matcher.create_matchers
File "cython_src/prodigy/models/matcher.pyx", line 29, in prodigy.models.matcher.parse_patterns
ValueError: Invalid pattern: ['Ski Equipment', {'entities': [[0, 3, 'SPORT']]}]

Could you please help me understand how I can use the annotations I got from ner.make-gold with ner.teach, and later on with ner.batch-train?

The file gold_spanonlySPORT.jsonl that I passed as patterns looks like this:
["Fishing",{"entities":[[0,7,"SPORT"]]}]
["Football",{"entities":[[0,8,"SPORT"]]}]
["Go-karting",{"entities":[[0,10,"SPORT"]]}]
["Golf",{"entities":[[0,4,"SPORT"]]}]
["Hockey",{"entities":[[0,6,"SPORT"]]}]
["Horse trekking",{"entities":[[0,14,"SPORT"]]}]
["Horse riding",{"entities":[[0,12,"SPORT"]]}]
["Hot air ballooning",{"entities":[[0,18,"SPORT"]]}]
["Jet biking",{"entities":[[0,10,"SPORT"]]}]

I think there's some confusion here about the patterns. The ner.gold-to-spacy command outputs spaCy training data, e.g. texts with the highlighted entities in context. Patterns are abstract examples of entities that you want to highlight in the text. So a pattern might look like either of these:

{"label": "SPORT", "pattern": "Adapted Home"}
{"label": "SPORT", "pattern": [{"lower": "adapted"}, {"lower": "home"}]}

You can find more details on the patterns files in your PRODIGY_README.html. Given your gold_spanonlySPORT.jsonl, it should be pretty easy to convert those terms to patterns – you won't even need the character offsets, just the words. You might also want to check out spaCy's rule-based matching docs to read more about how those match patterns work.
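As a rough sketch of that conversion: the snippet below reads lines in the [text, {"entities": [...]}] format shown earlier in the thread and emits one exact-string pattern per annotated span. The function name and the output file name mentioned afterwards are made up for illustration. Lines with an empty "entities" list simply produce no pattern, so there is no need to define a span for them.

```python
import json


def training_line_to_patterns(line, label="SPORT"):
    """Convert one [text, {"entities": [...]}] line into pattern dicts.

    Hypothetical helper: uses the character offsets only to slice the
    entity text out of the example, then emits an exact-string pattern.
    """
    text, annotations = json.loads(line)
    patterns = []
    for start, end, ent_label in annotations.get("entities", []):
        if ent_label != label:
            continue
        patterns.append({"label": label, "pattern": text[start:end]})
    return patterns


# Lines in the same shape as gold_spanonlySPORT.jsonl from this thread.
training_lines = [
    '["Golf",{"entities":[[0,4,"SPORT"]]}]',
    '["Horse riding",{"entities":[[0,12,"SPORT"]]}]',
    '["Adapted Home",{"entities":[]}]',  # no entities, so no pattern
]

patterns = []
for line in training_lines:
    patterns.extend(training_line_to_patterns(line))

# One JSON object per line, ready to be written to a patterns file.
pattern_lines = [json.dumps(p) for p in patterns]
print("\n".join(pattern_lines))
```

Writing `pattern_lines` to a file (say, a hypothetical sport_patterns.jsonl) gives something that can be passed to --patterns. Exact-string patterns are case-sensitive; for case-insensitive matching, the token-based form with "lower" attributes is an option, but the token list then has to match how spaCy tokenizes the phrase, which may differ from a plain whitespace split for terms like "Go-karting".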