Ambiguous NER annotation decisions


I’m running through my corpus to make annotations, and I’ve seen a few examples like these:

I know from your tutorial that it’s better to just ignore certain cases that are ambiguous. However, I’ve encountered the scenario above a couple of times, and I’ve never gotten an example that captures the right concept (“2 hours” instead of just “hours”). My initial thought was to ignore these cases, but then I moved on to rejecting them (even though they are partly correct). What would be correct?

Also, would it be beneficial to have some way of correcting these errors? My corpus is not especially standardized (dates come in all kinds of formats, e.g. 10/2 - 18, 10.02.18, 10-02-18 etc.), which makes me feel like I’m not getting the most out of this tool without a certain amount of standardization.

Note! I’m new to the NER process. Let me know if there is anything I’m missing.

Thank you!


Thanks, this is a good question! The thing with NER (and most NLP applications actually) is that there’s no “objective truth”. It all depends on your application and the results you want to produce.

spaCy’s English models use the OntoNotes 5 scheme for NER annotations, so if you were following that scheme “hours” on its own would probably not be considered a TIME entity. So you would reject the example. In cases like this, rejecting is actually better than ignoring, because you’re explicitly telling Prodigy “no, this is wrong, try again”. There are only so many possible analyses of the entities and their boundaries, and by explicitly rejecting wrong boundaries, you’re moving the model closer to the correct ones.

However, it ultimately comes down to this: How do you want your application to perform? If you need to extract times and dates in a lot of different formats and then analyse and parse them, you probably want the model to only learn the exact spans. “hours” on its own is pretty useless. But if you mostly care about whether a text is about hours as opposed to minutes or seconds, regardless of the exact time span, teaching your model to ignore the numbers could also make sense.

Similarly, what your application considers an ORG or a PRODUCT doesn’t always need to match the underlying annotation scheme. I actually often find the pre-defined categories and definitions pretty unsatisfying for modern text (for example, is “YouTube” a PRODUCT? A WORK_OF_ART? Maybe it needs its own category PLATFORM?).

So when you come across an ambiguous example like this, a better way to think about it would be to ask yourself: “If my model produced this result, would I be happy about it and would it benefit the rest of my application?” The fact that your corpus is not perfectly standardised is actually a good thing – especially if your application is supposed to handle unpredictable text like user input. It’s also where a custom NER model is most powerful.


Thanks, it makes a lot more sense now! Is there a simple way of correcting annotations from a session?

I’m overwhelmed by the effort that has gone into answering people on the support forum – exquisite quality!


Thanks so much! We are aware that Prodigy introduces a lot of new (and sometimes quite surprising) concepts, and that users might have a lot of questions around the usage and best practices. So we’re trying our best to provide as much information as possible :blush:

Do you mean, change annotations collected in a previous session or dataset? You could export the existing dataset to a file using the db-out command, and then re-annotate it using the mark recipe, which will disable any active learning logic and simply ask you for feedback on the exact examples, in order. You can then store the result in a new dataset:

prodigy db-out my_bad_dataset /tmp  # save dataset to a file
prodigy mark new_dataset /tmp/my_bad_dataset.jsonl  # reannotate exact data

If you’ve added multiple sessions to the same dataset (i.e. started the Prodigy server multiple times), each annotation session will also be stored as a separate session dataset, using the timestamp as its name – for example, 2017-11-21_03-33-39.

The session ID is printed after you exit the server. You can also find all session IDs by running prodigy stats with the flag -ls. A nice way to preview a session and check if it’s the one you’re looking for is to use the ner.print-dataset recipe (which gives you pretty output like this):

prodigy stats -ls  # show stats and list all datasets and session names
# pretty-print the session dataset to preview it (use -r flag to preserve nice colors)
prodigy ner.print-dataset "2017-11-21_03-33-39" | less -r

If you only want to annotate specific labels or examples, you might have to pre-process the exported file to only re-annotate parts of it, and then merge it all back together (there’s also a db-in command that lets you import files to a dataset).

The nice thing about JSONL is that it can be read in line by line, and is generally very easy to work with in Python. So you can write your own functions and scripts to structure your workflow however you like. (This is also part of the Prodigy philosophy btw – instead of giving you a parallel language and complex, arbitrary configuration API that you need to remember, Prodigy covers the basics and lets you plug in your own code and Python functions as custom recipes. Matt’s comment on this thread has some more details on this.)
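To make this concrete, here’s a minimal sketch of the kind of pre-processing script mentioned above – it reads an exported JSONL file line by line and keeps only the examples that contain a span with a given label, ready to be re-annotated or imported back with db-in. The file names and the `filter_examples` helper are just illustrative:

```python
import json

def filter_examples(in_path, out_path, label="PERSON"):
    """Keep only examples that contain at least one span with the given label."""
    with open(in_path, encoding="utf8") as f_in, \
         open(out_path, "w", encoding="utf8") as f_out:
        for line in f_in:
            example = json.loads(line)
            spans = example.get("spans") or []
            if any(span.get("label") == label for span in spans):
                f_out.write(json.dumps(example) + "\n")
```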


Sorry, my initial question was too vague. Let me clear things up! Sometimes when I annotate, I get into situations where the “phrase” is not contained by the label. This results in a mismatch between what I want to capture and what the active learning component is suggesting.

For instance, instead of preferred suggestion:
… [ Bill Jerome Holmes PERSON ] …

I get:
… [ Bill Jerome PERSON ] Holmes …

Now, I’ve tried to fix this by rejecting these suggestions and hoping that it will figure it out by itself. However, this does not seem to be a good strategy (after 1000 annotations on a batch-trained model, and then rerunning ner.teach).

That’s why I am now thinking of manually editing the start and end indices of the annotations so that they contain the whole phrase/noun. Is there any tool for this?

Thank you!

Ah okay, sorry! The ner.teach recipe works especially well if you’re looking to correct the entity predictions more generally – i.e. with the goal of having your application make fewer errors overall.

If you want to make more passes over the data and suggest analyses “until Prodigy gets it right”, check out the ner.make-gold recipe (see here for details). The recipe helps you create progressively more correct, gold-standard annotations by looping over the data, and suggesting different analyses based on the constraints defined by your previous annotations.

So, in an ideal case, the sequence would look something like this, expressed in the BILUO scheme:

  • "Bill Jerome Holmes", (B-PERSON, L-PERSON, O) → REJECT
    (model: “Damn, could have sworn this was a person!”)
  • "Bill Jerome Holmes", (U-PERSON, O, O) → REJECT
    (model: “Okay, fair enough… how about this?”)
  • "Bill Jerome Holmes", (B-PERSON, I-PERSON, L-PERSON) → ACCEPT :tada:

Btw, in case you haven’t seen it yet, a good way to find out how your model is performing is to use the ner.eval or ner.eval-ab recipes. The examples you see during ner.teach are not always representative, because Prodigy tries to prioritise the ones it’s most unsure about, plus the ones that stand out, based on the already collected annotations. (This means it may skip examples with very confident predictions, especially those confirmed by previous annotations).

The binary interface is pretty important to the Prodigy experience and workflow, which is why there’s no feature to manually create entity spans and boundaries (for example, by clicking and dragging). So you’d have to do this manually – for example, by adding the correct annotation to your dataset:

{"text": "Bill Jerome Holmes is a person", "spans": [{"start": 0, "end": 18, "label": "PERSON", "text": "Bill Jerome Holmes"}], "answer": "accept"}

If you’re looking for a tool that lets you click/drag/highlight/select, check out Brat. It’s more complex, but it’ll let you create exact entity spans and boundaries by selecting them. I don’t remember what the output format looks like in detail, but you should be able to easily convert it to Prodigy’s JSONL format, and add it to your data.
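For reference, here’s a rough sketch of what such a conversion could look like for Brat’s standoff format (a .txt file plus a .ann file with text-bound annotation lines like `T1<TAB>PERSON 0 18<TAB>Bill Jerome Holmes`). The exact details may differ depending on your Brat configuration, so treat this as a starting point rather than a complete converter:

```python
import json

def brat_to_prodigy(txt_path, ann_path):
    """Convert a Brat standoff (.txt + .ann) pair into one Prodigy-style example."""
    with open(txt_path, encoding="utf8") as f:
        text = f.read()
    spans = []
    with open(ann_path, encoding="utf8") as f:
        for line in f:
            if not line.startswith("T"):  # only text-bound annotations
                continue
            _, type_offsets, span_text = line.rstrip("\n").split("\t")
            if ";" in type_offsets:  # discontinuous spans not handled here
                continue
            label, start, end = type_offsets.split()[:3]
            spans.append({"start": int(start), "end": int(end),
                          "label": label, "text": span_text})
    return {"text": text, "spans": spans, "answer": "accept"}
```

You can then write one `json.dumps(example)` per line to get a JSONL file for db-in.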


Right, so that’s what ner.make-gold is for? I couldn’t quite grasp the make-gold terminology. I should have picked that up instead of letting it nag at me every now and then. Hehe

This is awesome! Thank you, Ines!


No worries. For NER, the gold-standard annotations are essentially the complete and correct set of entities on the text.

When you annotate with Prodigy, you’re usually only collecting annotations for one particular part of the document at a time. This is fine, because it still gives us plenty of gradients to train on, which will likely improve the model. But it also means that you won’t necessarily cover all entities that occur in the document, or entity-related information for all tokens (e.g. if the token is part of an entity or not).

For example, the gold-standard NER annotations for the sentence “Bill Jerome Holmes is a person and Facebook is not” would be:

('B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-ORG', 'O', 'O')

B = beginning of an entity, I = inside an entity, L = last token of an entity, U = entity unit (i.e. single-token entity) and O = outside an entity.
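To make this concrete, here’s a small stand-alone sketch that converts character-offset spans into BILUO tags. It uses naive whitespace tokenization for simplicity – a real tokenizer like spaCy’s will split differently in some cases, so this is purely illustrative:

```python
def biluo_tags(text, spans):
    """Convert (start, end, label) character spans into per-token BILUO tags.
    Uses naive whitespace tokenization; real tokenizers differ."""
    tokens = []
    offset = 0
    for tok in text.split():
        start = text.index(tok, offset)
        tokens.append((start, start + len(tok)))
        offset = start + len(tok)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # all tokens fully contained in the span
        idxs = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
        if len(idxs) == 1:
            tags[idxs[0]] = f"U-{label}"  # single-token entity
        elif idxs:
            tags[idxs[0]] = f"B-{label}"
            tags[idxs[-1]] = f"L-{label}"
            for i in idxs[1:-1]:
                tags[i] = f"I-{label}"
    return tags
```

Running it on the example sentence with the spans (0, 18, PERSON) and (35, 43, ORG) reproduces the tag sequence above.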

So let’s assume you’ve come across this sentence in Prodigy and select ACCEPT for “Facebook” as an ORG. The state of the gold-standard annotations will look like this:

('?', '?', '?', '?', '?', '?', '?', 'U-ORG', '?', '?')

ner.make-gold will keep iterating over your data and keep asking you questions about the contained entities, until it’s filled in all the blanks (marked with a ? in my example). All possible annotations have different probabilities, and for some of them, we already know that they’re invalid – for example, the token before U-ORG can’t be a B- token (i.e. the beginning of an entity), because a B- token can only be followed by an I- (inside) or L- (last) token. As you annotate, you also define more constraints that should – hopefully – narrow in on the one, correct solution.
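As an illustration of those constraints, here’s a hypothetical little validator that checks whether a BILUO tag sequence is internally consistent – i.e. whether every B- is eventually closed by an L- of the same label, and no I-/L- appears without an open entity:

```python
def is_valid_biluo(tags):
    """Return True if the BILUO tag sequence is internally consistent."""
    open_label = None  # label of the entity currently being built, if any
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if open_label is None:
            if prefix in ("I", "L"):
                return False  # inside/last without a preceding B-
            if prefix == "B":
                open_label = label
        else:
            # after B-/I-, only I- or L- of the same label may follow
            if prefix not in ("I", "L") or label != open_label:
                return False
            if prefix == "L":
                open_label = None
    return open_label is None  # no entity left unclosed
```

Checking candidate analyses against constraints like these is what lets the recipe rule out impossible combinations as you annotate.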

Whether you really need gold-standard annotations for what you’re doing is a different question. If you’re creating a training corpus or evaluation data, you’ll likely want annotations that cover everything that’s in the document. If you just want to improve the model or the overall accuracy, you might be better off simply feeding Prodigy more examples and more data that it can learn and generalise from. This is also more fun and less tedious than going over the same data again and again.


Thanks for the detailed description of how ner.make-gold works!

I’ve started a session for making a gold-standard dataset and made one iteration over all annotations. I’m assuming that the 2781 annotations I imported to the dataset are covered by the 3038 binary decisions I’ve made with ner.make-gold so far.

However, I’m not seeing any improvement. Actually, I think it is giving me the same suggestions as before. Also, is it possible to shuffle the suggestions a bit? I found it really frustrating to see the same mistakes over and over again, just with different context words. I’m annotating PERSON labels right now, and I keep getting some people who have middle names; it very rarely gets those right. All suggestions on the exact same name made the same mistake, just with different context words. Now that I’m past the total number of annotations in the dataset, should I expect it to adjust to the rejected annotations? Can’t see that happening :stuck_out_tongue:

I’d like to add that my corpus is very poorly structured, so context words aren’t necessarily the best measure of word meaning here. So I’m starting to think maybe these names should just be hard rules, so to speak.

Sorry, I can’t add any screenshots, because the names are sensitive.

Hope you understand! :blush:

Edit: I also see the exact same accepted annotations

I think maybe there is something wrong with what I’m doing, so I’ll just post my workflow to see if that’s the case.

First export all the annotations from my first dataset:
prodigy db-out long_text_annotations ./annotations/

Then I create my gold-standard database and import the exported annotations:
prodigy dataset long_text_annotations_gold "Gold standard annotations"
prodigy db-in long_text_annotations_gold ./annotations/long_text_annotations.jsonl

Now before I run ner.make-gold, I have the following datasets:

→ prodigy stats -ls

 ✨  Prodigy stats

  Version            0.5.0
  Location           /Users/my_user_name/Code/master_thesis/env_py3/lib/python3.6/site-packages/prodigy
  Prodigy Home       /Users/my_user_name/.prodigy
  Platform           Darwin-16.7.0-x86_64-i386-64bit
  Python Version     3.6.3
  Database Name      SQLite
  Database Id        sqlite
  Total Datasets     2
  Total Sessions     20

 ✨  Datasets

  long_text_annotations, long_text_annotations_gold

Then I run ner.make-gold for a few annotations (with a model from ner.batch-train):

prodigy ner.make-gold long_text_annotations_gold ./models/model-2-person --label PERSON

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

After I made 100 annotations, I hit Save, closed the browser window and terminated the web server, and I see this message:

Saved 102 annotations to database SQLite
Dataset: None
Session ID: 2017-11-24_13-50-57

And the following new stats:

→ prodigy stats -ls

 ✨  Prodigy stats

  Version            0.5.0
  Location           /Users/my_user_name/Code/master_thesis/env_py3/lib/python3.6/site-packages/prodigy
  Prodigy Home       /Users/my_user_name/.prodigy
  Platform           Darwin-16.7.0-x86_64-i386-64bit
  Python Version     3.6.3
  Database Name      SQLite
  Database Id        sqlite
  Total Datasets     3
  Total Sessions     20

 ✨  Datasets

  long_text_annotations, long_text_annotations_gold, 2017-11-24_13-50-57

Shouldn’t the annotations be saved to my long_text_annotations_gold dataset? and not as a new dataset?

Yes, the make-gold recipe is special in this way, because it keeps adding to the existing dataset and updates the individual examples with more entities (instead of saving each annotated example separately in the dataset).

The message it prints is a bit misleading here – I’ll think about a better way to solve this. Internally, the dataset is set to None to prevent Prodigy’s default behaviour and instead, all collected annotations are reconciled with the existing set afterwards in the recipe’s on_exit method. Prodigy should probably tell the user about this – will fix that!
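Conceptually (this is just a sketch, not Prodigy’s actual implementation), that reconciliation step could look something like this: group the collected examples by their text and merge the accepted spans into one example per input:

```python
def reconcile(examples):
    """Merge the spans of examples that share the same text into one
    example per input (a sketch of what an on_exit reconciliation might do)."""
    merged = {}
    for eg in examples:
        key = eg["text"]
        merged.setdefault(key, {"text": key, "spans": []})
        for span in eg.get("spans", []):
            if span not in merged[key]["spans"]:  # avoid duplicate spans
                merged[key]["spans"].append(span)
    return list(merged.values())
```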

About your problem / use case: As I said, I’m not sure you actually need the ner.make-gold workflow. The gold-standard annotations only seem to be a secondary concern for your use case – what you actually care about is improving the model’s predictions on your PERSON entities, right?

I am surprised it’s learning so slowly, though – I understand that you can’t share your data, but what kind of results are you getting during training? And how does the training curve look (via ner.train-curve)? As I said, the examples you’re seeing during ner.teach are not always representative, because Prodigy is asking you about the examples it’s most unsure about. So what’s more interesting here is the training results, and especially the improvements with different amounts of training data (e.g. 50% vs. 75% vs. 100%).

If it turns out that the pre-trained model is really struggling on “[first name] [middle name] [last name]” entities in your specific data, another strategy is to pre-train it with very explicit examples: see here for a simple training script. Your training data would focus on representative examples of those name entities and in addition to that, you’d mix in some other names in different formats to prevent the model from adjusting too much and “forgetting” all the other PERSON entities it previously recognised. Even a good selection of 20-30 examples can already have a big impact here to teach the model a better concept of 3-word person entities (i.e. B-PERSON I-PERSON L-PERSON) on your specific text. This should hopefully give you a better adjusted model, and let you overcome that “cold start” problem in Prodigy.
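For instance, here’s a quick sketch of how you could put together such a mix – the function name, inputs and the 50% ratio are just illustrative, not anything Prodigy-specific:

```python
import random

def build_training_mix(focus_examples, generic_examples,
                       generic_ratio=0.5, seed=0):
    """Combine hard 3-word name examples with a sample of other PERSON
    examples so the model doesn't 'forget' the formats it already knows."""
    rng = random.Random(seed)
    n_generic = min(len(generic_examples),
                    int(len(focus_examples) * generic_ratio))
    mixed = list(focus_examples) + rng.sample(generic_examples, n_generic)
    rng.shuffle(mixed)
    return mixed
```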

This is really helpful guidance about teaching the model to extend the NER spans. I’ve been coming across a lot of situations where the default model doesn’t get all parts of place names, especially when transliterated from Arabic. I had been hitting ignore (because the prediction was more right than wrong), but it sounds like it’s better to reject until it extends the span.


I believe they have since added a workflow for manually entering entity spans – ner.manual. It’s convenient: just click and drag to highlight the span of the entity.
The format is:
prodigy ner.manual [dataset] [model name] [path to JSONL with a text field] --label [label name]