Span annotation with ner.manual -- how to make use of ner.teach

I think you somehow ended up with slightly messy datasets that mix annotations of different types and from different processes. Ideally, you want to create a separate dataset for each annotation experiment. If you mix annotations from, say, ner.manual (fully manual, all entities gold-standard, no missing values) with ner.teach (binary, only one span at a time, all other tokens missing values) and put them all in the same set, you won't be able to train a useful model with that, because there's no way to tell which examples are gold-standard and which aren't, and you might even have a bunch of conflicts.
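
To make the difference concrete, here's roughly what the two kinds of records look like, simplified and with made-up text, offsets and labels:

```json
{"text": "Apple opened an office in Paris.", "_view_id": "ner_manual", "answer": "accept", "spans": [{"start": 0, "end": 5, "label": "ORG"}, {"start": 26, "end": 31, "label": "GPE"}]}
{"text": "Apple opened an office in Paris.", "_view_id": "ner", "answer": "reject", "spans": [{"start": 26, "end": 31, "label": "PERSON"}]}
```

The first record asserts that these are *all* the entities in the text. The second only says that "Paris" is not a PERSON and makes no claim about any other token, so the two can't be treated the same way at training time.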

I'd recommend exporting the data you have, going through it in the JSON file or with a Python script, and seeing if you can clean it up a bit. The `_view_id` of each record tells you the ID of the annotation interface, so you probably want to separate examples created with `ner` (binary) from those created with `ner_manual` (manual). Each example also has an `_input_hash`, so you can identify annotations created on the same input text. You can also call `prodigy.set_hashes(example, overwrite=True)` on each example to make sure you have no stale hashes, and then use the `_task_hash` to find duplicates.
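
If it helps, here's a minimal sketch of that cleanup, assuming you've exported the data with `prodigy db-out my_dataset > annotations.jsonl` (the dataset and file names are just placeholders):

```python
import json
from prodigy import set_hashes

manual, binary = [], []
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        # Recompute the hashes so stale values from earlier runs can't mislead us
        eg = set_hashes(eg, overwrite=True)
        if eg.get("_view_id") == "ner_manual":  # created with ner.manual
            manual.append(eg)
        elif eg.get("_view_id") == "ner":       # created with ner.teach (binary)
            binary.append(eg)

def dedupe(examples):
    # Use the recomputed _task_hash to drop duplicate annotations
    seen, unique = set(), []
    for eg in examples:
        if eg["_task_hash"] not in seen:
            seen.add(eg["_task_hash"])
            unique.append(eg)
    return unique

manual, binary = dedupe(manual), dedupe(binary)
print(f"{len(manual)} manual examples, {len(binary)} binary examples")
```

You can then write each group back out as JSONL and import them into separate datasets with `prodigy db-in`.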
