I've been reading through the examples on nested NER annotation, and they suggest labelling the top-level entities first, then exporting the annotations and doing a second pass.
I've tried this using a custom recipe (after exporting the first pass to JSONL) but it's not working: it says no tasks are available. Does that mean I have to use a different database for each pass?
Also, am I correct in assuming that for the second pass I have to remove the original spans? When I used a different database, it highlighted the entities from the first pass and wouldn't let me add the next-level entities unless I deleted the original annotation via the UI or removed it from the example in the recipe.
This likely happens because the hashes of the examples are the same, so by default Prodigy skips them to make sure you're only asked about an example once. One way to solve this is to re-hash the examples, or to disable excluding annotations from the current set.
We typically recommend using a new dataset for each new experiment / annotation type, though, as it makes it much easier to keep the data separate and start over if you have to. You can always merge your examples later.
Yes, for the second pass, the "spans" should be empty, so you can start the next layer of annotations. You can still store the previous layer of spans in a different property on the same JSON object, and display them in a second block (e.g. using custom HTML) if you want to.
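To make that concrete, here is a minimal sketch in plain Python (no Prodigy import) of preparing first-pass examples for a second pass. It assumes each exported example is a dict with "text" and "spans", and that the task hash is stored under "_task_hash". The property name "spans_level1" and the function name are made up for illustration, and whether a missing hash gets re-added when the stream is loaded is an assumption worth checking against your setup.

```python
import copy


def prepare_second_pass(examples, previous_key="spans_level1"):
    """Yield copies of first-pass examples with an empty "spans" list.

    previous_key is a hypothetical property name for the earlier layer.
    """
    for eg in examples:
        task = copy.deepcopy(eg)  # leave the exported data untouched
        # Keep the first layer for reference under a separate key.
        task[previous_key] = task.get("spans", [])
        # Start the next layer with nothing pre-highlighted.
        task["spans"] = []
        # Drop the old task hash so the new task isn't treated as a repeat
        # (assumption: a missing hash is recomputed when the task is loaded).
        task.pop("_task_hash", None)
        # Drop the first-pass decision so the second pass starts clean.
        task.pop("answer", None)
        yield task
```

The stored first layer could then be rendered read-only in a second block, while the empty "spans" list is what the manual interface edits.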
Btw, there's one feature I haven't implemented yet but that should make this type of workflow easier: just like the image_manual interface, the ner_manual UI should also be able to customise the key (e.g. "spans") it reads from and writes to. This would let you have multiple blocks that show different layers of spans, provided by different keys in the JSON data. You could make both blocks editable so you can edit multiple layers at the same time, or make one of them use the static ner to display reference annotations.
Thanks @ines, the new feature you mention would be excellent. What I really want to do is use something based on ner_manual to identify the top-level spans. Then I pass each span back for annotation, which would involve identifying any sub-spans, along with two text_input blocks to request the type and id of the entity. The second step is recursive, operating on a single span at a time.
So what I'm doing is extracting the spans from the first pass and converting them to examples that reference the hash of the original doc. Then I pass them through a second recipe, with a callback to add any newly created spans to the stream.
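Roughly, the span-to-example step looks like the sketch below. It is plain Python and assumes the first-pass examples carry "text", "spans" (with "start", "end", "label") and "_input_hash". The "meta" keys ("parent_hash", "parent_start", "parent_label") and both function names are my own, not anything Prodigy defines.

```python
def spans_to_examples(doc):
    """Turn each span of one annotated doc into its own annotation task."""
    text = doc["text"]
    for span in doc.get("spans", []):
        yield {
            "text": text[span["start"]:span["end"]],
            "spans": [],  # next layer starts empty
            "meta": {
                # Hypothetical keys linking the task back to its source.
                "parent_hash": doc.get("_input_hash"),
                "parent_start": span["start"],
                "parent_label": span.get("label"),
            },
        }


def to_parent_offsets(child, sub_span):
    """Shift a sub-span's offsets from the child text to the parent text."""
    offset = child["meta"]["parent_start"]
    return {
        **sub_span,
        "start": sub_span["start"] + offset,
        "end": sub_span["end"] + offset,
    }
```

For the recursive step, a child that has received sub-spans can be fed back into spans_to_examples. The offsets are then relative to that child, so they would need shifting once per level to get back to the original doc.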
Being able to pass the key of a span to the interface would be nice. Also perhaps an option to display spans as underlines rather than highlights may work better for multi-layer annotation?
Is the intent to allow spans to nest in spans in the JSON? Or perhaps a flat list as it is now, but with an optional parent key?
No, the spans would have to be separate lists, because they couldn't overlap – but you could then have two lists in your JSON (e.g. "spans_entities") and two span annotation blocks that refer to the two lists. This would also mean that all lists of spans you create can be used independently to train a named entity recognizer out-of-the-box (or you can combine them later and use them in a different process).
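As a rough illustration of the data shape, the sketch below checks that each list of spans is overlap-free on its own, even though the layers overlap each other. It is plain Python; "spans_entities" is the key mentioned above, while "spans_parts" and the function names are hypothetical, and none of this reflects an existing interface setting.

```python
def has_overlap(spans):
    """Return True if any two spans in one list share characters."""
    ordered = sorted(spans, key=lambda s: (s["start"], s["end"]))
    # After sorting by start, any overlap shows up between neighbours.
    return any(a["end"] > b["start"] for a, b in zip(ordered, ordered[1:]))


def check_layers(task, keys):
    """Map each span key to True if its list is overlap-free."""
    return {key: not has_overlap(task.get(key, [])) for key in keys}


task = {
    "text": "Bank of England",
    "spans_entities": [{"start": 0, "end": 15, "label": "ORG"}],
    # Hypothetical second layer; the key name is made up.
    "spans_parts": [{"start": 8, "end": 15, "label": "GPE"}],
}
layer_report = check_layers(task, ["spans_entities", "spans_parts"])
```

Each list passes the check by itself, which is what lets it be used independently for training. Concatenating the two lists would fail the same check, which is why a single flat "spans" list can't hold both layers.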