I hope it works! 🤞 Sorry if this is still a little hard at the moment – as I said, that part is experimental. We're currently working on adding the required functions for this to the Prodigy core library. There'll be a helper that reconstructs the span-token indices, and the built-in NER recipes will also make sure to add them to the spans by default.
This means we could also add a `--manual` or `--edit` flag to the active-learning-powered recipes like `ner.teach`. So you could just do `ner.teach dataset en_core_web_sm my_data.jsonl --manual` and it'd show you the recognised spans, but make the task editable.
Yes, this makes sense! I haven't tested it, but it sounds reasonable. Btw, here's a simple `split_tokens` function you can use to add the `"tokens"` key yourself:
```python
def split_tokens(nlp, stream):
    for eg in stream:
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        yield eg
```
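To see what the `"tokens"` entries end up looking like without loading a spaCy model, here's a quick self-contained sketch – `fake_nlp` is a made-up whitespace tokenizer standing in for the real `nlp` object, which just mimics the `token.text` / `token.idx` attributes the function reads:

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy's nlp object: each "token" only needs
# the .text and .idx attributes that split_tokens accesses.
Token = namedtuple('Token', ['text', 'idx'])

def fake_nlp(text):
    tokens, idx = [], 0
    for word in text.split(' '):
        tokens.append(Token(word, idx))
        idx += len(word) + 1  # +1 for the space separator
    return tokens

def split_tokens(nlp, stream):
    # same function as above
    for eg in stream:
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        yield eg

stream = [{'text': 'Hello world'}]
(eg,) = split_tokens(fake_nlp, stream)
print(eg['tokens'])
# [{'text': 'Hello', 'start': 0, 'end': 5, 'id': 0},
#  {'text': 'world', 'start': 6, 'end': 11, 'id': 1}]
```

With a real spaCy pipeline you'd pass `nlp` directly instead of `fake_nlp`, and the `start`/`end` offsets come from spaCy's own tokenization.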
Now your pre-annotated spans only need a `startIdx` and `endIdx` (the naming is bad, this will be changed in the next update!). So a span describing token 0 to token 4 will need `"startIdx": 0, "endIdx": 5`.
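In other words, `endIdx` is exclusive, just like a Python slice. A quick sanity check with made-up example tokens:

```python
# Made-up token list with the "id" key produced by split_tokens above.
tokens = [{'text': word, 'id': i}
          for i, word in enumerate(['Prodigy', 'is', 'made', 'by', 'Explosion', '!'])]

# A span describing token 0 to token 4: endIdx is exclusive.
span = {'startIdx': 0, 'endIdx': 5}

covered = tokens[span['startIdx']:span['endIdx']]
print([t['id'] for t in covered])  # → [0, 1, 2, 3, 4]
```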