Hi, I retrained the named entity recogniser to add a new entity (ACTIVITY) and added some custom components to the pipeline to improve the outcome of the NER later on. I packaged the model (including the factories for the custom components) and installed the packages in another venv. Everything works fine when I only use spacy or visualize the entities with displacy. However, when I use the model in prodigy (ner.correct) I get shifted annotations.
As prodigy shows the same sequence of annotations, but always only one word, I suspect that the merging of the tokens somehow gets lost when I use the model in prodigy. But as it runs fine with spacy.load(), I have no idea how to fix that.
Not sure this is the answer you're looking for, but I noticed that prodigy's rendering is sensitive to the order of entities. So please make sure your tasks have their spans sorted in ascending token order. For good measure, you should probably make sure that the tokens are sorted too.
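To make the ordering suggestion concrete, here is a minimal sketch in plain Python. It assumes the usual task layout with "tokens" and "spans" lists whose entries carry "start" and "end" character offsets; sort_task is a hypothetical helper name, not part of Prodigy.

```python
def sort_task(task):
    """Return a shallow copy of the task with tokens and spans sorted by offset."""
    ordered = dict(task)
    ordered["tokens"] = sorted(task.get("tokens", []), key=lambda t: t["start"])
    ordered["spans"] = sorted(
        task.get("spans", []), key=lambda s: (s["start"], s["end"])
    )
    return ordered


# Example task whose spans are listed out of order.
task = {
    "text": "Go hiking then swimming",
    "spans": [
        {"start": 15, "end": 23, "label": "ACTIVITY"},  # "swimming"
        {"start": 3, "end": 9, "label": "ACTIVITY"},  # "hiking"
    ],
}
print([span["start"] for span in sort_task(task)["spans"]])
```

The original task is left untouched, so the helper can be applied to a stream of tasks without side effects.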
The thing is that I am using the prodigy ner.correct recipe with the custom model (including the custom pipeline components) on unannotated text. If I use the model on the same raw text to annotate the NEs and visualize the outcome with displacy, it performs really well. But in the ner.correct recipe it doesn't. If you compare the two screenshots, you will see that the sequence of annotations is the same, but the extent is not. Prodigy always annotates only one word, while in the displacy rendering most of the annotations span multiple words.
One of the custom components that I added to the pipeline merges tokens that belong to the same entity. So ent.end is always ent.start + 1. I suspect that prodigy (I use v1.10.2) either uses its own tokenizer and therefore gets the annotations that come from the model wrong, or I did something wrong when I created the pipeline that only breaks when I use it with prodigy.
Ah, that would explain what's going on, yes! Prodigy calls nlp.make_doc to generate a tokenized Doc object (which typically delegates to nlp.tokenizer) rather than running the full pipeline just to tokenize, which would usually be unnecessarily expensive.
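The mismatch can be reproduced outside of Prodigy. The following is a small sketch, assuming spaCy v3 (where components are added by string name) and using the built-in entity_ruler and merge_entities components as stand-ins for the custom NER and merging components from this thread. The example text and pattern are made up.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components needed.
nlp = spacy.blank("en")

# Stand-in for the retrained NER: a rule that tags one ACTIVITY phrase.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ACTIVITY", "pattern": "trail running"}])

# Stand-in for the custom component that merges each entity into one token.
nlp.add_pipe("merge_entities")

text = "She enjoys trail running on weekends."
tokenized_only = nlp.make_doc(text)  # tokenizer only, no merging
processed = nlp(text)  # full pipeline, entities merged

print(len(tokenized_only), [t.text for t in tokenized_only])
print(len(processed), [t.text for t in processed])
print([(ent.text, ent.start, ent.end) for ent in processed.ents])
```

The entity's token indices refer to the merged doc, so applying them to the tokens from make_doc points at a single, different-length token, which matches the one-word annotations described above.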
The best solution would probably be to just write your own adaptation of the ner.correct recipe, e.g. based on this template:
But instead of calling add_tokens, you'd just add a "tokens" property to the task using the processed doc object with named entities that's also used to add the entity suggestions. For example:
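    def get_token(token):
        """Convert a spaCy Token into a token dict for the task."""
        return {
            "text": token.text,
            "start": token.idx,  # character offset of the first character
            "end": token.idx + len(token.text),  # character offset after the last
            "id": token.i,  # index of the token in the doc (not token.idx)
            "ws": bool(token.whitespace_),  # whether trailing whitespace follows
        }

    # in the make_tasks helper
    task["tokens"] = [get_token(token) for token in doc]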
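As a quick sanity check for a custom recipe, the sketch below tests whether each span's token indices agree with its character offsets. It assumes spans carry "token_start" and an inclusive "token_end" alongside "start" and "end"; find_shifted_spans and make_tokens are hypothetical helper names, and the example data is made up.

```python
def find_shifted_spans(task):
    """Return spans whose token indices disagree with their character offsets."""
    tokens = task["tokens"]
    shifted = []
    for span in task.get("spans", []):
        first, last = span["token_start"], span["token_end"]
        if not (0 <= first <= last < len(tokens)):
            shifted.append(span)  # indices fall outside the token list
        elif tokens[first]["start"] != span["start"] or tokens[last]["end"] != span["end"]:
            shifted.append(span)  # tokens cover different text than the span
    return shifted


def make_tokens(pieces):
    """Build token dicts from (text, start) pairs; illustration only."""
    return [
        {"text": text, "start": start, "end": start + len(text), "id": i}
        for i, (text, start) in enumerate(pieces)
    ]


# Span as produced from a doc in which "trail running" is one merged token.
span = {"start": 11, "end": 24, "token_start": 2, "token_end": 2, "label": "ACTIVITY"}

unmerged = make_tokens([("She", 0), ("enjoys", 4), ("trail", 11), ("running", 17)])
merged = make_tokens([("She", 0), ("enjoys", 4), ("trail running", 11)])

print(len(find_shifted_spans({"tokens": unmerged, "spans": [span]})))
print(len(find_shifted_spans({"tokens": merged, "spans": [span]})))
```

A non-empty result suggests the tokens and the spans came from differently tokenized docs.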
That was the solution! Many thanks Ines!
Cool, glad it worked!
Btw, here's how I'm thinking about solving this going forward with spaCy v3: pipeline components in v3 are registered using the Language.component or Language.factory decorators that allow specifying additional meta information about the component, like the annotations it sets and requires (for pipeline analysis) and whether it retokenizes. So you could then mark your custom component that merges entities with retokenizes=True and Prodigy could check this and default to running the whole pipeline to produce a tokenized doc. If no components retokenize, it would use the more efficient make_doc.
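A rough sketch of what this could look like with the spaCy v3 API. The component name merge_activity_entities and the helpers pipeline_retokenizes and tokenize are hypothetical and only illustrate the idea; this is not how Prodigy actually implements the check.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component, flagged as retokenizing via its meta.
@Language.component("merge_activity_entities", retokenizes=True)
def merge_activity_entities(doc):
    # Merge every entity span into a single token.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    return doc


def pipeline_retokenizes(nlp):
    """True if any component in the pipeline declares retokenizes=True."""
    return any(nlp.get_pipe_meta(name).retokenizes for name in nlp.pipe_names)


def tokenize(nlp, text):
    """Run the full pipeline only if a component may change the tokenization."""
    return nlp(text) if pipeline_retokenizes(nlp) else nlp.make_doc(text)


nlp = spacy.blank("en")
print(pipeline_retokenizes(nlp))  # no components added yet
nlp.add_pipe("merge_activity_entities")
print(pipeline_retokenizes(nlp))  # the merging component is now in the pipeline
```

The check relies entirely on the declared meta, so a component that retokenizes without setting retokenizes=True would not be detected.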
That would of course be a very elegant solution. Looking forward to these changes.
Thanks for the update!