Hi!
Situation
I use a blank model to train a new NER entity type, which is the only entity type in the model. Inputs are spans of varying length (approx. 1-15 tokens), which may or may not contain this entity (in some cases the entire span should be recognized as the entity).
There is a set of words/tokens that will never be part of this entity. I used ner.manual and ner.teach to generate a small(?) dataset of 2000 entries.
Problem
This dataset contains many entries in which the "no-go" words/tokens mentioned above are explicitly rejected as single entities (e.g. commas, "|" or the word "Impressum"). Still, after batch-training a new model on my dataset (accuracy 85+%) and looking at its predictions, the model often marks these as single-token entities.
Question
How many examples explicitly rejecting these tokens do I need to avoid this behaviour? I know I could add a filter to the pipeline that eliminates these obvious false positives, but I think this is not the way to go.
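To make the filter idea concrete, here is a minimal sketch of what I have in mind. It is plain Python and does not use any spaCy or Prodigy API; the function name, the label "MY_ENTITY" and the no-go list are made up for illustration, and I assume the predictions are available as (start, end, label) token offsets.

```python
# Hypothetical no-go list; the real one would come from my own data.
NO_GO_TOKENS = {",", "|", "Impressum"}


def drop_no_go_entities(tokens, entities, no_go=NO_GO_TOKENS):
    """Drop predicted entities that consist only of no-go tokens.

    tokens:   list of token strings for one input
    entities: list of (start, end, label) token offsets, end exclusive
    """
    kept = []
    for start, end, label in entities:
        span_tokens = tokens[start:end]
        # Skip spans made up entirely of no-go tokens (e.g. a lone comma).
        if all(tok in no_go for tok in span_tokens):
            continue
        kept.append((start, end, label))
    return kept


tokens = ["Impressum", "|", "Acme", "GmbH", ","]
predicted = [(0, 1, "MY_ENTITY"), (2, 4, "MY_ENTITY"), (4, 5, "MY_ENTITY")]
filtered = drop_no_go_entities(tokens, predicted)
print(filtered)
```

This only removes entities made up entirely of no-go tokens; it would not repair a longer span that merely contains one of them.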
I found topics like "Forcing NER to ignore stopwords" that I could adapt to my question. Still, I was wondering whether the newer versions offer another way to tell spaCy/Prodigy that some words can't be part of an entity.
A last addition: Is it possible to restrict the model in such a way that it can only predict ONE or NONE span of my entity type per input document? For example, if there are several entity guesses, only take the highest-scoring one? Maybe this would already help with the problem mentioned before. How does one access this internal NER entity "scoring" in general?
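To illustrate the "ONE or NONE" behaviour I am after, here is a small sketch in plain Python. It assumes I already have a score for each candidate span, which is exactly the part I don't know how to obtain from the model; the function name, the label and the threshold parameter are hypothetical.

```python
def keep_best_entity(scored_entities, min_score=0.0):
    """Return the single highest-scoring candidate, or None.

    scored_entities: list of (start, end, label, score) tuples
    min_score:       candidates scoring below this are ignored
    """
    candidates = [ent for ent in scored_entities if ent[3] >= min_score]
    if not candidates:
        return None
    # Pick the candidate with the largest score (the fourth tuple element).
    return max(candidates, key=lambda ent: ent[3])


guesses = [
    (0, 1, "MY_ENTITY", 0.31),
    (2, 4, "MY_ENTITY", 0.87),
    (4, 5, "MY_ENTITY", 0.12),
]
best = keep_best_entity(guesses, min_score=0.5)
print(best)
```

With a threshold like this, a document whose guesses all score low would end up with no entity at all, which is the NONE case.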
Thank you for your help!