I have 16 custom entity types, two of which share names with existing entity types (PERSON and DATE). When I train my model, load it in spaCy to visualize annotations via displacy, and apply it to a paragraph, everything generally looks fine; however, if I apply my model to a single sentence from that paragraph, it is massively over-annotated (every token is annotated, including punctuation marks). Additionally, types not included in my model, such as WORK_OF_ART or CARDINAL, are appearing.
What might be causing this behavior and how can I avoid it?
I think you might be training your model starting from an existing NER model? If so, the model retains the labels from the original model, which I think isn't what you want.
The simplest way to avoid this is to use the en_vectors_web_lg model as the model argument for ner.batch-train. This way it won't start off with an NER model, so you won't get the prior classes. If you don't want the vectors either, you can start out with a blank model. The easiest way to do that is to save one to disk, like this:
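A sketch of that step, with a placeholder output path:

```python
import spacy

# Create a blank English pipeline (no pretrained components) and save it to
# disk so the directory can be passed as the model argument to ner.batch-train.
nlp = spacy.blank("en")
nlp.to_disk("/path/to/blank_en_model")  # placeholder path
```

You can then pass that directory path as the model argument instead of a pretrained package.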
Thank you for the response. I had tried using a blank model with no luck; however, I believe my issue was tied to having long documents that were being split into sentences.
Looking at the documents split by sentence, only half of the sentences contained annotations. I took the sentence-split data, removed sentences that didn't contain any annotations, and then retrained the model. This solved the false positive issue and achieved reasonable performance.
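For reference, a rough sketch of that filtering step, assuming the sentence-split annotations live in a JSONL file where each record has "text" and "spans" keys (the file names and keys here are assumptions, not my actual export format):

```python
import json

# Read the sentence-split annotations (assumed JSONL, one example per line).
with open("sentences.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

# Keep only sentences that contain at least one entity span.
with_spans = [eg for eg in examples if eg.get("spans")]

# Write the filtered set back out for retraining.
with open("sentences_with_spans.jsonl", "w", encoding="utf8") as f:
    for eg in with_spans:
        f.write(json.dumps(eg) + "\n")
```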
I also tried training with --unsegmented. This also fixed the false positive issue, but model performance wasn't as strong. I suspect the annotation of my documents may be incomplete, leaving some sentences unannotated that should have been annotated, but I'm curious why this might result in excessive false positives in addition to poor general performance.
A related question/observation: when running the model, it produces slightly different results depending on whether I send single sentences or entire documents to it. Is there a best practice for how large the segments of text should be?
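To illustrate what I mean, the comparison looks something like this (the model name and example sentences are placeholders):

```python
import spacy

nlp = spacy.load("my_custom_model")  # placeholder model name

sentences = [
    "First sentence of the document.",
    "Second sentence of the document.",
]
paragraph = " ".join(sentences)

# Entities predicted when the whole paragraph is processed at once.
doc_ents = [(ent.text, ent.label_) for ent in nlp(paragraph).ents]

# Entities predicted when each sentence is processed on its own.
sent_ents = [(ent.text, ent.label_)
             for sent in sentences
             for ent in nlp(sent).ents]

print(doc_ents)
print(sent_ents)  # slightly different from doc_ents in my case
```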