We're having a little difficulty with NER. I believe the system determines verbs and other parts of speech from the English dictionary. But for some reason, very basic nouns (which in most cases can't be anything else) were being tagged as verbs, giving entirely different outcomes. Maybe we are doing something wrong; any help would be appreciated.
Secondly, I can help out with writing documentation or experiments/tests if needed.
Hi! I'm not sure I fully understand the question or the problem – you mention verbs and nouns, but that'd be the part-of-speech tagger and not the entity recognizer? Those components are entirely separate in the spaCy pipeline.
How well a model performs on your data ultimately depends on how similar the data is to what the model was trained on. It's less about the English dictionary and more about the training data. If a model doesn't perform well on your data, you can fine-tune it on more examples – for instance, using the pos.correct recipe to improve the part-of-speech tagger: https://prodi.gy/docs/recipes#pos-teach
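To get a rough feel for how well the pretrained tagger fits your data before deciding to fine-tune, you could compare its predictions against a few tokens you've checked by hand. The snippet below is only a sketch: the `tag_accuracy` helper, the sample sentence and its hand-assigned tags are made up for illustration, and it assumes the `en_core_web_sm` model is installed.

```python
def tag_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag matches the hand-checked tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical sample: one sentence with coarse-grained tags assigned by hand.
    text = "Apple is looking at buying a startup"
    gold = ["PROPN", "AUX", "VERB", "ADP", "VERB", "DET", "NOUN"]
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")  # assumption: this model is installed
    except (ImportError, OSError) as err:
        print(f"spaCy or the model is not available: {err}")
    else:
        doc = nlp(text)
        predicted = [token.pos_ for token in doc]
        # Print each token next to the predicted and the hand-checked tag.
        for token, gold_tag in zip(doc, gold):
            flag = "" if token.pos_ == gold_tag else "  <-- differs"
            print(f"{token.text:<10} {token.pos_:<6} {gold_tag}{flag}")
        print("accuracy:", tag_accuracy(predicted, gold))
```

A single sentence won't tell you much, of course. The idea is to run something like this over a sample of your own documents and see whether the errors are systematic.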
Hi Ines, Thanks for the response.
We were trying to extract names from documents, and the spaCy entity recognizer missed some common proper nouns. When we dug a little further, we found the part-of-speech tagger had marked the nouns as verbs. My question is: do we need to train the system for this use case as well? We used the common example programs given on the spacy.io page with our sample texts.
I was under the impression that if the part of speech is not tagged properly, it will lead to wrong entity labelling. Maybe my question is more related to spaCy's initial parsing. Please correct and guide me.
The part-of-speech tagger and named entity recognizer are separate and part-of-speech tags are not used as features in the NER model. If both the tagger and entity recognizer struggle with your example, it could be an indicator that your texts are quite different from what the model was trained on.
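One way to see this for yourself is to run the same text through the pipeline with and without the tagger and compare the entities. This is just a sketch: the `pos_tags` and `entities` helpers and the sample sentence are made up for illustration, and it assumes the `en_core_web_sm` model is installed. Depending on your spaCy version, disabling the tagger may produce a warning from other components that expect part-of-speech tags.

```python
def pos_tags(doc):
    """Collect (token text, coarse-grained part-of-speech tag) pairs."""
    return [(token.text, token.pos_) for token in doc]


def entities(doc):
    """Collect (entity text, entity label) pairs."""
    return [(ent.text, ent.label_) for ent in doc.ents]


if __name__ == "__main__":
    # Hypothetical sample sentence; replace it with one of your own documents.
    text = "Maria Gonzalez joined Siemens in Berlin last year."
    try:
        import spacy
        full = spacy.load("en_core_web_sm")  # assumption: this model is installed
        no_tagger = spacy.load("en_core_web_sm", disable=["tagger"])
    except (ImportError, OSError) as err:
        print(f"spaCy or the model is not available: {err}")
    else:
        # The tagger and the entity recognizer show up as separate components.
        print("pipeline:", full.pipe_names)
        print("tags:", pos_tags(full(text)))
        print("entities with tagger:   ", entities(full(text)))
        print("entities without tagger:", entities(no_tagger(text)))
```

If the two entity lists come out the same, the wrong part-of-speech tags are not what's causing the missed names.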
If you're building a system for a custom use case, you typically want to train your own model, yes. An arbitrary pretrained model you download will always be limited by the data it was trained on – typically some general-purpose corpus. Training a custom model is where NLP gets really powerful and lets you solve your specific problems.
Btw, note that this is the forum for our annotation tool Prodigy, not a general-purpose forum for spaCy. Topics here sometimes cross over, as Prodigy integrates spaCy, but we're not able to answer spaCy usage questions on here.