I have used this tutorial to train my model to recognize a new entity. I would like the model to recognize not only separate words, but phrases as well. Are named entities limited to one word? Also, is there a way to generate my own custom training and testing dataset instead of uploading the Reddit corpus?
Also I used displacy to display entities from the text I uploaded. I got a pretty odd result:
Sure, that’s what Prodigy is for! In the example, we’re using data we’ve downloaded from the Reddit Comments corpus as the input data, because it’s freely available and nice to work with. But you can use any text you have, in any format.
No, entities can consist of one or more tokens. One token can only be part of one entity – but many entities like person names or company names typically have several tokens. If you want your model to learn multi-word entities, your training data needs to contain enough examples of them.
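For example, in spaCy's training format an entity span is defined by character offsets, so a multi-word entity is just a longer span. Here's a minimal sketch (the texts and labels are made up for illustration):

```python
# Training examples in spaCy's (text, annotations) offset format.
# Each entity is (start_char, end_char, label) – end is exclusive.
# "Apple Inc." and "New York City" are multi-token entities covered
# by a single span each.
TRAIN_DATA = [
    (
        "Apple Inc. opened a store in New York City.",
        {"entities": [(0, 10, "ORG"), (29, 42, "GPE")]},
    ),
    (
        "San Francisco considers banning delivery robots.",
        {"entities": [(0, 13, "GPE")]},
    ),
]

# Sanity check: the offsets should slice out exactly the entity text.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, "->", text[start:end])
```

One thing to watch out for: the character offsets need to line up with token boundaries, otherwise spaCy can't align the span to tokens and the example may be skipped during training.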
Could you share an example of your training data and how you trained your model? What you’re seeing here can happen if you train from a pre-trained model but your new data didn’t contain any examples of any of the previous entities.
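One common fix for this "catastrophic forgetting" effect is to mix a few examples of the entity types the model already knows in with your new-label examples, so the model keeps seeing what it previously learned. A minimal sketch, with made-up texts and a hypothetical new ANIMAL label:

```python
# New-label examples (the entity type you're teaching the model).
NEW_LABEL_DATA = [
    (
        "The horse gallops across the field.",
        {"entities": [(4, 9, "ANIMAL")]},
    ),
]

# A few examples annotated with entity types the pre-trained model
# already predicts, to remind it of what it knew before.
EXISTING_LABEL_DATA = [
    (
        "Google was founded by Larry Page.",
        {"entities": [(0, 6, "ORG"), (22, 32, "PERSON")]},
    ),
]

# Train on the mix, not on the new label alone.
TRAIN_DATA = NEW_LABEL_DATA + EXISTING_LABEL_DATA
print(len(TRAIN_DATA), "training examples")
```

If you only ever show the model ANIMAL examples, the weights drift and it can stop predicting ORG, PERSON and the rest – which would explain the odd displaCy output you're seeing.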