Hello there,
I am currently evaluating Prodigy (thanks Walter) and testing whether it can help me build a NER model that recognizes entities in my specific domain.
The sentences are on average 75 characters long and read more like product descriptions than grammatically complete sentences with nouns, objects, adjectives, etc.
To start the experiment, I created a patterns file that contains both specific terms and wildcard rules that match measurements (e.g. 100 kg, 50 grams, etc.).
Examples of the patterns:
- {"label": "COLOR", "pattern": "green"}
- {"label": "COLOR", "pattern": "blue"}
- ...
- {"label": "KG", "pattern": [{"LIKE_NUM": true, "OP": "?"}, {"LOWER": "kg"}]}
Then, I started the ner.teach recipe like this:
prodigy ner.teach prodigy_test7 en_core_web_md ./prodigy_test_products.jsonl --label COLOR,KG --patterns ./prodigy_test_patterns.jsonl
The question
I would also like to train the model on SKUs. These tokens are more or less random strings that do not follow any pattern (they are defined by the manufacturers based on their own decisions).
These tokens can be "anything" and they can appear anywhere in a sentence.
Token examples:
- 204A-XR-888
- 475 SSIG 06
- USR8IBNU4-SE.333K
- 063 890 4 210
I have thousands of SKUs available as a list, so I can create a patterns file from it (see the sketch below). But will the model be able to generalize from this? It would be great to know whether this is realistic before I start annotating.
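This is roughly how I would generate the patterns file from the SKU list. File names and the SKU label are just placeholders; the important part is tokenizing each SKU with spaCy so that multi-token SKUs like "475 SSIG 06" produce one pattern entry per token:

```python
import json
import spacy

# blank pipeline: we only need the tokenizer, so the pattern tokens
# line up with the tokens Prodigy/spaCy will see during annotation
nlp = spacy.blank("en")

with open("skus.txt", encoding="utf-8") as f_in, \
     open("sku_patterns.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        sku = line.strip()
        if not sku:
            continue
        # one token pattern per token of the SKU
        pattern = [{"LOWER": tok.lower_} for tok in nlp(sku)]
        f_out.write(json.dumps({"label": "SKU", "pattern": pattern}) + "\n")
```

My concern is that this only covers the SKUs already on the list, which is exactly why I am asking whether the model can generalize beyond them.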
Just a few minutes ago, I read this thread https://support.prodi.gy/t/help-first-process-of-annotation/4156/2 where Ines writes:
When using NER, make sure that your entity types still follow the same conceptual idea of "named entities", otherwise your model might struggle to learn them efficiently. They don't have to be PERSON or ORG, but they should work in a similar way and describe distinct expressions like proper nouns with clear boundaries that can be determined from the local context. If that's not the case, a named entity recognition model might not be the right fit for what you're trying to do. Instead, you might want to experiment with a hybrid pipeline of more generic and classic NER labels + a text classification model.
In case this also applies to my NER problem, what does Ines's advice (the hybrid pipeline) mean in practice? I am not that deep into the topic yet, and her advice is still a bit abstract for me.
Thanks for all your help
Jens