Hey there!
First and foremost, great work here. I am new to ML / NLP / NER and am very happy you’ve come up with a solution to the “start from scratch” problem that has deterred me from experimenting in the past.
My use case
I am attempting to identify items in a domain that have a relatively specific syntax. There are ~9 attributes that make up a complete item; I will call them A1-A9. These are written in various ways, and raw text will often contain 4 or more attributes, which is enough detail to make the item worth extracting.
I have had good success taking a seed terms list, building patterns relevant to the attribute’s syntax (using pattern matching and shape), and utilizing ner.match to identify annotations. Ex:
prodigy dataset ner_a1 "Attribute 1 details"
prodigy ner.match ner_a1 en_core_web_lg raw-data.jsonl --patterns a1-patterns.jsonl
I have done this for 3 attributes and have created 3 individual datasets, 3 terms lists, and 3 patterns lists (with each attribute having a single ENTITY label: A1, A2, A3, etc.). Overall, the ability to match the entity based on the patterns has been encouraging, and I have gone through a few hundred examples for each attribute using prodigy.
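To make the pattern-building step concrete, here is a rough sketch of how one of my patterns files could be generated from a seed terms list. The seed terms, the "dddd" shape, and the helper names (build_patterns, to_jsonl) are hypothetical placeholders rather than my real data, and it assumes the usual one-JSON-object-per-line layout with a "label" and a token-based "pattern". Splitting terms on whitespace is also a simplification that may not line up with spaCy's tokenizer for punctuation or hyphens.

```python
import json


def build_patterns(label, seed_terms, shapes=()):
    """Build match-pattern dicts for one entity label (illustrative only)."""
    patterns = []
    for term in seed_terms:
        # One token dict per whitespace-separated word, matched on the
        # lowercased form. Assumption: whitespace splitting approximates
        # the tokenizer well enough for these terms.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    for shape in shapes:
        # Single-token pattern on the word shape, e.g. "dddd" = four digits.
        patterns.append({"label": label, "pattern": [{"shape": shape}]})
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


# Hypothetical seed terms and shape for attribute A1.
a1_patterns = build_patterns("A1", ["Acme Widget", "gizmo"], shapes=["dddd"])
print(to_jsonl(a1_patterns))
```

The printed lines could be saved to a file and passed to --patterns. Each attribute would get its own call with its own label, which is how I ended up with 3 separate patterns lists.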
Question 1
It isn’t clear to me if I should be using a new dataset for each attribute or not. Note: these new entities are related to each other, and I suspect I will want to do something like entity_relations.py but I haven’t gotten that far yet.
Based on this post, it sounds like I should be using separate datasets, but I am hoping for some clarification based on my use case.
Question 2
My understanding is that to create a better model and improve annotations, I should use ner.teach or ner.manual to add annotations with all of the entities together. ner.teach takes in a single dataset and a single patterns file (I think?) but multiple labels.
Should I be teaching/manually annotating each attribute separately? If not, which dataset does that information belong to?
Question 3
To create the actual model, I believe I am supposed to use ner.batch-train. Given the multi-dataset questions above, it is unclear to me what I should be attempting. Is chaining the training runs like this correct?
prodigy ner.batch-train ner_a1 en_core_web_lg --output a1-model --label A1
prodigy ner.batch-train ner_a2 a1-model --output a2-model --label A2
prodigy ner.batch-train ner_a3 a2-model --output a3-model --label A3
Admittedly, most of my work thus far has been with prodigy’s CLI rather than scripting with spaCy, as I have been focused on building out the training data.
Please let me know if you can clarify my confusion! Also, I apologize in advance if these questions have been answered before.
Thanks again!