NER - Multi-entity and proper use of datasets


(Briggs Thompson) #1

Hey there!

First and foremost, great work here. I am new to ML / NLP / NER and am very happy you’ve come up with a solution to the “start from scratch” problem that has deterred me from experimenting in the past.

My use case
I am attempting to identify items in a domain that have a relatively specific syntax. There are ~9 attributes that make up a complete item. I'll call them A1-A9. They are written in various ways, and raw text often contains 4 or more attributes, which is enough detail to make the item worth extracting.

I have had good success taking a seed terms list, building patterns relevant to the attribute’s syntax (using pattern matching and shape), and utilizing ner.match to identify annotations. Ex:

prodigy dataset ner_a1 "Attribute 1 details"
prodigy ner.match ner_a1 en_core_web_lg raw-data.jsonl --patterns a1-patterns.jsonl
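
For readers following along, here is a minimal sketch of how a patterns file like a1-patterns.jsonl could be generated from a seed terms list plus a few token shapes. The helper names (build_patterns, to_jsonl), the seed terms, and the shapes are hypothetical and are not taken from my actual data. It assumes the token-based patterns format, with one JSON object per line containing a "label" and a "pattern" list of token attribute dicts.

```python
import json


def build_patterns(label, seed_terms, shape_sequences):
    """Build token-based match patterns for one entity label.

    seed_terms: phrases matched case-insensitively via the "lower" attribute.
    shape_sequences: lists of token shapes, e.g. ["Xxxxx", "dddd"].
    Note: terms are split on whitespace, which only approximates how
    spaCy would tokenize them (an assumption of this sketch).
    """
    patterns = []
    for term in seed_terms:
        tokens = [{"lower": tok} for tok in term.lower().split()]
        patterns.append({"label": label, "pattern": tokens})
    for shapes in shape_sequences:
        tokens = [{"shape": shape} for shape in shapes]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as newline-delimited JSON, one pattern per line."""
    return "".join(json.dumps(pattern) + "\n" for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical seed term and shape for attribute A1.
    patterns = build_patterns("A1", ["Foo Bar"], [["Xxxxx", "dddd"]])
    with open("a1-patterns.jsonl", "w", encoding="utf8") as f:
        f.write(to_jsonl(patterns))
```

Keeping one such file per attribute matches the one-dataset-per-attribute setup described here.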

I have done this for 3 attributes and have created 3 individual datasets, 3 terms lists, and 3 patterns lists (each attribute has a single entity label: A1, A2, A3, etc.). Overall, the ability to match the entity based on the patterns has been encouraging, and I have gone through a few hundred examples for each attribute using Prodigy.

Question 1
It isn’t clear to me if I should be using a new dataset for each attribute or not. Note: these new entities are related to each other, and I suspect I will want to do something like but I haven’t gotten that far yet.

Based on this post, it sounds like I should be using separate datasets, but I am hopeful for some clarification based on my use case.

Question 2
My understanding is that to create a better model and improve annotations, I should utilize ner.teach or ner.manual to add annotations with each entity together. ner.teach takes in a single dataset and a single patterns file (I think?) but multiple labels.

Should I be teaching/manually annotating each attribute separately? If not, which dataset does that information belong to?

Question 3
To create the actual model I believe I am supposed to use ner.batch-train. With the multi-dataset questions above, it is unclear (to me) what I should be attempting. Is this correct?

prodigy ner.batch-train ner_a1 en_core_web_lg --output a1-model --label A1
prodigy ner.batch-train ner_a2 a1-model --output a2-model --label A2
prodigy ner.batch-train ner_a3 a2-model --output a3-model --label A3

Admittedly, most of my work so far has been with Prodigy’s CLI rather than scripting with spaCy, as I have been trying to build out the training data.

Please let me know if you can clarify my confusion! Also, I apologize in advance if these questions have been answered before.

Thanks again!

(Ines Montani) #2

Hi! Nice to hear that your results have been promising so far. And those are all very valid questions. To some extent, how you set up your datasets depends on your personal preference and how you like to work. Datasets are intended as “related units of work” – this could be all annotations for one label, all annotations for one particular corpus or annotations you want to train a particular model on.

Using smaller datasets can sometimes be helpful in the experimentation phase, because it lets you iterate quicker and try out different combinations. For example, train on one label, train on two labels, train on all labels and so on. Or let’s say you’re training on labels A, B and C, and everything works well. But as soon as you add label D to the mix, it all goes downhill. This is a super valuable insight and will help you reason about what’s going wrong. But it would have been much more difficult to achieve if you only had one dataset.
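
As a rough illustration of that kind of experimentation, here is a small planning helper that lists every combination of per-label datasets you might want to train on. The function name dataset_combinations is hypothetical, and the sketch only enumerates the combinations. It doesn't call Prodigy or run any training.

```python
from itertools import combinations


def dataset_combinations(datasets):
    """Yield every non-empty combination of dataset names, smallest first."""
    for size in range(1, len(datasets) + 1):
        for combo in combinations(datasets, size):
            yield list(combo)


if __name__ == "__main__":
    # Dataset names as used in this thread.
    for combo in dataset_combinations(["ner_a1", "ner_a2", "ner_a3"]):
        print(" + ".join(combo))
```

Each combination can then be merged into a temporary dataset and trained on, so you can see which label causes results to drop.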

Merging datasets is always easier than splitting them. One thing you definitely want to avoid is mixing different types of annotations per dataset – like binary and manual annotations, or named entities and text classification labels. There’s no benefit to this and it’ll just make things harder because you won’t really be able to train from single datasets anymore.

If you have all the data, you can also merge the datasets into one and then train once, instead of updating the model with a dataset three times in a row.

Btw, dataset merging currently isn’t as smooth as we’d like it to be. There’s no out-of-the-box command for this at the moment – but you can always write a small script and access the datasets through the database methods in Python. For example:

from prodigy.components.db import connect

db = connect()  # uses the settings in your prodigy.json
datasets = ['ner_a1', 'ner_a2', 'ner_a3']  # names of your datasets

merged_examples = []
for dataset in datasets:
    examples = db.get_dataset(dataset)
    merged_examples += examples

db.add_dataset('ner_merged')  # whatever you want to call the new set
db.add_examples(merged_examples, datasets=['ner_merged'])
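
If the same annotation task might have been saved to more than one of the datasets, you may want to drop exact duplicates before adding the merged examples. Here is a minimal sketch in plain Python. The helper name merge_examples is hypothetical, and it assumes each example is a dict carrying the "_task_hash" key that Prodigy assigns to tasks. Examples without that key are kept as they are.

```python
def merge_examples(*example_lists, key="_task_hash"):
    """Concatenate lists of annotation dicts, skipping repeated hashes.

    The first occurrence of each hash wins. Examples that lack the
    key are always kept, since they cannot be compared.
    """
    seen = set()
    merged = []
    for examples in example_lists:
        for eg in examples:
            task_hash = eg.get(key)
            if task_hash is not None:
                if task_hash in seen:
                    continue  # same task already collected from another list
                seen.add(task_hash)
            merged.append(eg)
    return merged


if __name__ == "__main__":
    # Made-up examples for illustration only.
    a1 = [{"text": "foo", "_task_hash": 1, "answer": "accept"}]
    a2 = [{"text": "foo", "_task_hash": 1, "answer": "accept"},
          {"text": "bar", "_task_hash": 2, "answer": "reject"}]
    print(len(merge_examples(a1, a2)))
```

In the script above, this would replace the simple `merged_examples += examples` loop: collect the result of each db.get_dataset call and pass the lists to merge_examples.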

(Briggs Thompson) #3

Thank you very much for the fast response, Ines. You have cleared up what datasets represent conceptually.

I will spend some time on this and follow up with any further questions. I really appreciate your help!