Hi,
Can we train multiple entities at the same time, so that the model can catch different entities?
Sure! During annotation, you can always specify one or more labels via the --label argument – for example, --label PERSON,ORG. We usually recommend focusing on a smaller label set per session and running smaller experiments during development – but once you’re ready to train your final model for production, you should ideally merge your data and include all the labels you’ve annotated, so the model can learn them all at the same time.
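To make that concrete, a manual annotation session covering both labels could look roughly like this (just a sketch – adjust the recipe, dataset name, model and source path to your setup):
prodigy ner.manual my_dataset en_core_web_sm ./texts.jsonl --label PERSON,ORG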
Hello,
Can you help me with how to merge all our data into a single dataset?
Is there any efficient way?
Thanks
If you want to stay in Prodigy and train from within Prodigy, you could export all the data you want to use with db-out and then import it all into a new dataset using db-in.
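On the command line, that could look roughly like this (a sketch – check prodigy db-out --help and prodigy db-in --help for the exact arguments in your version):
prodigy db-out dataset_one > dataset_one.jsonl
prodigy db-out dataset_two > dataset_two.jsonl
prodigy db-in new_dataset dataset_one.jsonl
prodigy db-in new_dataset dataset_two.jsonl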
Or, more efficiently, you could just write a script. This also lets you add data from other sources, if needed. For example:
from prodigy.components.db import connect
db = connect()  # connect to the database, using your Prodigy settings
# Load all examples from the datasets you want to merge
examples1 = db.get_dataset('dataset_one')
examples2 = db.get_dataset('dataset_two')
# Create a new dataset and add the combined examples to it
db.add_dataset('new_dataset')
db.add_examples(examples1 + examples2, datasets=['new_dataset'])
If you want to train with a different library or just with spaCy directly, you can do something similar – get all the examples from the datasets you want to use, format them however you need and write them to a file. Each example also has an _input_hash that describes the original input text the annotation was collected on, so examples with the same input hash are annotations on the same text. If you want to merge the annotations manually, you can find examples with the same input and then merge the "spans" (for NER).
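As a rough sketch of what that manual merging could look like (assuming all examples were stored by Prodigy and have an _input_hash and a "spans" list, and that you still want to resolve overlapping spans yourself):
from collections import defaultdict
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset('dataset_one') + db.get_dataset('dataset_two')

# Group all annotations that were collected on the same input text
by_input = defaultdict(list)
for eg in examples:
    by_input[eg['_input_hash']].append(eg)

merged = []
for input_hash, egs in by_input.items():
    # Start from the first example and collect the spans of all the others
    base = dict(egs[0])
    seen = set()
    spans = []
    for eg in egs:
        for span in eg.get('spans', []):
            key = (span['start'], span['end'], span['label'])
            if key not in seen:  # skip exact duplicates
                seen.add(key)
                spans.append(span)
    base['spans'] = spans
    merged.append(base)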
Hi,
We added 7 entities in a dataset the way you suggested, but while doing review label tagging, only a few entities are coming up. The initially tagged entities are not showing up. What could be the reason for this?
Did you include examples of those entities in your training data as well? If you update an existing model with new categories, it's important to also "remind" the model of what it previously got right. You can do this by processing text with the existing model you want to update, selecting the entity spans you want to "keep" and including those in your training data when you update the model.
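As a rough sketch, collecting those "reminder" examples with the existing model could look something like this (assuming you trust the model's predictions for the old labels – ideally you'd still review them in Prodigy before training on them):
import spacy

nlp = spacy.load('/path/to/existing_model')   # the model you want to update
texts = load_your_texts()                     # hypothetical helper – use your own data here

reminder_examples = []
for doc in nlp.pipe(texts):
    # Keep the spans the model currently gets right, so it doesn't "forget" them
    spans = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
             for ent in doc.ents]
    reminder_examples.append({'text': doc.text, 'spans': spans, 'answer': 'accept'})
You can then write those examples to a JSONL file and add them to your dataset, alongside the annotations for the new labels.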
You can find more details and solutions if you search for "catastrophic forgetting":
https://support.prodi.gy/search?q=%22catastrophic%20forgetting%22
Thanks for the speedy reply.
We are still at the review-level tagging stage in Prodigy.
Initially we just did the skincare entity on a review dataset; now we have added 5 more entities (merged entity dataset). The skincare entity is not showing up on the review dataset. Does this come under the same problem?
Hi,
I posted a couple of messages a few days back, but did not receive any reply. My question relates to how to combine multiple entities – should it be a case of merging the annotations and training the dataset on the new annotations? Or is there some other way, given that we’ve already trained a separate model for each entity? The thread about merging annotations did not seem very clear to me, hence I had posted a message there.
Vatsala
Hi @vatsala
I'm sorry we couldn't get to your questions, but please understand that our time to reply is quite limited over weekends. We're happy to provide free licenses to some researchers, but they don't also come with 24/7 support!
To answer your question: it should be possible to have multiple NER models in a single pipeline in spaCy, although you have to make sure the entities don't overlap. Alternatively, you might find it helpful to use the ner.make-gold workflow to create annotations on your data for the entities you're interested in, based on the previous model. If the accuracy of the existing model is already good on your data, you should be able to build a useful training set quite quickly this way. You can then add the annotations for the new entity, PLANT, so that you can train on all of the entities together.
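If you do want to experiment with the multi-NER route, a rough sketch using the spaCy v2 API could look like the code below – the model paths and the plant_ner name are just placeholders, and combining components from separately trained pipelines can run into vocab/vector mismatches, so test it carefully:
import spacy

nlp = spacy.load('/path/to/title_model')        # pipeline with the first entity recognizer
plant_nlp = spacy.load('/path/to/plant_model')  # separately trained model, e.g. for PLANT

# Add the second model's entity recognizer under a different name,
# so both components run when you process a text
nlp.add_pipe(plant_nlp.get_pipe('ner'), name='plant_ner', last=True)

doc = nlp("Some text mentioning plants and titles.")
print([(ent.text, ent.label_) for ent in doc.ents])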
Hi Matthew,
Thanks for the reply, I am sorry if my message came across as persistent, that was not my intention. Thank you for providing the free license, please know that I am actually trying to evaluate the tool so that I can propose it to a group of historians who would like to extract information from textual sources. We also have a plan to purchase the tool, but before that I would like to complete the prototype I am working on and be able to demo it.
Regarding the multiple NER models, I have managed to train a model on a merged dataset (identifying plant names and titles), and all seems to work well with ner.print-stream in Prodigy. But when I load the model in spaCy, although it does identify the entities concerned, other entity labels such as MONEY and LANGUAGE are also coming up.
Is there any way round this or can this simply be ignored?
Thanks,
Vatsala
If the new categories are predicted with good accuracy, you can ignore the other labels that are predicted and/or just filter them out. For example:
# The labels you actually care about – adjust to your own label set
plant_labels = ['PLANT', 'SOME_OTHER_LABEL']
# Keep only the entities whose label is in that list
plant_ents = [ent for ent in doc.ents if ent.label_ in plant_labels]
print(plant_ents)
However, if you don't actually need them, you might get better results if you start with a blank entity recognizer and only train it on your new labels.
Here's an example of how you can create a model with a blank entity recognizer while keeping the existing tagger and parser. I've added some comments, so you can see what's going on.
import spacy
# Load the base model and replace the existing NER with a blank one
nlp = spacy.load('en_core_web_sm')
nlp.replace_pipe('ner', nlp.create_pipe('ner'))
# Initialize new weights for NER but not for tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    nlp.begin_training()
# Save model to disk
nlp.to_disk('/path/to/model')
If you don't care about the tagger and parser either, the code is simpler:
nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.begin_training()
nlp.to_disk('/path/to/model')
The /path/to/model directory will contain a loadable spaCy model, which you can also use in Prodigy. So if you run ner.batch-train, you can do something like this:
prodigy ner.batch-train your_dataset /path/to/model ...
Hi Ines,
Thanks very much for your reply. I’ll have a go at the blank entity recognizer.
Best,
Vatsala