Hi, that's nice to hear! Welcome to the Prodigy community!
Named entity recognition is especially powerful if you need to generalise based on examples of real-world objects and phrases in context. To achieve the best results, the category of things should be well-defined - for example, PERSON or CITY are useful categories, while CRIME_LOCATION or VICTIM would be very difficult to learn ("victim" is not a category of people and "crime location" isn't a category of location - it's all situational).
For some of the categories you describe, you might actually want to try a rule-based approach using spaCy's Matcher (see here for details), especially if the phrases you're looking for follow a consistent pattern. You might also want to explore predicting broader categories and then using other features like the dependency parse to extract the information you need. For example, you could train a category BANK, which would apply to "Bank of America", and then look for the syntactic parent (e.g. "debit card" or "account" etc.). See here for the visualized example:
I explain this approach in more detail in this thread. How you end up writing these rules obviously depends on your data, but I think you'll be able to achieve much better results this way than if you tried to predict fuzzy categories in one go.
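As a rough sketch of the "broader category plus dependency parse" idea, assuming you already have a model that predicts a BANK label and includes a parser: the helper below (entity_parents is a hypothetical name) pairs each matching entity with the token its root is attached to. It only relies on doc.ents, ent.label_, ent.root, ent.start, ent.end, token.head and token.i, and what the head turns out to be depends entirely on how the parser analyses your text.

```python
def entity_parents(doc, label="BANK"):
    """Pair each entity of the given label with its syntactic parent.

    `doc` is expected to be a parsed spaCy Doc. The "BANK" label is an
    assumption: it only exists if you trained a model to predict it.
    """
    pairs = []
    for ent in doc.ents:
        if ent.label_ != label:
            continue
        head = ent.root.head  # token the entity's root is attached to
        # A sentence root is its own head, so the head can fall inside the
        # entity itself. In that case there is no outside parent to report.
        if ent.start <= head.i < ent.end:
            continue
        pairs.append((ent.text, head.text))
    return pairs
```

With a loaded pipeline you would call this as entity_parents(nlp(text)) and then write your rules against the returned parents, e.g. checking whether the parent is "card" or "account".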
If you haven't seen it already, check out @honnibal's talk on how to define NLP problems and solve them through iteration. It shows some examples of using Prodigy, and discusses approaches for framing different kinds of problems and finding out whether something is an NER task or maybe a better fit for text classification, or a combination of statistical and rule-based systems.
You might also find this video helpful. It shows an end-to-end workflow of using Prodigy to train a new entity type from a handful of seed terms, all the way to a loadable spaCy model. It also shows how to use match patterns to quickly bootstrap more examples of relevant entity candidates:
The en_core_web_sm model is usually a good baseline model to start with: it's small, includes all the pre-trained NER categories, as well as the weights for the tagger and parser. Just keep in mind that if you do need some of the other pre-trained categories, you should always include examples of what the model previously got right when you train it. Otherwise, the model may overfit on the new data and "forget" what it previously knew.
If you don't need any of the other pre-trained capabilities, you can also start off with a blank model. In this example, the blank model is exported to /path/to/blank_en_model, which you can then use as the model argument in Prodigy.
import spacy

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('ner')) # add blank NER component
nlp.add_pipe(nlp.create_pipe('sentencizer')) # add sentence boundary detector, just in case
nlp.begin_training() # initialize weights
nlp.to_disk('/path/to/blank_en_model') # save out model
The ner.batch-train recipe supports passing in an --eval-id argument. This is the name of the evaluation dataset the model is evaluated against. (If no evaluation set is specified, Prodigy will hold back a certain percentage of your training data - but that's obviously a less reliable evaluation.)
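If you'd rather keep a dedicated evaluation set, one option is to split your raw texts once, up front, so the same examples never end up in both training and evaluation. The snippet below is only an illustration of that idea and not how Prodigy does its own hold-back; the function name, the 20% fraction and the fixed seed are all assumptions.

```python
import random


def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a fixed evaluation portion."""
    shuffled = list(examples)
    # A seeded generator keeps the split identical across runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Made-up texts, just to show the call.
train_texts, eval_texts = split_examples([f"text {i}" for i in range(10)])
```

You would then annotate the two portions into two separate Prodigy datasets and pass the name of the evaluation one as --eval-id.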
The evaluation dataset is a regular Prodigy dataset - so you could repeat step 3 and use ner.manual to label your evaluation data. If you already have a labelled set, you can convert it to Prodigy's JSON format and then use the db-in command to import the data.
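If your existing labels are stored as character offsets, the conversion could look roughly like this. The input shape (a list of (text, [(start, end, label), ...]) tuples), the function name and the example data are assumptions for illustration; the output uses the "text", "spans" (with "start", "end" and "label") and "answer" keys of Prodigy's NER tasks, written as newline-delimited JSON.

```python
import json


def to_prodigy_jsonl(examples, path):
    """Write (text, [(start, end, label), ...]) pairs as JSONL tasks."""
    with open(path, "w", encoding="utf8") as f:
        for text, annotations in examples:
            task = {
                "text": text,
                "spans": [
                    {"start": start, "end": end, "label": label}
                    for start, end, label in annotations
                ],
                # Mark the example as accepted, i.e. a correct annotation.
                "answer": "accept",
            }
            f.write(json.dumps(task) + "\n")


# Made-up example; character offsets are end-exclusive.
examples = [("I have an account with Bank of America", [(23, 38, "BANK")])]
```

After calling to_prodigy_jsonl(examples, "eval_data.jsonl"), the file can be imported with something like prodigy db-in your_evaluation_dataset eval_data.jsonl.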
The ner.batch-train recipe lets you define an --output argument, which is the directory the trained model will be exported to. This directory will be a loadable spaCy model, so in order to use and test it, you can pass the directory path to spacy.load. For example, let's say you run the following command to train the model:
prodigy ner.batch-train your_dataset en_core_web_sm --output /path/to/model --n-iter 10 --eval-id your_evaluation_dataset
You can then do this in spaCy:
import spacy

nlp = spacy.load('/path/to/model')
doc = nlp("This is some sentence with possible entities")
for ent in doc.ents:
    print(ent.text, ent.label_)
How you set up the REST API is up to you. In general, it's recommended to only load the model once, e.g. at the top level (and not on every request). I personally like using the library Hug (which also powers Prodigy's REST API btw). Here's an example:
import hug
import spacy

nlp = spacy.load('/path/to/model')  # load the model once, at the top level

@hug.post('/get_entities')
def get_entities(text):
    doc = nlp(text)
    ents = [{'text': ent.text, 'label': ent.label_} for ent in doc.ents]
    return {'ents': ents}
For inspiration, you might also want to check out the spacy-services repo, which includes the source for the microservices powering our demos and visualizers. If you like GraphQL, here's an experimental repo with a GraphQL API I built a while ago. (There's probably some room for improvement here, since I'm pretty new to GraphQL.)
Once you have a model that predicts something, you can start improving it by correcting its predictions. The most efficient way to do this is to use the ner.teach recipe, which will show you the predictions that the model is most uncertain about, and will ask you to accept or reject the suggestions. As you annotate, the model in the loop is updated and its predictions are adjusted.
prodigy ner.teach your_new_dataset /path/to/model your_data.jsonl --label YOUR_LABEL
The idea here is to find the best possible training data that has the highest impact. In most cases, what you care about is the model's accuracy overall, not just the accuracy on some very specific examples. It's tempting to focus on single examples, but it's often useful to take a step back and look at the bigger picture.
If you want to label without a model in the loop and create a gold-standard training set, i.e. one that contains the full correct parse of the text, you can also use the ner.make-gold recipe. It will stream in the model's predictions and let you edit them by hand. The idea here is that it's likely much more efficient and faster than doing everything from scratch, especially if the model gets a lot right already. If the model is correct 70% of the time, you only need to manually label and correct the remaining 30% (instead of doing 100% by hand).
If you annotate new data and want to update and improve the model, you do have to retrain it. This is usually a good thing, though, because it allows you to keep a clear separation between the individual model versions and the data the models were trained on, and it makes your experiments more repeatable. That's also why Prodigy generally encourages you to create separate datasets for every experiment you run. You should always be able to reproduce any given model state - otherwise, it becomes very difficult to reason about what's going on and how the annotations affect the predictions.
That said, there's still a lot you can automate! Prodigy is fully scriptable, so you could write a Python or Bash script that runs periodically, trains a model from one or more given datasets, outputs the model to a timestamped directory, writes out a file with all the config, compares the accuracy to the previous results and reports it back to you. If the model improved, you can then deploy it - if not, you can investigate why the new data caused a drop in accuracy (Did the model overfit on the new data? Does the dataset include conflicting annotations? Did you introduce a new concept that was difficult to learn? etc.)
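Here's a minimal sketch of what such a script could look like in Python. All function names, the models directory and the config.json file are made up for illustration; the command only uses the ner.batch-train arguments from the example earlier in this post. How you read the accuracy back is left open, since that depends on how you capture the training output.

```python
import json
import subprocess
from datetime import datetime
from pathlib import Path


def build_train_command(dataset, base_model, output_dir, eval_id, n_iter=10):
    """Assemble the ner.batch-train call as a list of arguments."""
    return [
        "prodigy", "ner.batch-train", dataset, base_model,
        "--output", str(output_dir),
        "--n-iter", str(n_iter),
        "--eval-id", eval_id,
    ]


def new_run_dir(root, now=None):
    """Create and return a timestamped directory for one training run."""
    now = now or datetime.now()
    run_dir = Path(root) / now.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def improved(new_accuracy, previous_accuracy):
    """True if there is no previous result or the new accuracy is higher."""
    return previous_accuracy is None or new_accuracy > previous_accuracy


def train_once(dataset, base_model, eval_id, root="models"):
    """Run one training job and keep its config next to the model."""
    run_dir = new_run_dir(root)
    cmd = build_train_command(dataset, base_model, run_dir / "model", eval_id)
    # Store the exact command so the run can be reproduced later.
    (run_dir / "config.json").write_text(json.dumps({"command": cmd}, indent=2))
    subprocess.run(cmd, check=True)  # raises if training fails
    return run_dir
```

A cron job or similar could call train_once periodically, compare the new accuracy with the stored one via improved, and only deploy the model directory when it returns True.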