Best strategy for training an NER engine

I am working on building an NER engine and want to use my own data to train en_core_web_sm. I have several fairly large data files that I use for training, which is where Prodigy's nice annotation functionality comes in. However, I'm a bit unsure what the best approach is, which probably stems from a limited understanding of how the annotation process works.

My approach so far has been to create a single dataset and do a succession of ner.teach runs, one for each of my training data files and each entity type I'm interested in. At the very end, when all the annotations have been completed, I use ner.batch-train on my single dataset (containing the annotations from all the separate runs of ner.teach), create-meta and all that to create a loadable spaCy module. However, when doing the annotations, I can see that Prodigy often confronts me with examples that I have already dealt with many times before when annotating previous files, even though the annotations are being saved to the same dataset. Say I'm running ner.teach on my first training data file and I annotate ten times that ‘Disneyland’ is not a person: Prodigy will still confront me with examples asking whether ‘Disneyland’ is a person when I use ner.teach on the next training data file. It's a bit unclear to me how the annotation process works, but my impression is that the annotations are entered into the dataset independently of each other, i.e. there is no active learning element in the annotation / ner.teach process itself, and the model is only actually updated with the annotations once you call ner.batch-train. Is this right so far?

Assuming that it is, I suspect it would probably be more useful to run ner.batch-train and save a model to a loadable spaCy module each time I have gone through a training data file, and then use that loadable module as the basis for the next run of ner.teach, instead of re-using en_core_web_sm. My presumption here is that the loadable module I've just created will have learned from my annotations on the first training data file, and hence when I use that module as my basis it will not ask me again if ‘Disneyland’ is a person, but will instead focus on asking more interesting questions that build on that knowledge, so to speak. That would obviously lead to a fair bit of bookkeeping and take more time (since ner.batch-train takes a fair bit of time to run), in that I would then have to save each new ner.teach run in a separate dataset, for example (right?). But if the results warrant it, that would be a minor issue, obviously. Have I understood this correctly? I'm a bit worried that I've missed some fundamental element of how Prodigy works, so I would like to get clarification on this before I do a lot of potentially superfluous work.

Thanks for your detailed questions and for sharing your use case!

No, the active learning component is actually part of the teach process and uses the model you're loading and keeping in the loop, which is then updated with your annotations. The model in the loop learns and improves as you annotate – and when you're done, you can create an even more optimised version of the model you were training using ner.batch-train with more iterations.

I think the problem you're experiencing happens because you start off with a "fresh" model (e.g. the default en_core_web_sm) every time you start the Prodigy server. So on each annotation run, you start with a model that hasn't learned anything yet – which is why it keeps asking the same questions. When you start Prodigy and add to an existing dataset, the existing annotations are not used to pre-train the model. Prodigy only creates a unique hash for each input example and annotated example, to make sure you're not annotating the exact same example twice in the same dataset.

We did consider pre-training the model in ner.teach in an early version of Prodigy, but decided against this feature, because it would easily lead to weird and unexpected results. Instead, ner.batch-train lets you create trained artifacts of the model, using whichever configuration you like. So for your use case, the steps could look like this (see the example commands after the list):

  1. Start off with a "fresh" model and collect annotations for your dataset using ner.teach.
  2. Run ner.batch-train to train a model and ensure that it's learning what it's supposed to (it's actually really nice to have this in-between step – if it turns out your data is not suitable, or something else is wrong, you'll find out immediately and can make adjustments).
  3. Load the previously trained model into ner.teach, e.g. ner.teach my_set /path/to/model and annotate more examples for the same dataset. You don't need to package your model as a Python package – the model you load in can also be a path.
  4. Train a model again – starting with the "blank" base model, not your previously trained model! – and see if the result improves. Starting with a blank model that hasn't been trained on your examples is important, because you always want a clean state. (You also don't want to end up evaluating your model on examples it has already seen during training.)
  5. Repeat until you're happy with the results.
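
To make this more concrete, here's a rough sketch of what the commands could look like. The dataset name, file names and label are just placeholders, and the --output and --eval-split settings mirror the textcat.batch-train command further down this thread:
prodigy ner.teach my_ner_set en_core_web_sm my_data_1.txt --label PERSON
prodigy ner.batch-train my_ner_set en_core_web_sm --output ./ner-model-v1 --eval-split 0.2
prodigy ner.teach my_ner_set ./ner-model-v1 my_data_2.txt --label PERSON
prodigy ner.batch-train my_ner_set en_core_web_sm --output ./ner-model-v2 --eval-split 0.2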

Using the same dataset to store all annotations related to the same project is definitely a good strategy. It means that every time you train and evaluate your model, it will train and evaluate on the whole set, not just the annotations you collected in the last step. This also helps prevent the "catastrophic forgetting" problem. If you keep updating a previously trained model but only evaluate on the latest collected set, the results may look great on each training run. But you have no way of knowing whether your model still performs well on all examples – or whether it "forgot" everything it previously learned, and now only performs well on the latest set of annotations.

Aha, I read your reply to mean that the active learning is local to each run of ner.teach, i.e. within each run of ner.teach Prodigy actually does learn actively, so that if a run just goes on long enough Prodigy will have learned that ‘Disneyland’ isn't a person. But when I then save the annotations and start a new run of ner.teach, that run does not use what was learned in the previous run, unless I update the model and use that model as the basis for the new run.

As far as I can see from what you write, I need to do a hybrid of what I have been doing and what I was thinking about doing, i.e. use a single dataset to store all my annotations, but use the updated / trained model from the previous run as the basis for the next run of ner.teach (whether I store it as a loadable module or just use the model that resulted from the previous run doesn't matter, but creating a loadable module every time seems like overkill). I should then use ner.batch-train to check my results regularly, and when I'm done doing all the annotations I should compile everything into a loadable module.

(You don’t need to respond to this if I’ve understood you correctly - corrections would be appreciated though :slight_smile: ).

Assuming this should also apply to textcat.teach, I am following this workflow:

Starting with:
prodigy textcat.teach my_dataset en_core_web_sm data.txt --label MY_LABEL --seeds my_seeds_for_this_label.txt

Checking results and creating a model:
prodigy textcat.batch-train my_dataset en_vectors_web_lg --output my_model_01 --eval-split 0.2

Continuing annotation:
prodigy textcat.teach my_dataset my_model_01 data.txt --label MY_LABEL --seeds ./seeds/my_seed_for_this_label.txt

Unfortunately I am getting asked the same questions, with a score of 1 because they are perfect matches. I must be doing something wrong...

Are the same questions you're seeing based on the seed terms or the model, i.e. does it say via_seed in the bottom right corner? If so, this makes sense – the examples from the seeds are selected by simply matching against the text. So what it finds here will always be the same, no matter which model you're using and what it knows. If your model is already pre-trained enough, you could try to leave out the seed terms and just work with the model's predictions instead.
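
For example, once your model knows enough, you could simply drop the --seeds argument and let the model's predictions drive the example selection (using the same placeholder names as in your commands above):
prodigy textcat.teach my_dataset my_model_01 data.txt --label MY_LABEL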

Yes, I see via_seed.

Should I annotate everything in one session? It doesn't seem to ingest all the data up front, so I'm not sure whether I can continue in another session.

If I add other labels, would Prodigy use the data it already has in the dataset? My previous attempts created duplicate entries and the "total" shown in the UI increased.

@cenk I think there might currently be a bug in the --exclude logic (will be fixed in the next release) – otherwise, this would probably be the best solution. You could then set it to exclude all annotations that are already present in the current dataset.
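
Once that's fixed, the call could look roughly like this, with --exclude pointing to one or more datasets whose existing annotations should be skipped:
prodigy textcat.teach my_dataset my_model_01 data.txt --label MY_LABEL --exclude my_dataset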

How large are the data files you're loading in? Another option could be to split your data into smaller sets and annotate one set per session, at least until you have enough annotations based on the seed terms. This might also speed things up a little!
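
For example, a quick way to chunk the corpus is the standard split utility, and then loading one chunk per session (the file names here are just placeholders):
split -l 500 data.txt data_part_
prodigy textcat.teach my_dataset my_model_01 data_part_aa --label MY_LABEL --seeds ./seeds/my_seed_for_this_label.txt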

(Some background on duplicate detection btw: When a new annotation task comes in, Prodigy assigns two hashes based on its content: the input hash referring only to the input content, e.g. the text, and the task hash, which is based on both the input and the added labels, entities etc. The task hash is then used to determine whether two tasks are asking the same question.)
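
To illustrate with invented numbers: two tasks over the same text but with different labels would share an input hash, but get different task hashes – roughly like this, where _input_hash and _task_hash are the keys you'll see on exported examples:
{"text": "Disneyland was crowded", "label": "LABEL_A", "_input_hash": 1001, "_task_hash": 2001}
{"text": "Disneyland was crowded", "label": "LABEL_B", "_input_hash": 1001, "_task_hash": 2002}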

Every accept, reject or ignore decision you make in the app will result in one entry in the database and dataset. This is by design, so you'll have a record of every individual annotation.

What exactly do you mean by "use the data it already has in the dataset"?

I am working on a multi-label text classification project consisting of ~2500 documents. These are like forum posts, some of them short, some long. I will do NER as well and would like to have both combined in the same model.

When I think of a "dataset", I imagined Prodigy would ingest all of these at once and let me work on them. Since multiple labels can apply to each "record", would new annotations create new records behind the scenes (in the db) for each label?

Maybe I should work on the models separately and then combine them for training as a new set?

It would be great if you could give us some examples of continuous workflows. That could avoid duplicated effort and teach newbies like me how to use the product efficiently. :slight_smile:

Ah okay, I understand – maybe the "dataset" terminology is a bit confusing here, sorry about that. In Prodigy, a dataset is a set of annotations you create. (In very early prototypes, we called it a "project" btw – so if it's easier, you can also think of it this way.) So when you click accept, reject or ignore, the decision is stored in your dataset in the database. Each dataset should contain annotations specific to one task – e.g. text classification or NER. When you run a task-specific batch-train command, those annotations will be used to train the model.

When you load in your corpus for annotation, Prodigy will stream in examples as they come in and select the most relevant ones for annotation. By default, those are the examples that the model is most unsure about, i.e. the ones with a prediction closest to 50/50. To get over the "cold start problem", Prodigy will also mix in matches from your seed terms, to make sure you collect enough positive examples so that the model can start learning the right things.

If you're using Prodigy with the active learning component, this also means that the stream of examples will be different depending on the model you load in. If you don't want Prodigy to select examples for you and would rather label every single example in your corpus in order, you can use the mark recipe with --view-id classification (see here). Considering your corpus is fairly small, this is definitely possible as well.
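
A rough sketch of that, using the same dataset, source file and label placeholders as before:
prodigy mark my_dataset data.txt --label MY_LABEL --view-id classification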

Prodigy focuses on collecting annotations for one label at a time and doesn't expect labels to be mutually exclusive (unlike entities). So you could, for example, start a new annotation session for each label and use a label-specific list of seed terms with the same input corpus. This means that you'll see the same examples twice – but with different labels. It's fine to add them all to the same dataset, though. When you run textcat.batch-train, all labels in the set will be added to the model and trained at the same time.
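
As a sketch, the per-label sessions could look something like this – the labels, seed files and output name are placeholders, and the final command mirrors your earlier batch-train call:
prodigy textcat.teach my_dataset en_core_web_sm data.txt --label LABEL_A --seeds ./seeds/label_a_seeds.txt
prodigy textcat.teach my_dataset en_core_web_sm data.txt --label LABEL_B --seeds ./seeds/label_b_seeds.txt
prodigy textcat.batch-train my_dataset en_vectors_web_lg --output my_model_02 --eval-split 0.2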

Sure! At the moment, we have the text classification workflow, the end-to-end video tutorial on training an insults classifier and a video tutorial on training a new entity type. If you haven't seen those yet, the videos in particular should be quite helpful to get a better understanding of a complete Prodigy workflow.

If there's anything you want to see in particular, definitely let us know. We're always open to suggestions. It's also much easier to produce useful content based on what users want to see, because what we come up with ourselves might not always cover the most relevant use cases.