Pre-label data using regex, use the active learning functionality and get the model


We are new to Prodigy and want to use it for a NER project.

We are trying to use Prodigy to build an annotation dataset with new custom labels. We already have a predefined regex pattern to label our data, so we want to use this regex to help us during the annotation phase. The pattern is something like "every group of words that comes after certain verbs". What's the simplest way to do this, assuming we want to use a pretrained transformer-based NER model?

Once the first step is done, we want to get not only the labelled dataset but also the NER model fine-tuned by Prodigy during annotation. How can we access and export that model?


Hi! If you already have a regex that works for you, an easy solution would be to use it to add "spans" to the JSON data in Prodigy's expected format. Just make sure that the matches produced by your regex don't overlap and map to valid token boundaries.
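As a minimal sketch of that idea (the regex, label name and example text below are illustrative placeholders, not your actual pattern), each line of the resulting JSONL holds the text plus a list of character-offset spans:

```python
import json
import re

# Hypothetical pattern: capture the group of words after certain verbs.
PATTERN = re.compile(r"(?:requires|needs)\s+(\w+(?:\s+\w+)*)")
LABEL = "REQUIREMENT"  # placeholder label name

def prelabel(text):
    """Return a Prodigy-style task dict with regex matches as spans."""
    spans = []
    for match in PATTERN.finditer(text):
        # span of the captured group (the words after the verb)
        start, end = match.span(1)
        spans.append({"start": start, "end": end, "label": LABEL})
    return {"text": text, "spans": spans}

example = prelabel("The system requires secure login tokens.")
print(json.dumps(example))
```

You'd write one such JSON object per line to a `.jsonl` file and load it into Prodigy.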

Another option would be to convert your regex to patterns for spaCy's Matcher (which was designed to handle things like "words after certain verbs"). You can then provide the patterns file via the --patterns argument of the recipe:
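For instance, a patterns file could be generated like this. The label, trigger verbs and token shapes are assumptions for illustration; each line of the file is one token-based pattern in spaCy Matcher syntax:

```python
import json

# Hypothetical token-based pattern: a trigger verb followed by
# one or more alphabetic tokens ("words after certain verbs").
patterns = [
    {
        "label": "REQUIREMENT",  # placeholder label
        "pattern": [
            {"LOWER": {"IN": ["requires", "needs"]}},  # trigger verbs
            {"IS_ALPHA": True, "OP": "+"},             # following words
        ],
    },
]

# Write one pattern per line, as Prodigy's --patterns argument expects.
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```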

It often makes sense to collect at least some annotations manually to get a feel for the data. If you're confident that your regex produces good results, you could also use it to pre-label your data automatically and train from it directly. After you have a model that predicts your custom labels, you can then use a recipe like ner.correct with --update to correct its predictions and update the model in the loop at the same time. This way, you'll see its updated predictions as you annotate. Finally, once you've collected a bunch of data, you can batch train your final model on all annotations to achieve optimal results :slightly_smiling_face:
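A rough sketch of the first two steps on the command line (the dataset names and file paths are placeholders):

```shell
# Import the regex-pre-labelled JSONL (with "spans") into a dataset:
prodigy db-in ner_prelabelled ./prelabelled.jsonl

# Train an initial model from those annotations:
prodigy train ./initial_model --ner ner_prelabelled
```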

Which recipe should be used for this step, assuming we want to use BERT embeddings to learn our NER model?

You can use ner.correct to view and manually correct a model's predictions:
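For example (the dataset name, model path, source file and label are placeholders for your own setup):

```shell
# Stream in raw text, show the model's predictions for correction,
# and update the model in the loop as you annotate:
prodigy ner.correct ner_corrected ./initial_model/model-best ./raw_data.jsonl --label MY_LABEL --update
```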

One thing to keep in mind with transformers, though, is that they're often slower and require larger batch updates, so they might not be as sensitive to very small updates and may be less effective as a model to update in the loop. You might therefore want to use a smaller CNN model in the loop, and later train your transformer-based pipeline from all annotations on a separate machine with a GPU.
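One way to do that final step, sketched under the assumption that your Prodigy version provides data-to-spacy (the dataset name and paths are placeholders):

```shell
# Export all annotations to spaCy's binary format plus a generated config:
prodigy data-to-spacy ./corpus --ner ner_corrected

# On the GPU machine, train with spaCy directly, using a config
# that defines a transformer-based pipeline:
spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0
```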