prelabel data using regex and how to use the active learning functionality and get the model

Rym · October 8, 2021, 2:40pm

Hello,

We are new to prodigy and we want to do a NER project using it.

We are trying to use Prodigy to build an annotation dataset with new custom labels. We already have a predefined regex pattern to label our data, so we want to use this regex to help us during the annotation phase. The pattern is something like "every group of words that comes after certain verbs". What's the simplest way to do this, assuming we want to use a pretrained transformer-based NER model?

Once the first step is done, we want to get the labelled dataset but also the NER model fine-tuned by Prodigy during annotation. How can we access and export the model?

Thanks

ines · October 12, 2021, 9:32am

Hi! If you already have a regex that works for you, an easy solution would be to use it to add "spans" to the JSON data in Prodigy's expected format (see "Annotation interfaces" in the Prodigy docs). Just make sure that the matches produced by your regex don't overlap and that they map to valid token boundaries.
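As a rough illustration of the first option, here is a minimal sketch that turns regex matches into "spans" with character offsets. The regex, the PRODUCT label and the helper names are made up for the example. The boundary check is only a simple approximation; to be sure the offsets line up with the tokenizer Prodigy will use, you could tokenize with spaCy and use `doc.char_span`, which returns None for offsets that don't align with tokens.

```python
import json
import re

# Hypothetical regex: capture up to two words that follow certain verbs.
PATTERN = re.compile(r"\b(?:bought|sold|ordered)\s+(?P<target>\w+(?:\s+\w+)?)")
LABEL = "PRODUCT"  # hypothetical custom label


def on_word_boundaries(text, start, end):
    """Approximate token-boundary check: no letter/digit directly around the span."""
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok


def prelabel(text):
    """Return a task dict with "text" and non-overlapping "spans"."""
    spans = []
    last_end = 0
    for match in PATTERN.finditer(text):
        start, end = match.span("target")
        if start < last_end:
            continue  # skip anything overlapping a span we already kept
        if not on_word_boundaries(text, start, end):
            continue  # skip matches that would cut a word in half
        spans.append({"start": start, "end": end, "label": LABEL})
        last_end = end
    return {"text": text, "spans": spans}


if __name__ == "__main__":
    # One JSON object per line (JSONL), ready to be saved to a file.
    for line in ["She ordered green tea yesterday.", "Nothing to label here."]:
        print(json.dumps(prelabel(line)))
```

Each printed line is one task; texts without a match simply get an empty "spans" list, so you can still annotate them by hand.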

Another option would be to convert your regex to patterns for spaCy's Matcher (which was designed to handle things like "words after certain verbs"). You can then provide the patterns file via the --patterns argument of the recipe (see "Loaders and Input Data" in the Prodigy docs).
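To give an idea of what such a patterns file could contain, here is a small sketch that builds one token-based pattern and prints it as JSONL. The verb list and the PRODUCT label are hypothetical. Two caveats: the match covers every token in the pattern, so the highlighted suggestion will include the verb itself and needs trimming during annotation, and attributes like LEMMA and POS are only available if the pipeline used for tokenization actually sets them.

```python
import json

VERBS = ["buy", "sell", "order"]  # hypothetical trigger verbs (lemmas)
LABEL = "PRODUCT"                 # hypothetical custom label


def build_patterns(verbs, label):
    """Build Matcher-style patterns: trigger verb, optional determiner, nouns."""
    return [
        {
            "label": label,
            "pattern": [
                {"LEMMA": {"IN": list(verbs)}},
                {"POS": "DET", "OP": "?"},
                {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"},
            ],
        }
    ]


def to_jsonl(rows):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    # Save this output as e.g. patterns.jsonl and pass it via --patterns.
    print(to_jsonl(build_patterns(VERBS, LABEL)))
```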

It often makes sense to collect at least some annotations manually to get a feel for the data. If you're confident that your regex produces good results, you could also use it to pre-label your data automatically and train from it directly. Once you have a model that predicts your custom labels, you can use a recipe like ner.correct with --update to correct its predictions and update the model in the loop at the same time. This way, you'll see its updated predictions as you annotate. Finally, once you've collected a bunch of data, you can batch train your final model on all annotations to get the best results.
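The workflow above could look roughly like the sequence of commands below. This is only a sketch that prints the commands rather than running them; the dataset names, file names, output directories and label are placeholders, and it assumes a Prodigy version (v1.11 or later) where ner.correct supports --update and train writes a model-best directory.

```python
import shlex

# All names below are placeholders for illustration.
PRELABELED = "ner_regex"        # dataset holding the regex pre-labelled examples
CORRECTED = "ner_corrected"     # dataset for manually corrected examples
LABEL = "PRODUCT"


def workflow_commands(prelabeled, corrected, label):
    """Return the workflow steps as argument lists, in order."""
    return [
        # 1. Import the regex pre-labelled JSONL into a dataset.
        ["prodigy", "db-in", prelabeled, "prelabeled.jsonl"],
        # 2. Train a first model directly from the pre-labelled data.
        ["prodigy", "train", "./model_v1", "--ner", prelabeled],
        # 3. Correct the model's predictions and update it in the loop.
        ["prodigy", "ner.correct", corrected, "./model_v1/model-best",
         "raw_texts.jsonl", "--label", label, "--update"],
        # 4. Batch train the final model on all collected annotations.
        ["prodigy", "train", "./model_final", "--ner", f"{prelabeled},{corrected}"],
    ]


if __name__ == "__main__":
    for command in workflow_commands(PRELABELED, CORRECTED, LABEL):
        print(shlex.join(command))
```

The trained pipeline ends up in the output directory passed to train, so it can be loaded from there like any other spaCy pipeline.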

Rym · October 14, 2021, 9:48am

Which recipe should be used for this step, assuming we want to use BERT embeddings to train our NER model?

ines · October 14, 2021, 11:18am

You can use ner.correct to view and manually correct a model's predictions: https://prodi.gy/docs/named-entity-recognition#manual-model

One thing to keep in mind with transformers, though, is that they're often slower and require larger batch updates. This means they may not be very sensitive to small updates, which makes them less effective as a model to update in the loop. So you might want to use a smaller CNN model in the loop and later train your transformer-based pipeline from all annotations on a separate machine with a GPU.
