We are new to Prodigy and want to use it for an NER project.
We are trying to use Prodigy to build an annotation dataset with new custom labels. We already have a predefined regex pattern to label our data, so we want to use this regex to help us during the annotation phase. The pattern is something like "every group of words that comes after certain verbs". What's the simplest way to do this, assuming we want to use a pretrained transformer-based NER model?
Once the first step is done, we want to get not only the labelled dataset but also the NER model that Prodigy fine-tunes during annotation. How can we access and export that model?
Hi! If you already have a regex that works for you, an easy solution would be to use it to add "spans" to the JSON data in Prodigy's expected format (see the "Annotation interfaces" page in the Prodigy docs). Just make sure that the matches produced by your regex don't overlap and that they map to valid token boundaries.
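As a rough illustration of the pre-labelling step, here is a minimal sketch that turns regex matches into "spans" with character offsets. The pattern, the verbs, the DEPENDENCY label and the add_regex_spans helper are all made up for the example, so swap in your own regex and label. It only uses the standard library and does not check token boundaries; one way to do that check, assuming you use spaCy for tokenization, is doc.char_span(start, end), which returns None when the offsets don't line up with tokens.

```python
import json
import re

# Hypothetical pattern: capture the words that follow certain verbs,
# up to the next punctuation mark. Replace with your own regex.
PATTERN = re.compile(r"\b(?:uses|requires)\s+(?P<obj>[^.,;]+)")
LABEL = "DEPENDENCY"  # hypothetical custom label


def add_regex_spans(text, pattern=PATTERN, label=LABEL):
    """Return a task dict with "text" and character-offset "spans"."""
    spans = []
    last_end = -1
    for match in pattern.finditer(text):
        start, end = match.span("obj")
        # Trim trailing whitespace so the span ends on the last word.
        while end > start and text[end - 1].isspace():
            end -= 1
        # Skip empty spans and anything overlapping the previous span.
        if start >= end or start < last_end:
            continue
        spans.append({"start": start, "end": end, "label": label})
        last_end = end
    return {"text": text, "spans": spans}


if __name__ == "__main__":
    texts = ["The service uses PostgreSQL, and it requires a GPU."]
    # One JSON object per line (JSONL).
    for text in texts:
        print(json.dumps(add_regex_spans(text)))
```

If you save the printed lines to a .jsonl file, you can load that file as the source of a manual annotation recipe to review the suggestions, or, if you trust the regex, import it into a dataset with prodigy db-in.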
It often makes sense to collect at least some annotations manually to get a feel for the data. If you're confident that your regex produces good results, you could also use it to pre-label your data automatically and train from it directly. After you have a model that predicts your custom labels, you can then use a recipe like ner.correct with --update to correct its predictions and update the model in the loop at the same time. This way, you'll see its updated predictions as you annotate. Finally, once you've collected a bunch of data, you can batch train your final model with all annotations to get the best results.
One thing to keep in mind with transformers, though, is that they're often slower and require larger batch updates. As a result, they may not respond much to very small updates, which makes them less effective as a model to update in the loop. So you might want to use a smaller CNN model in the loop and later train your transformer-based pipeline from all annotations on a separate machine with a GPU.