Clarifying Questions Regarding Auto-labeling (Active Learning, ner.teach, Multiple Occurrences)


I just got my license today and am still learning to use Prodigy. Excited to dive deeper into it! I am building a custom NER pipeline on PDF documents, on which I have already performed OCR to extract the text. I have also converted these txt files into JSONL format, and was able to use ner.manual to get a taste of what it is like to label in Prodigy's UI. Beautifully made tool!

Specifically, I am working on NER for electric bill documents, and I want to extract entities such as customer name, energy used, unit of measurement, billing dates, etc. I am trying to label my data as efficiently as possible, and learned about using ner.teach to speed up the annotation process. However, I am running into a few problems:

  • Does Prodigy support automatic labeling of entities with multiple occurrences? (Sometimes a unit of measurement like kWh occurs many times within an electric bill, and I would not want to go through all of them and label them one by one.)
  • I realize you have to pass in a loadable spaCy model for ner.correct and ner.teach. How do I create a model that can recognize my custom entities? Do I first use ner.manual to work on a small subset of my dataset, save the model, and then use this model to run ner.correct and ner.teach on my larger dataset?
  • How do I match patterns for dates?

Thank you :slight_smile:

You can describe patterns that pre-fill entities in the NER recipes. Have you seen this section of the documentation? If you want to toy around with the patterns that are possible you may enjoy this online Matcher demo. You could, for example, use regexes here if you like.

If you want to match dates, you can either use regexes in a pattern file, which I might recommend for your use-case, or leverage a pre-trained spaCy model (like en_core_web_md) to detect them. You can explore the available entities in the interactive demo here.
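To make the regex option a bit more concrete, here is a minimal sketch using only Python's standard library. The date formats covered by `DATE_RE` and the helper name `find_dates` are assumptions for illustration, not anything Prodigy ships; you would adapt the expression to the formats that actually appear in your bills. The output uses character offsets in the "start"/"end"/"label" shape that span annotations take in JSONL tasks.

```python
import re

# Hypothetical month prefix: "Jan", "Jan.", "January", and so on.
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Assumed formats only: numeric dates and "Month day, year" dates.
DATE_RE = re.compile(
    r"\b(?:"
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"          # 01/05/2023 or 1-5-23
    r"|" + MONTHS + r"\s+\d{1,2},?\s+\d{4}"   # Feb 4, 2023 or February 25 2023
    r")\b"
)


def find_dates(text, label="DATE"):
    """Return character-offset spans for every date-like match in the text."""
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in DATE_RE.finditer(text)
    ]


text = "Billing period: 01/05/2023 to Feb 4, 2023. Due date: February 25, 2023."
spans = find_dates(text)
task = {"text": text, "spans": spans}

for span in spans:
    print(span["label"], repr(text[span["start"]:span["end"]]))
```

A regex like this will miss formats it was not written for, so it is worth checking it against a sample of real documents before relying on it for pre-annotation.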

For the ner.correct and ner.teach recipes you can either pick a pre-trained spaCy model (like en_core_web_md) or a spaCy model that you've trained yourself. That means that once you've annotated some examples and run prodigy train, you can re-use the model that you trained. You'll just point to a folder that contains the saved model (like path/to/model) instead of a downloaded one like en_core_web_md.

Hi @koaning - thank you very much for your response!

Yes I looked at the pattern-matching functionality, and I have this pattern {"label": "EnergyUsedUOM", "pattern": [{"lower": "kwh"}]} in my json file for efficiently labeling the unit of measurement of energy, because the only unit I see in my documents is kWh.
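For reference, a patterns file along these lines could be generated with a short Python sketch like the one below. The first entry is the pattern quoted above; the second entry, including the `BillingDate` label and its regex, is purely hypothetical and assumes a date such as 01/05/2023 ends up as a single token, which is worth verifying against your own text. The sketch writes one JSON object per line (JSONL).

```python
import json

patterns = [
    # Pattern from this thread: matches "kWh", "KWH", "kwh", ... as one token.
    {"label": "EnergyUsedUOM", "pattern": [{"lower": "kwh"}]},
    # Hypothetical example: a token-level regex for dates like 01/05/2023.
    # Assumes the tokenizer keeps such a date together as a single token.
    {
        "label": "BillingDate",
        "pattern": [{"TEXT": {"REGEX": r"^\d{1,2}/\d{1,2}/\d{4}$"}}],
    },
]


def to_jsonl(records):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


jsonl = to_jsonl(patterns)
print(jsonl, end="")
```

Saving the printed output as something like patterns.jsonl gives a file that can be passed to a recipe that accepts patterns. Since a token pattern is not tied to a position in the text, each token that fits it should be picked up, so repeated kWh mentions would not need a separate pattern each.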

I would then love to proceed to matching other patterns like dates. I tested en_core_web_lg and found it helpful for recognizing various date formats.

However, I can't use the en_core_web_lg model because it doesn't contain the other custom entities that I need, right? Is it possible, however, to remove the unwanted entities from this model but keep useful ones like DATE, so that I can use them for efficient labeling? Then I could just use ner.teach to teach my model to recognize the other entities, right?

I am still a little confused about ner.teach and ner.correct, honestly. It seems like you would use them on a small fraction of your dataset, mainly to teach your model to recognize patterns in entities. Once you have this model tuned, you would use it to label the bulk of your dataset, where it would quickly recognize the custom entities and make annotation much more efficient. Am I correct?

Thanks again!

However, I can't use the en_core_web_lg model because it doesn't contain other custom entities that I need, right?

You can choose to use en_core_web_lg together with patterns if you're interested in using the model to pre-annotate entities. If I recall correctly, the --label parameter will allow you to select only the entities that you're interested in.

Eventually, when you have enough examples, you may benefit from training your NER model based on your own training data. That would imply that, at some point, you'd replace the en_core_web_lg model with your own.

One thing to keep in mind is that ner.teach is a binary interface. It shows entities one by one, which should make it much easier for you to quickly answer "yes" or "no" on whether each one is correct. The ner.correct recipe, on the other hand, is a manual interface that lets you update the underlying model as you're annotating (via --update). That means you'll still need to click and drag manually, and you should only accept an example when every entity in the sentence has been correctly annotated, but you will have a model in the loop, which ner.manual doesn't have.

While ner.manual and ner.correct can be useful, it may be easier just to make your own "subset of interesting candidates" manually when you're starting out. I usually like to have an examples.jsonl file that contains all the items that could be annotated, but I use a Jupyter notebook to reduce it down to a useful-candidates.jsonl first. The reason I like to work this way is that I am incredibly flexible in a Jupyter notebook and I usually have some heuristics at my disposal. Name lists, regexes, and pre-trained ML models are all things I can use to reduce the original set down to a smaller one that should contain interesting candidates.

This is mostly a matter of preference, but you may enjoy this approach as well.
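As a rough illustration of that notebook step, here is a minimal sketch of filtering JSONL examples down to candidates with a simple heuristic. The keyword list and the function names are hypothetical, chosen for the electric bill use-case; in practice the heuristic could just as well be a name list or a pre-trained model.

```python
import json
import re

# Hypothetical heuristic: keep examples that mention bill-related keywords.
KEYWORDS = re.compile(r"\b(kwh|billing|due date|account)\b", re.IGNORECASE)


def is_candidate(example):
    """Return True if the example text looks worth annotating."""
    return bool(KEYWORDS.search(example.get("text", "")))


def filter_candidates(lines):
    """Parse JSONL lines and yield only the examples that pass the heuristic."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        if is_candidate(example):
            yield example


# Small in-memory stand-in for the contents of examples.jsonl.
raw_lines = [
    '{"text": "Energy used: 450 kWh"}',
    '{"text": "Thank you for being a valued customer."}',
    '{"text": "Billing period 01/05/2023 to 02/04/2023"}',
]

kept = list(filter_candidates(raw_lines))
for example in kept:
    print(json.dumps(example))
```

With real files, the same generator can be fed an open file handle for examples.jsonl, and each kept example can be written as one line of useful-candidates.jsonl.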