Updating an NER model using the annotation tool

I'm interested in grabbing one of the pre-existing language models from spaCy and retraining it to perform NER better on the texts I'm using. After going through the documentation and watching some videos, I'm still struggling to understand the whole process. For example: which format should the data be in (CSV, JSONL, etc.)? Should the data be just the text, or should it include other information? And so on.

Is there an article or video that I've missed that has a step-by-step example of training an NER model? For context, I've watched the video on training an NER model for ingredient tags and also gone through a good chunk of the NER documentation.

Any help is appreciated!

Have you read the getting started guide mentioned below?

In this guide, you'll be using the blank:en spaCy pipeline, which is a pipeline with just a tokenizer, to annotate some examples. It's this command:
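(Sketching it from memory, so treat the file path and labels as placeholders; the dataset name new_news_headlines lines up with the train command further down.)

python -m prodigy ner.manual new_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION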

Then, after you've added some annotations, you'll notice that the same guide will have you run a train command. In this section:

The thing with that train command is that you're almost there; you can check the documentation for the train recipe to review some of the settings.

In particular, notice the --base-model setting. This allows you to pass in an existing pipeline as a starting point. So effectively, you'd need to change the train command from this:

python -m prodigy train --ner new_news_headlines

Into this:

python -m prodigy train --ner new_news_headlines --base-model en_core_web_md trained_model

This will now create a trained_model folder, which contains a spaCy pipeline that you can load via:

import spacy 

nlp = spacy.load("trained_model")
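
From there, a quick sanity check could look like this (the sentence is just an example):

doc = nlp("Apple is opening an office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])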

Does this help?

Hi Vincent, thanks for the response! Yes, I did go over all that documentation, but the way you lay it out makes it even easier to understand. I guess my main confusion is with the data. Can the data come from a CSV file on my computer, or does it have to be in another format? I noticed the examples use JSONL, but I have more experience with CSV, so I was wondering if it's possible to use that instead. I'm also curious about the data itself... can it be raw text, or do I have to do any preprocessing on it? Thanks again!

The details on file formats can also be found on our docs here:

That said, CSVs are fine, but JSONL files are somewhat preferred. The main thing that matters is that you have the right keys/column names. The ner.manual recipe, for example, needs a "text" key (or column) in order to render the annotation card.
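
To make that concrete, here's the same (made-up) example in both formats; the file names are arbitrary:

examples.jsonl:

{"text": "Berlin is lovely in the summer."}

examples.csv:

text
Berlin is lovely in the summer.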

All the recipes that Prodigy provides will handle any of the required preprocessing; it only becomes a concern once you start writing custom recipes. In that case, though, you can check this section of the docs:

For each view_id that is used in a custom recipe, you can see what the expectations are in terms of input format.
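
For example, a pre-tokenized task for the ner_manual view would look roughly like this (a sketch from memory; check that docs section for the exact schema):

{"text": "Berlin shines", "tokens": [{"text": "Berlin", "start": 0, "end": 6, "id": 0}, {"text": "shines", "start": 7, "end": 13, "id": 1}]}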

Is this sufficient? Feel free to ask for more details.

Sorry, I still haven't managed to get it to work. I'm following the NER Manual recipe, and I'm inputting the following to start the annotations:

prodigy ner.manual es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/output_unescaped.jsonl --label LOC

I'm getting an error indicating that the source argument is required, which I thought was supposed to be the jsonl file with the text I want to annotate. Is the source meant to be something else? Also, if we don't select a database to store the annotations in, it should save them to the .prodigy folder in the home directory, correct?

Again, thanks for the patience and help!

hi @ModeEdna!

You forgot to provide a name for the dataset that you'd save your annotations to, e.g., ner_data. It's a positional argument that comes right after ner.manual.

$ python -m prodigy ner.manual --help
usage: prodigy ner.manual [-h] [-lo None] [-l None] [-pt None] [-e None] [-C] dataset spacy_model source

    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning. If patterns are provided, their matches
    are highlighted in the example, if available. The recipe will present
    all examples in order, so even examples without matches are shown. If
    character highlighting is enabled, no "tokens" are saved to the database.
    

positional arguments:
  dataset               Dataset to save annotations to
  spacy_model           Loadable spaCy pipeline for tokenization or blank:lang (e.g. blank:en)
  source                Data to annotate (file path or '-' to read from standard input)

So try:

prodigy ner.manual ner_data es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/output_unescaped.jsonl --label LOC

Correct! By default, Prodigy sets PRODIGY_HOME to the ~/.prodigy folder, which is where the prodigy.json (global) config file lives (if you want to manually configure Prodigy), along with prodigy.db, which is a SQLite database by default.

You can confirm this location by running prodigy stats and viewing the Prodigy Home location.
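
For example:

python -m prodigy stats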

Awesome, it's working now! Thanks again! I really appreciate the quick replies 🙂