Updating an NER model using the annotation tool

ModeEdna · June 2, 2023, 1:19am

I'm interested in grabbing one of the pre-existing language models from spaCy and retraining it to perform better on NER tasks for the texts I'm using. After going through the documentation and watching some videos, I'm still struggling to understand the whole process. For example: Which format should the data be in (csv, jsonl, etc.)? Should the data just be the text, or should it include other information? And so on.

Is there an article or video that I've missed that has a step-by-step example of training an NER model? For context, I've watched the video on training an NER model for ingredient tags and also gone through a good chunk of the NER documentation.

Any help is appreciated!

koaning · June 2, 2023, 9:21am

Have you read the getting started guide mentioned below?

In this guide, you'll use the blank:en spaCy pipeline, which contains only a tokeniser, to annotate some examples. It's this command:

Then, after you've added some annotations, you'll notice that the same guide has you run a train command. In this section:

With that train command you're almost there. You can check the documentation for the train recipe to see the available settings.

In particular, notice the --base-model setting. This allows you to pass in an existing pipeline as a starting point. So effectively, you'd need to change the train command from this:

python -m prodigy train --ner new_news_headlines

Into this:

python -m prodigy train --ner new_news_headlines --base-model en_core_web_md trained_model

This will now create a trained_model folder, which contains a spaCy pipeline that you can load via:

import spacy 

nlp = spacy.load("trained_model")
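
Once the pipeline loads, its predictions can be inspected through doc.ents. The following is only a minimal sketch: the helper names and the example sentence are made up for illustration, and depending on your Prodigy version the pipeline may sit in a sub-folder of the output directory, so adjust the path if loading fails.

```python
def extract_entities(nlp, text):
    """Return (entity text, label) pairs predicted by a loaded spaCy pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    # Imported lazily so the helper above can be reused and tested on its own.
    import spacy

    # "trained_model" is the output folder named in the train command above.
    # Assumption: the pipeline is loadable from that folder directly.
    nlp = spacy.load("trained_model")

    # Hypothetical example sentence, not data from this thread.
    for ent_text, label in extract_entities(nlp, "Apple opens a new office in Berlin"):
        print(label, ent_text)
```

Call main() after training has finished and the trained_model folder exists.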

Does this help?

ModeEdna · June 2, 2023, 1:14pm

Hi Vincent, thanks for the response! Yes, I did go over all that documentation, but the way you lay it out makes it even easier to understand. I guess my main confusion is with the data. Can the data come from a csv file on my computer, or does it have to be in another format? I noticed the examples use jsonl, but I have more experience using csv, so I was wondering if it's possible to use it. Also curious about the data itself... can it be raw text or do I have to do any preprocessing on it? Thanks again!

koaning · June 2, 2023, 1:28pm

The details on file formats can also be found on our docs here:

That said, CSVs are fine, but JSONL files are somewhat preferred. The main thing is that you have the right keys/column names. The ner.manual recipe, for example, needs a "text" column in order to render the annotation card.
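
If you'd rather convert an existing CSV to JSONL first, something along these lines should work. It is a rough sketch using only the standard library: the function name and the text_column parameter are invented for this example, and it assumes the CSV has a header row with the text in a single column.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path, text_column="text"):
    """Write one JSON object per CSV row, storing the text under the "text" key.

    Returns the number of records written. Rows with no text are skipped.
    """
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # nothing to annotate in this row
            # ensure_ascii=False keeps accented characters human-readable.
            dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, csv_to_jsonl("headlines.csv", "headlines.jsonl", text_column="headline") would read a hypothetical "headline" column and write it out under "text".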

All the recipes that Prodigy provides will handle any required preprocessing. It only becomes a concern once you start writing custom recipes. In that case, though, you can check this section of the docs:

For each view_id that is used in a custom recipe, you can see what the expectations are in terms of input format.

Is this sufficient? Feel free to ask for more details.

ModeEdna · June 5, 2023, 3:20pm

Sorry, I still haven't managed to get it to work. I'm following the NER Manual recipe, and I'm inputting the following to start the annotations:

prodigy ner.manual es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/output_unescaped.jsonl --label LOC

I'm getting an error indicating that the source argument is required, which I thought was supposed to be the jsonl file with the text I want to annotate. Is the source meant to be something else? Also, if we don't select a database to store the annotations, it should be saving it to the .prodigy folder in the home directory, correct?

Again, thanks for the patience and help!

ryanwesslen · June 5, 2023, 3:32pm

hi @ModeEdna!

You forgot to provide a name for the dataset that your annotations will be saved to, e.g., ner_data. It's a positional argument that follows right after ner.manual.

$ python -m prodigy ner.manual --help
usage: prodigy ner.manual [-h] [-lo None] [-l None] [-pt None] [-e None] [-C] dataset spacy_model source

    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning. If patterns are provided, their matches
    are highlighted in the example, if available. The recipe will present
    all examples in order, so even examples without matches are shown. If
    character highlighting is enabled, no "tokens" are saved to the database.
    

positional arguments:
  dataset               Dataset to save annotations to
  spacy_model           Loadable spaCy pipeline for tokenization or blank:lang (e.g. blank:en)
  source                Data to annotate (file path or '-' to read from standard input)

So try:

prodigy ner.manual ner_data es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/output_unescaped.jsonl --label LOC
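
If you launch Prodigy from a script, the argument order from the help text above can be made explicit. This is just an illustrative sketch: ner_manual_command is a made-up helper, not part of Prodigy, and it only assembles the command so you can print or run it.

```python
import shlex


def ner_manual_command(dataset, spacy_model, source, labels):
    """Build the ner.manual command with positionals in the documented order."""
    for name, value in (("dataset", dataset), ("spacy_model", spacy_model), ("source", source)):
        if not value:
            raise ValueError(f"missing required positional argument: {name}")
    return [
        "python", "-m", "prodigy", "ner.manual",
        dataset,      # dataset to save annotations to
        spacy_model,  # pipeline used for tokenization
        source,       # file with the texts to annotate
        "--label", ",".join(labels),
    ]


cmd = ner_manual_command("ner_data", "es_core_news_sm", "output_unescaped.jsonl", ["LOC"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) to start the annotation server.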

Correct! By default, Prodigy sets PRODIGY_HOME to the ~/.prodigy folder. That's where the global prodigy.json config file lives (if you want to configure Prodigy manually), along with prodigy.db, which is a SQLite database by default.

You can confirm this location by running prodigy stats and viewing the Prodigy Home location.
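
To work out the same location from Python, a small sketch like the one below may help. The function names are invented for this example, and it assumes the default setup described above: a PRODIGY_HOME environment variable that falls back to ~/.prodigy, and a SQLite file that has not been moved via the config.

```python
import os
from pathlib import Path


def prodigy_home(env=None):
    """Return the Prodigy home folder: PRODIGY_HOME if set, else ~/.prodigy."""
    env = os.environ if env is None else env
    custom = env.get("PRODIGY_HOME")
    return Path(custom) if custom else Path.home() / ".prodigy"


def default_paths(env=None):
    """Return the expected default locations of the config file and database."""
    home = prodigy_home(env)
    return {"config": home / "prodigy.json", "database": home / "prodigy.db"}


for name, path in default_paths().items():
    # exists() only reports whether the file is there; nothing is modified.
    print(f"{name}: {path} (exists: {path.exists()})")
```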

ModeEdna · June 5, 2023, 3:36pm

Awesome, it's working now! Thanks again! I really appreciate the quick replies.
