I'm interested in grabbing one of the pre-existing language models from spaCy and retraining it to better perform NER tasks on the texts I'm using. After going through documentation and watching some videos, I'm still struggling to understand how to go through the whole process. For example: Which format should the data be in (csv, jsonl, etc.)? Should the data just be the text, or should it include other information? And so on.
Is there an article or video that I've missed that has a step-by-step example of training an NER model? For context, I've watched the video on training an NER model for ingredient tags and also gone through a good chunk of the NER documentation.
Hi Vincent, thanks for the response! Yes, I did go over all that documentation, but the way you lay it out makes it even easier to understand. I guess my main confusion is with the data. Can the data come from a csv file on my computer, or does it have to be in another format? I noticed the examples use jsonl, but I have more experience using csv, so I was wondering if it's possible to use it. Also curious about the data itself... can it be raw text or do I have to do any preprocessing on it? Thanks again!
The details on file formats can also be found on our docs here:
That said, CSVs are fine, but JSONL files are somewhat preferred. The main thing that's important is that you have the right keys/column names. The ner.manual recipe, for example, needs a "text" column in order to render the annotation card.
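If your texts live in a CSV and you'd rather feed JSONL, a small conversion script is one way to get there. The sketch below assumes the CSV has a header row with a column named text; the function name csv_to_jsonl and the sample rows are made up for illustration and are not part of Prodigy.

```python
import csv
import io
import json


def csv_to_jsonl(csv_file, jsonl_file, text_column="text"):
    """Write one JSON object per line, each with a "text" key.

    csv_file and jsonl_file are open file objects. text_column is the
    (assumed) name of the CSV column that holds the raw text.
    Returns the number of records written.
    """
    reader = csv.DictReader(csv_file)
    written = 0
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # skip rows with nothing to annotate
        # ensure_ascii=False keeps accented characters readable in the file
        jsonl_file.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
        written += 1
    return written


# In-memory demo with hypothetical rows; the second row is empty and is skipped.
sample_csv = io.StringIO("id,text\n1,Vivo en Guadalajara.\n2,\n3,Nació en Oaxaca.\n")
output = io.StringIO()
written = csv_to_jsonl(sample_csv, output)
```

With real files you would open the CSV with open(path, newline="", encoding="utf-8") and the output with open(path, "w", encoding="utf-8"), then pass the resulting .jsonl path as the source argument.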
All the recipes that Prodigy provides will handle any of the required preprocessing. It'll only become a concern once you start writing custom recipes. In that case, though, you can check this section of the docs:
For each view_id that is used in a custom recipe, you can see what the expectations are in terms of input format.
Sorry, I still haven't managed to get it to work. I'm following the NER Manual recipe, and I'm inputting the following to start the annotations:
prodigy ner.manual es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/output_unescaped.jsonl --label LOC
I'm getting an error indicating that the source argument is required, which I thought was supposed to be the jsonl file with the text I want to annotate. Is the source meant to be something else? Also, if we don't select a database to store the annotations, it should be saving it to the .prodigy folder in the home directory, correct?
You forgot to provide a name for your dataset that you'd annotate to, e.g., ner_data. It's a positional argument that follows right after ner.manual.
$ python -m prodigy ner.manual --help
usage: prodigy ner.manual [-h] [-lo None] [-l None] [-pt None] [-e None] [-C] dataset spacy_model source
Mark spans by token. Requires only a tokenizer and no entity recognizer,
and doesn't do any active learning. If patterns are provided, their matches
are highlighted in the example, if available. The recipe will present
all examples in order, so even examples without matches are shown. If
character highlighting is enabled, no "tokens" are saved to the database.
dataset Dataset to save annotations to
spacy_model Loadable spaCy pipeline for tokenization or blank:lang (e.g. blank:en)
source Data to annotate (file path or '-' to read from standard input)
prodigy ner.manual ner_data es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/output_unescaped.jsonl --label LOC
Correct! By default, Prodigy sets PRODIGY_HOME to the ~/.prodigy folder, which is where the prodigy.json (global) config file lives (if you want to manually configure Prodigy), along with prodigy.db, which is a SQLite database by default.
You can confirm this location by running prodigy stats and viewing the Prodigy Home location.
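If you want to check the same location from a script, here is a rough sketch. It assumes that a PRODIGY_HOME environment variable, when set, takes precedence over the ~/.prodigy default; the helper name prodigy_home is hypothetical and this is not Prodigy's own lookup code, so treat prodigy stats as the authoritative answer.

```python
import os
from pathlib import Path


def prodigy_home(env=None):
    """Guess the Prodigy home folder.

    Assumption: PRODIGY_HOME (if set) overrides the default ~/.prodigy.
    env defaults to the real process environment; pass a dict to test.
    """
    env = os.environ if env is None else env
    custom = env.get("PRODIGY_HOME")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".prodigy"


home = prodigy_home()
# File names as mentioned above: the global config and the default SQLite DB.
config_path = home / "prodigy.json"
db_path = home / "prodigy.db"
print(f"home:   {home}")
print(f"config: {config_path} (exists: {config_path.exists()})")
print(f"db:     {db_path} (exists: {db_path.exists()})")
```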