I'm using the NY Times sample and trying to go through your tutorial on NER. Prodigy stops at 55% (123 out of 123). Now what? How do I stop the classifier without killing the Python process? Can I now use the data?
Yes, after you're done annotating, you can hit the save button (or cmd+S) to make sure all state is saved, and then exit the Prodigy server. All annotations will then be added to your dataset, and you can use `ner.batch-train` to train a model with the annotations. The model you're training in the loop with `ner.teach` is mostly intended to help select the most relevant examples for annotation – but because it's only updated online, it's not as good as a model you'll train in batch over multiple iterations. So you can safely kill the process and discard the model in the loop, and then batch-train from the annotations.
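For example, a batch training run might look roughly like this – the dataset name and output path are placeholders, and the exact arguments may differ slightly depending on your Prodigy version:

```bash
# Train a fresh NER model in batch from the collected annotations
# "my_ner_dataset" is whatever you named your dataset
prodigy ner.batch-train my_ner_dataset en_core_web_sm --output /tmp/ner-model --n-iter 10 --eval-split 0.2
```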
The set of annotated examples we’ve provided is quite small, which is why you only got to 55%. You can always annotate more examples later on and add them to the same dataset – if you don’t have any data available, you can sign up for a free API key for one of the supported APIs.
OK, after spending an hour re-reading this forum and the readme file, I can see nowhere where you explain how to store a Live API key in the prodigy.json file. I have the NY Times key and I've tried every format I can think of. Your workflow page says "To get started, pick one of the supported APIs, sign up for a key and add it to your prodigy.json config file." How is this done? What's the dictionary name? Your docs are great if you already know how your system works. But your examples are ALL self-referential. You use terms you never define, such as "task", and then explain how to use the Live API in one. How about a complete WORKING example?
Sorry if this was unclear – you can find a full example of the `"api_keys"` setting in your `prodigy.json` on this page. Essentially, all you need to do is add an entry using the loader ID, i.e. `"nyt"`, mapped to the key. I'm also happy to add a more specific example of this to the `PRODIGY_README.html`.
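So with a New York Times key, your `prodigy.json` would include an entry along these lines (the key value is a placeholder for your own key):

```json
{
  "api_keys": {
    "nyt": "YOUR_NYT_API_KEY"
  }
}
```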
The Live APIs are mostly intended for testing purposes – e.g. to run your model over less predictable, real-world data (if you don't have any of your own at hand, or just quickly want to test something). In most cases, you'll probably prefer to load in your own data straight away. That's also why we've tried to structure the examples in a way that's not tied to the exact data we're using – ideally, users should be able to follow the same steps using their own data and a problem relevant to their use case.
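To give a rough idea of what loading your own data looks like, a command like the following should work, assuming a JSONL file with one `"text"` entry per line – the file path, dataset name and label here are just placeholders, and the exact arguments may differ slightly by version:

```bash
# Annotate entity predictions on your own data with a model in the loop
prodigy ner.teach my_dataset en_core_web_sm /path/to/your_data.jsonl --label PERSON
```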
This hasn't really come up before, but if it helps, we could add a "Prodigy glossary" to the website and `PRODIGY_README.html` that explains the most important terms?
Edit: I've started compiling a list of some terms and their specific definitions and updated the `PRODIGY_README.html` available for download, so you should be able to re-download it via your link. Also copy-pasting it here for reference, in case others come across this issue later. (Since it's copied over from the README, all links refer to sections of the HTML file.)
Term | Description |
---|---|
annotation task | A single question you're collecting feedback on from the annotator. For example, whether an entity is correct or whether a label applies to a text. Internally, annotation tasks are simple dictionaries containing the task properties like the text, the entity spans or the labels (see the example after this table). Annotation tasks are also often referred to as "(annotation) examples". |
annotation interface | The visual presentation of the annotation task. For example, text with highlighted entities, text with a category label, an image or a multiple-choice question. In the code, this is also often referred to as the `view_id`. See here for a list of available options. |
dataset | A named collection of annotated tasks. A new dataset is usually created for each project or experiment. The data can be exported or used to train a model later on. |
session | A single annotation session, from starting the Prodigy server to exiting it. You can start multiple sessions that add data to the same dataset. The annotations of each session will also be stored as a separate dataset, named after the timestamp. This lets you inspect or delete individual sessions. |
database | The storage backend used to save your datasets. Prodigy currently supports SQLite (default), PostgreSQL and MySQL out-of-the-box, but also lets you integrate custom solutions. |
recipe | A Python function that can be executed from the command line and starts the Prodigy server for a specific task – for example, correcting entity predictions or annotating text classification labels. Prodigy comes with a range of built-in recipes, but also allows you to write your own. |
stream | An iterable of annotation tasks, e.g. a generator that yields dictionaries. When you load in your data from a file or an API, Prodigy will convert it to a stream. Streams can be annotated in order, or be filtered and reordered to only show the most relevant examples. |
loader | A function that loads data and returns a stream of annotation tasks. Prodigy comes with built-in loaders for the most common file types and a selection of live APIs, but you can also create your own functions. |
sorter | A function that takes a stream of (score, example) tuples and yields the examples in a different order, based on the score. For example, to prefer uncertain or high scores. Prodigy comes with several built-in sorters that are used in the active learning-powered recipes. |
spaCy model | One of the available pre-trained statistical language models for spaCy. Models can be installed as Python packages and are available in different sizes and for different languages. They can be used as the basis for training your own model with Prodigy. |
active learning | Using the model to select examples for annotation based on the current state of the model. In Prodigy, the selection is usually based on the examples the model is most uncertain about, i.e. the ones with a prediction closest to 50/50. |
batch training | Training a new model from a dataset of collected annotations. Using larger batches of data and multiple iterations usually leads to better results than just updating the model in the loop. This is why you usually want to collect annotations first, and then use them to batch train a model from scratch. |
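For reference, here's a minimal sketch of what an annotation task dictionary for NER could look like once it's been annotated – the text and entity are made-up examples, and the `"answer"` property is added when you accept or reject the task:

```json
{
  "text": "Apple is looking at buying a U.K. startup",
  "spans": [
    {"start": 0, "end": 5, "label": "ORG"}
  ],
  "answer": "accept"
}
```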