Prodigy Questions

chielingyueh · July 19, 2022, 7:16pm

Hi!

Currently, we are deciding whether to use Prodigy as an annotation tool for our project.
We have some questions, which we could not figure out.

First, is it possible to import vocabularies/taxonomy in Prodigy?

Second, how large can the size of a snippet be? We would have documents that can be up to ~300 pages, would Prodigy be able to handle this?

Third, are there keyboard functionalities possible for doing annotations? If yes, what kind?

Fourth, is it possible to have multiple annotations on one “entity”, i.e., is it possible to have overlapping spans? An example: “0.5%” is annotated as Dose, but “0.5% in four drops” is annotated as Dose Comment.

Thank you!

ryanwesslen · July 19, 2022, 10:32pm

hi @chielingyueh!

Welcome to the Prodigy community

Thanks for your questions and your interest in Prodigy!

Probably! What is the format of your vocabulary or taxonomy? Can you provide a bit more detail about your use case?

Out of the box, Prodigy uses spaCy's pattern matching. This lets you start annotating with initial patterns, for example for named-entity recognition. It can drastically speed up annotation, because Prodigy presents the matched patterns to the user to accept, modify or reject. Here's an example that uses patterns for fashion brands:

[Screenshot: Prodigy suggesting pattern matches for fashion brands]
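To make the patterns idea a bit more concrete, here is a minimal sketch of building a patterns file in Python. The brand names, the `FASHION_BRAND` label and the `make_pattern` helper are all made up for illustration; the one-JSON-object-per-line layout with `label` and `pattern` keys follows the token-pattern format that spaCy's matcher uses.

```python
import json

# Hypothetical vocabulary; replace with terms from your own taxonomy.
brands = ["Nike", "Adidas", "Ralph Lauren"]


def make_pattern(label, phrase):
    """Build one token-based pattern: one dict per whitespace-separated word,
    matched case-insensitively via the lowercase form of the token."""
    return {"label": label, "pattern": [{"lower": word.lower()} for word in phrase.split()]}


patterns = [make_pattern("FASHION_BRAND", brand) for brand in brands]

# One JSON object per line (JSONL), ready to be saved as e.g. patterns.jsonl.
patterns_jsonl = "\n".join(json.dumps(pattern) for pattern in patterns)
print(patterns_jsonl)
```

Saved to a file, this could then be passed to a recipe with something like `prodigy ner.manual my_dataset blank:en ./data.jsonl --label FASHION_BRAND --patterns ./patterns.jsonl` (the dataset and file names here are placeholders).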

One related recipe you may enjoy is the terms.teach recipe. This is a nifty recipe that helps you build terminology lists (i.e., patterns or vocabularies). Here's a great tutorial video where we show how you can use this recipe to quickly build up a terminology list (i.e., vocabulary) for identifying ingredients in recipes (FYI, the video uses a custom recipe for creating multi-word terminologies):

By snippet, do you mean the text shown to users? Technically, it can be very long. But is there any reason why you wouldn't want to break up 300 pages into something a little smaller, like paragraph level?

By default, Prodigy will use sentence splitters to break up longer text. This can make breaking up long documents like yours much easier. That said, sentence splitting can be turned off if you do want to keep longer documents intact.

For example, this documentation provides a good background on handling longer text. Prodigy's flexibility enables customization of UI (e.g., custom CSS) where you can widen the card width like this:
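As a rough sketch (not an official snippet), the card width can be set through the theme settings in `prodigy.json`. This assumes the `custom_theme` setting with a `cardMaxWidth` value in pixels; the value 1200 is arbitrary, and it's worth checking the exact key names against the docs for your Prodigy version.

```python
import json

# Assumed settings: "custom_theme" / "cardMaxWidth" (width in pixels).
# The value 1200 is only an example.
config = {"custom_theme": {"cardMaxWidth": 1200}}

# This JSON would go into your prodigy.json (merged with any existing settings).
config_json = json.dumps(config, indent=2)
print(config_json)
```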

But from a user-perspective, we typically advocate for simpler, shorter tasks to reduce cognitive load.

If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
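Here is one way the paragraph-level chunking could look in plain Python. It is only a sketch: it assumes paragraphs are separated by blank lines, and the `doc_id` and `paragraph` fields in `meta` are names picked for illustration so each chunk can be traced back to its source document.

```python
import json
import re


def paragraphs_to_tasks(doc_id, text):
    """Split a document on blank lines and yield one annotation task per paragraph."""
    chunks = [part.strip() for part in re.split(r"\n\s*\n", text)]
    non_empty = (chunk for chunk in chunks if chunk)
    for index, chunk in enumerate(non_empty):
        # "text" holds the content to annotate; "meta" keeps the provenance.
        yield {"text": chunk, "meta": {"doc_id": doc_id, "paragraph": index}}


sample = (
    "First paragraph of the report.\n\n"
    "Second paragraph,\nwrapped over two lines.\n\n\n"
    "Third."
)
tasks = list(paragraphs_to_tasks("report-001", sample))

# One JSON object per line, ready to be written to a .jsonl source file.
for task in tasks:
    print(json.dumps(task))
```

Keeping the document ID and paragraph index in `meta` means the annotations can be stitched back together per document afterwards.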

By default, Prodigy keeps the interface simple and provides a variety of keyboard and swipe actions.

However, Prodigy's extensibility enables it to have custom keyboard shortcuts or even swipe actions on mobile devices.
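For illustration, custom shortcuts could be configured roughly like this. The sketch assumes a `keymap` setting in `prodigy.json` that maps an action name to a list of keys; the specific key choices below are arbitrary examples, so please check the configuration docs for the actions and key names your version supports.

```python
import json

# Assumed setting: "keymap" maps an action to the keys that trigger it.
# The key choices are examples only.
config = {
    "keymap": {
        "accept": ["enter"],
        "ignore": ["escape"],
    }
}

# This JSON would go into your prodigy.json (merged with any existing settings).
config_json = json.dumps(config, indent=2)
print(config_json)
```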

Yes! Check out span categorization. Prodigy lets you label overlapping and nested spans and create training data for models and components like spaCy’s SpanCategorizer.

Here are the differences between Named-Entity Recognition and Span Categorization:

Named Entity Recognition	Span Categorization
spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products)	spans are potentially overlapping units like noun phrases or sentence fragments
model predicts single token-based tags like B-PERSON with one tag per token	model predicts scores and labels for suggested spans
takes advantage of clear token boundaries	less sensitive to exact token boundaries
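To connect this to the "0.5%" example from the question, here is a small sketch of what overlapping spans look like as data. The sentence and the label names `DOSE` and `DOSE_COMMENT` are made up; the spans use character offsets (`start`, `end`) plus a `label`, and the `make_span` and `overlaps` helpers are just for this example.

```python
text = "Apply 0.5% in four drops to each eye."


def make_span(text, phrase, label):
    """Locate the first occurrence of phrase and return its character offsets."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


def overlaps(a, b):
    """True if the two character ranges share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


task = {
    "text": text,
    "spans": [
        make_span(text, "0.5%", "DOSE"),
        make_span(text, "0.5% in four drops", "DOSE_COMMENT"),
    ],
}

dose, dose_comment = task["spans"]
print(dose, dose_comment, overlaps(dose, dose_comment))
```

A named-entity setup would have to pick one of the two spans, while span categorization (for example via the spans.manual recipe) can keep both.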

Also, I would highly recommend watching the new training video by my teammate Edi on using spaCy + Prodigy for spancat (there's a blog post too!):

Let me know if you have any further questions! If you have any specifics, feel free to email us at contact@explosion.ai as well. Happy annotating!

chielingyueh · July 20, 2022, 1:47pm

Hi @ryanwesslen!

Thanks for the reply

We would like to use a taxonomy while annotating. Such a taxonomy would consist of synonyms and hierarchical structures of entities.
For example, tagtog allows you to import a dictionary (Dictionary TSV format · tagtog). We would like to know whether Prodigy could do something similar?

Thank you!

Chieling

ryanwesslen · July 20, 2022, 2:33pm

Thanks for your feedback! This is an interesting case. I'm not familiar with tagtog, but I'm glad you mentioned it so I can learn more!

Yes! I found these posts that are relevant:

It's possible to add nested entities or synonyms if you convert the .tsv file to a dictionary with the nested entities (you could do the same for the synonyms), like this:

# dictionary of lowercase entities mapped to subtypes
DRUG_SUBTYPES = {
    'citalopram': ['ANTIDEPRESSANT', 'SOMETHING_ELSE'],
    'lexapro': ['ANTIDEPRESSANT'],
    # etc.
}
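As a sketch of the conversion step itself, the snippet below reads a dictionary TSV into a lookup table. It assumes a column layout of ID, canonical name, then one or more synonyms (which matches the sample file in the gist further down, but may not match every tagtog export), and the `load_synonyms` helper is hypothetical.

```python
import csv
import io

# Sample rows in the assumed layout: id <TAB> name <TAB> synonym(s).
TSV_TEXT = (
    "1\tBulbasaur\tFushigidane\n"
    "2\tIvysaur\tFushigisou\n"
    "3\tVenusaur\tFushigibana\n"
)


def load_synonyms(handle):
    """Map every lowercase name or synonym to its canonical entry."""
    lookup = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        entry_id, canonical, *synonyms = row
        for term in [canonical, *synonyms]:
            lookup[term.lower()] = {"id": entry_id, "canonical": canonical}
    return lookup


# For a real file, pass open("your_dictionary.tsv", newline="") instead.
lookup = load_synonyms(io.StringIO(TSV_TEXT))
print(lookup["fushigisou"])
```

The same lookup can then be used to attach a canonical ID or subtype to whatever the model or the patterns pick up.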

Then you would follow the instructions to use spaCy to create a custom component for your modeling pipeline. I've posted a quick example of what it may look like:

https://gist.github.com/wesslen/25f8f694ce82934d74912f873785b7a1

pokemondict.tsv

1	Bulbasaur	Fushigidane
2	Ivysaur	Fushigisou
3	Venusaur	Fushigibana
4	Charmander	Hitokage
5	Charmeleon	Lizardo

spacy_synonym_subtype.py

# Assume we have an existing pattern matching rule-based entity (could also be a trained NER). This entity only identifies five different Pokemon characters as POKEMON.

from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "POKEMON", "pattern": [{"LOWER": "bulbasaur"}]},
            {"label": "POKEMON", "pattern": [{"LOWER": "ivysaur"}]},
            {"label": "POKEMON", "pattern": [{"LOWER": "venusaur"}]},
            {"label": "POKEMON", "pattern": [{"LOWER": "charmander"}]},

(The preview is truncated here; see the gist linked above for the full file.)
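Since the preview is cut off, here is a separate minimal sketch of what such a custom component might look like. It is not the code from the gist: the component name `subtype_lookup`, the `subtypes` extension attribute and the `DRUG` label are all names chosen for illustration, and it assumes spaCy v3.

```python
from spacy.lang.en import English
from spacy.language import Language
from spacy.tokens import Span

# Lowercase entity text mapped to its subtypes (same idea as the dictionary above).
DRUG_SUBTYPES = {
    "citalopram": ["ANTIDEPRESSANT", "SOMETHING_ELSE"],
    "lexapro": ["ANTIDEPRESSANT"],
}

# Register a custom attribute on spans; the name "subtypes" is an assumption.
if not Span.has_extension("subtypes"):
    Span.set_extension("subtypes", default=None)


@Language.component("subtype_lookup")
def subtype_lookup(doc):
    """Attach subtypes from the dictionary to every entity found upstream."""
    for ent in doc.ents:
        ent._.subtypes = DRUG_SUBTYPES.get(ent.text.lower(), [])
    return doc


nlp = English()
# A rule-based entity ruler stands in for a trained NER model here.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "DRUG", "pattern": [{"LOWER": name}]} for name in DRUG_SUBTYPES]
)
nlp.add_pipe("subtype_lookup", after="entity_ruler")

doc = nlp("The patient switched from Citalopram to Lexapro.")
results = [(ent.text, ent.label_, ent._.subtypes) for ent in doc.ents]
print(results)
```

The component only does a dictionary lookup after the entities have been set, so it works the same way whether the entities come from patterns or from a trained model.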

Two important points to think about. First, the model development isn't really done in Prodigy, but in spaCy. Prodigy is the UI tool for collecting more annotations, while spaCy is the NLP engine underneath. Prodigy does offer helpful training recipes, but these are really running spaCy. To get the greatest and quickest gains with Prodigy, it's helpful to learn more about spaCy. So it seems like this question is really "can spaCy do this?" rather than "can Prodigy do this?".

It is worth noting that you can use Prodigy with other NLP/python libraries like TensorFlow or PyTorch, but that will require even more customization on the developer's part.

Second, and related: what separates Prodigy from many other annotation tools is that it is a developer annotation tool. Prodigy is designed to be customized by your developers, who write their own Python scripts to fit their unique needs (e.g., through custom recipes or custom interfaces). My favorite video that captures this design philosophy is this excellent talk titled "Let Them Write Code" by Ines:

Thanks again for your questions! Let us know if you have any further questions.
