Prodigy Questions

Hi!

Currently, we are deciding whether to use Prodigy as an annotation tool for our project.
We have some questions that we could not answer ourselves.

First, is it possible to import vocabularies/taxonomy in Prodigy?

Second, how large can a snippet be? Our documents can be up to ~300 pages. Would Prodigy be able to handle this?

Third, are keyboard shortcuts available for annotating? If yes, what kind?

Fourth, is it possible to have multiple annotations on one “entity”, i.e., is it possible to have overlapping spans? An example: “0.5%” is annotated as Dose, but “0.5% in four drops” is annotated as Dose Comment.

Thank you!

hi @chielingyueh!

Welcome to the Prodigy community :slight_smile:

Thanks for your questions and your interest in Prodigy!

Probably! What is the format of your vocabulary or taxonomy? Can you provide a bit more detail about your use case?

Out of the box, Prodigy uses spaCy's pattern matching. This lets you start annotating from an initial set of patterns, for example for Named-Entity Recognition. It can speed up annotation considerably, because Prodigy presents each pattern match to the annotator to accept, modify or reject. Here's an example using patterns for fashion brands:

[Screenshot: pattern matches for fashion brands in the Prodigy interface]
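As a rough sketch of how a vocabulary could be turned into a patterns file: the snippet below converts a list of terms into token patterns, one JSON object per line. The brand list, the `FASHION_BRAND` label and the `term_to_pattern` helper are made up for illustration, and splitting on whitespace is a simplification that may not match spaCy's tokenization for terms with punctuation or hyphens.

```python
import json


def term_to_pattern(term, label):
    """Turn one vocabulary term into a case-insensitive token pattern."""
    tokens = [{"lower": word.lower()} for word in term.split()]
    return {"label": label, "pattern": tokens}


# Hypothetical vocabulary and label, for illustration only
brands = ["Ralph Lauren", "Gucci", "Calvin Klein"]
patterns = [term_to_pattern(brand, "FASHION_BRAND") for brand in brands]

# One JSON object per line (JSONL); save this text as e.g. patterns.jsonl
patterns_jsonl = "\n".join(json.dumps(pattern) for pattern in patterns)
print(patterns_jsonl)
```

A file like this could then be passed to a recipe that accepts patterns, for example `prodigy ner.manual my_dataset blank:en ./data.jsonl --label FASHION_BRAND --patterns ./patterns.jsonl` (dataset and file names are placeholders).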

One related recipe you may enjoy is the terms.teach recipe. It's a nifty recipe that helps you build terminology lists (i.e., patterns or vocabularies). Here's a great tutorial video where we show how you can use this recipe to quickly build up a terminology list (i.e., vocabulary) for identifying ingredients in recipes (fyi, the video uses a custom recipe for creating multi-word terminologies):

By snippet, do you mean the text shown to users? Technically, it can be very long. But is there any reason why you wouldn't want to break up 300 pages into something a little smaller, like paragraph level?

By default, Prodigy will use a sentence splitter to break up longer text. This can make breaking up long documents much easier for you. Sentence splitting can also be turned off if you do want to keep longer texts together.

For example, this documentation provides good background on handling longer text. Prodigy's flexibility also lets you customize the UI (e.g., with custom CSS), so you can widen the card width like this:

But from a user-perspective, we typically advocate for simpler, shorter tasks to reduce cognitive load.

If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
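A minimal sketch of what that pre-processing could look like, assuming the document is already available as plain text with blank lines between paragraphs. The `paragraphs_to_tasks` helper, the `doc_id` value and the `meta` field names are made up for illustration; the idea is simply one task per paragraph, with enough metadata to trace it back to its source.

```python
import json
import re


def paragraphs_to_tasks(document, doc_id):
    """Yield one task per non-empty paragraph, keeping its origin in "meta"."""
    index = 0
    for paragraph in re.split(r"\n\s*\n", document):
        text = " ".join(paragraph.split())  # collapse line wraps and extra spaces
        if not text:
            continue  # skip empty chunks, e.g. trailing blank lines
        yield {"text": text, "meta": {"doc_id": doc_id, "paragraph": index}}
        index += 1


# Hypothetical sample document, for illustration only
sample = "First paragraph about dosage.\n\nSecond paragraph,\nwrapped over two lines.\n\n\n"
tasks = list(paragraphs_to_tasks(sample, doc_id="report-001"))

# One JSON object per line, ready to be saved as a .jsonl input file
for task in tasks:
    print(json.dumps(task))
```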

By default, Prodigy keeps a simple interface that comes with a variety of default keyboard and swipe actions.

However, Prodigy's extensibility enables it to have custom keyboard shortcuts or even swipe actions on mobile devices.

Yes! Check out span categorization. Prodigy lets you label overlapping and nested spans and create training data for models and components like spaCy’s SpanCategorizer.
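To make the Dose / Dose Comment example concrete, here is a sketch of how two overlapping spans could be represented on one text using character offsets. The sentence, the `DOSE` and `DOSE_COMMENT` label names and the helper functions are illustrative assumptions, and only character offsets are shown (any token indices the annotation tool adds are left out).

```python
def make_span(text, phrase, label):
    """Build a span dict from the first occurrence of phrase in text."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


def overlaps(a, b):
    """Return True if two spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


# Hypothetical example sentence and label names
text = "Apply 0.5% in four drops to each eye."
task = {
    "text": text,
    "spans": [
        make_span(text, "0.5%", "DOSE"),
        make_span(text, "0.5% in four drops", "DOSE_COMMENT"),
    ],
}

dose, dose_comment = task["spans"]
print(dose, dose_comment, overlaps(dose, dose_comment))
```

Both spans start at the same character, which a single-tag-per-token NER scheme cannot express but a span categorization setup can.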

Here are the differences between Named-Entity Recognition and Span Categorization:

| Named Entity Recognition | Span Categorization |
| --- | --- |
| spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products) | spans are potentially overlapping units like noun phrases or sentence fragments |
| model predicts single token-based tags like B-PERSON with one tag per token | model predicts scores and labels for suggested spans |
| takes advantage of clear token boundaries | less sensitive to exact token boundaries |

Also, I would highly recommend watching my teammate Edi's new training video on using spaCy + Prodigy for spancat (or the blog post too!):

Let me know if you have any further questions! If you have any specifics, feel free to email us at contact@explosion.ai as well. Happy annotating!

Hi @ryanwesslen!

Thanks for the reply :blush:

We would like to use a taxonomy while annotating. Such a taxonomy would consist of synonyms and hierarchical structures of entities.
For example, tagtog allows you to import a dictionary (Dictionary TSV format · tagtog). We would like to know whether Prodigy could do something similar?

Thank you!

Chieling

Thanks for your feedback! This is an interesting case. I'm not familiar with tagtog, but I'm glad you mentioned it so I can learn more!

Yes! I found these posts that are relevant:

It's possible to add nested entities or synonyms if you convert the .tsv file to a dictionary that maps each entity to its nested entities (you could do the same for the synonyms), like this:

# dictionary of lowercase entities mapped to subtypes
DRUG_SUBTYPES = {
    'citalopram': ['ANTIDEPRESSANT', 'SOMETHING_ELSE'],
    'lexapro': ['ANTIDEPRESSANT'],
    # etc.
}
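
One way such a dictionary could be built from a TSV file, as a minimal sketch. The two-column layout used here (a term, then a comma-separated list of subtypes) is an assumption for illustration and not tagtog's actual dictionary format; `load_subtypes` is a hypothetical helper.

```python
import csv
import io


def load_subtypes(tsv_file):
    """Map each lowercased term to its list of subtypes."""
    subtypes = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        term = row[0].strip().lower()
        labels = [label.strip() for label in row[1].split(",") if label.strip()]
        subtypes.setdefault(term, []).extend(labels)
    return subtypes


# Hypothetical sample data; a real file would be opened with open(path, newline="")
sample = io.StringIO(
    "Citalopram\tANTIDEPRESSANT,SOMETHING_ELSE\n"
    "Lexapro\tANTIDEPRESSANT\n"
)
DRUG_SUBTYPES = load_subtypes(sample)
print(DRUG_SUBTYPES)
```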

Then you would follow the instructions to use spaCy to add a custom component to your pipeline. I've posted a quick example of what it may look like:
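
A minimal sketch of such a component, assuming spaCy v3. The `subtypes` extension attribute, the `drug_subtypes` component name and the `DRUG` label are names chosen for this illustration, and an entity ruler stands in for a trained NER model so the example runs without downloading anything.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

# Dictionary of lowercase entities mapped to subtypes
DRUG_SUBTYPES = {
    "citalopram": ["ANTIDEPRESSANT", "SOMETHING_ELSE"],
    "lexapro": ["ANTIDEPRESSANT"],
}

# Custom attribute, available as ent._.subtypes (name is an assumption)
Span.set_extension("subtypes", default=None, force=True)


@Language.component("drug_subtypes")
def drug_subtypes(doc):
    """Look up subtypes for every DRUG entity already set on the doc."""
    for ent in doc.ents:
        if ent.label_ == "DRUG":
            ent._.subtypes = DRUG_SUBTYPES.get(ent.text.lower(), [])
    return doc


nlp = spacy.blank("en")
# Stand-in for a trained NER component: match known drug names by rule
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "DRUG", "pattern": [{"LOWER": term}]} for term in DRUG_SUBTYPES]
)
nlp.add_pipe("drug_subtypes", last=True)

doc = nlp("She was prescribed Lexapro last year.")
results = [(ent.text, ent.label_, ent._.subtypes) for ent in doc.ents]
print(results)
```

In a real pipeline the component would be added after the trained `ner` component instead of the ruler, and the lookup could be extended to synonyms in the same way.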

Two important points to think about. First, the model development isn't really Prodigy, but spaCy. Prodigy is the UI tool to get more annotations while spaCy is the NLP engine underneath. Prodigy does offer helpful training recipes but these are really running spaCy. To get the greatest/quickest gains with Prodigy, it's helpful to learn more about spaCy. Therefore, it seems like this question is really "can spaCy do this?" rather than "can Prodigy do this?".

It is worth noting that you can use Prodigy with other NLP/python libraries like TensorFlow or PyTorch, but that will require even more customization on the developer's part.

Relatedly, what separates Prodigy from many other annotation tools is that it is a developer annotation tool. Prodigy is designed to be customized: your developers can write their own Python scripts to fit their unique needs (e.g., through custom recipes or custom interfaces). My favorite video that captures this design philosophy is this excellent talk titled "Let Them Write Code" by Ines:

Thanks again for your questions! Let us know if you have any further questions.