Best Practices for Segmenting Text into Passages and Applying Multi-label Classification

Hello!

I am currently working with a large dataset of extensive text segments. My objective is two-fold:

  1. Segmentation: I want to allow annotators to manually segment these extensive text blocks into smaller "passages" or chunks.
  2. Classification: After segmentation, I want to apply multi-label, multi-class classification on each of these smaller passages.

Could you guide me on the best practices to achieve this workflow in Prodigy? Specifically:

  • Is there a recommended way to perform the manual segmentation in Prodigy, ensuring ease of use for the annotators?
  • Once the text is segmented, how can I set up a multi-label, multi-class classification task for each passage?

I appreciate any insights or references you can provide to streamline this process.

Thank you in advance for your assistance!

Best regards,
Yanir

hi @yanirmr!

The docs have some details on splitting longer docs into smaller ones:

You probably noticed that most of the examples on this page show short texts like sentences or paragraphs. For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data. Annotating with a model in the loop is also much faster if the texts aren’t too long, which is why recipes like ner.teach and ner.correct split sentences by default. NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can’t make a decision based on the local context, the model will struggle to learn from the data.

That said, there are always exceptions, and if you’re using the ner.manual workflow with whole documents, you can customize the UI theme to fit more text on the screen. For example:

{
    "custom_theme": {"cardMaxWidth": "95%", "smallText": 16}
}
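
That custom_theme block can go in your prodigy.json config file, or in the "config" dictionary returned by a custom recipe.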

You may find several older posts that are helpful too:

One other recommendation: if you break up your data, be sure to preserve metadata, for example {"text": "This is a sentence in the 2nd paragraph, page 0.", "meta": {"page": 0, "paragraph": 2}}. This is important because (1) it will be shown to the annotator and (2) more importantly, it'll allow you to aggregate your data if needed after annotation.
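
For instance, here's a minimal preprocessing sketch (field names like doc_id, page and paragraph are just placeholders for whatever metadata you have) that splits a long text on blank lines and writes one Prodigy task per passage:

import json

long_text = "First paragraph of the document.\n\nSecond paragraph, same page."

def split_into_passages(doc_text, doc_meta):
    # Naive split on blank lines; swap in your own segmentation logic
    # (sentences, headings, annotator-defined boundaries, etc.).
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        # Carry the original document metadata on every passage so you can
        # re-aggregate the annotations per document afterwards.
        yield {"text": paragraph, "meta": {**doc_meta, "paragraph": i}}

# One JSON task per line, ready to load as a Prodigy source file.
with open("passages.jsonl", "w", encoding="utf8") as f:
    for task in split_into_passages(long_text, {"doc_id": "doc-001", "page": 0}):
        f.write(json.dumps(task) + "\n")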

Also, if you're worried about breaking up your text but want to provide slightly more context (e.g., annotate at the sentence level but show the surrounding paragraph), I created this custom recipe:

textcat_sent_sequence

Since the paragraphs/sentences (docs) are in the order of the original document, it mimics how someone would read the document from beginning to end.

Multi-label, multi-class classification is a bit trickier, and there are several posts on this.

In general, we tend to recommend doing one label pass at a time. The docs mention these ideas:

If your annotation scheme is mutually exclusive (that is, texts receive exactly one label), you’ll often want to organize your labels into a hierarchy, grouping similar labels together. For instance, let’s say you’re working on a chat bot that supports 200 different intents. Choosing between all 200 intents will be very difficult, so you should do a first pass where you annotate much more general categories. You’d then take all the texts annotated for some general type, such as information, and set up a new annotation task to sort them into more specific subtypes. This lets the annotators study up on that part of the annotation scheme, so they can make more reliable decisions.
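
To make that concrete, here's a rough sketch of the two-pass idea (the dataset names, labels and file names are made up for the example). The first pass uses textcat.manual with a few coarse labels; with more than one label and no --exclusive flag, it shows a multiple-choice interface, so the same setup covers a plain multi-label, multi-class pass over your passages. Afterwards you can pull the examples that received a given coarse label out of the database and write them to a new source file for a more specific second pass:

import json
from prodigy.components.db import connect

# Pass 1, run from the command line (coarse, multi-label):
#   prodigy textcat.manual passages_general passages.jsonl --label INFORMATION,REQUEST,OTHER
# Selected labels end up in the "accept" list of each saved example.

# Pull everything tagged INFORMATION out of the dataset for a second,
# more fine-grained pass. Dataset and label names are examples only.
db = connect()
examples = db.get_dataset("passages_general")

with open("information_passages.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        if eg.get("answer") == "accept" and "INFORMATION" in eg.get("accept", []):
            f.write(json.dumps({"text": eg["text"], "meta": eg.get("meta", {})}) + "\n")

# Pass 2 (fine-grained subtypes of INFORMATION):
#   prodigy textcat.manual passages_information information_passages.jsonl --label INFO_ACCOUNT,INFO_BILLING,INFO_OTHER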

There are also a few support issues that provide examples:

But I recently posted a custom recipe that may do more of what you're looking for:

textcat-hierarchical

Hope this helps!
