Longer text highlighting in context

Hi! I’m a data analyst with a small team, and we’re looking for a tool to annotate text. I’ve just found Explosion and Prodigy and watched a couple of tutorials. It seems like a great product, but not exactly what we need. I was wondering if you could tell me whether I’m wrong about that, or maybe give me some directions.
What we need is a tool for highlighting (extracting) contiguous fragments of a document, each one or more sentences long. The decision to highlight or not could be binary, but it needs to be made by an annotator in the context of the whole document, and it is not necessarily a quick task.
The main goal is text-summarization (aside from document-level classification and recommendation).
Do you have any suggestions?

Hi! You could totally do that type of annotation with Prodigy by using the manual highlighting interface with the labels you need and then streaming in longer texts. See here for the demo: https://prodi.gy/demo?view_id=ner_manual
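As a rough sketch of what "streaming in longer texts" could look like on the input side: the snippet below writes one JSON object per line, each with a `"text"` key, which is the usual shape for a JSONL input file. The function name `to_jsonl` and the sample documents are made up for illustration, so treat this as an assumption about your setup rather than a required format.

```python
import json


def to_jsonl(documents):
    """Serialize each document as one JSON object per line, under a "text" key."""
    # json.dumps escapes newlines inside a document, so every record stays on one line.
    return "\n".join(json.dumps({"text": doc}, ensure_ascii=False) for doc in documents)


# Hypothetical sample documents, including one that spans several lines.
sample_documents = [
    "First document.\nIt has two lines.",
    "Second document, all on one line.",
]
jsonl_payload = to_jsonl(sample_documents)
```

You could then write `jsonl_payload` to a file and load that file as the source for the manual highlighting interface.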

At the end of it, you’ll get the texts and the character offsets of the highlighted spans, and can then use that to train your model or feed the annotations into some other process.
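To make the "texts and character offsets" part concrete, here is a minimal sketch of turning one annotated record back into the highlighted fragments. It assumes each record has a `"text"` string and a `"spans"` list whose entries carry `"start"`, `"end"` and `"label"`; the example record and the `RELEVANT` label are invented for illustration.

```python
def highlighted_fragments(task):
    """Return (label, fragment) pairs for every highlighted span in one record."""
    text = task["text"]
    fragments = []
    for span in task.get("spans", []):
        # Offsets are character positions into the full text; "end" is exclusive.
        fragments.append((span.get("label"), text[span["start"]:span["end"]]))
    return fragments


# Hypothetical annotated record: the second sentence was highlighted.
document = "Prodigy is an annotation tool. It streams in examples. Annotators highlight spans."
target = "It streams in examples."
start = document.index(target)
example_task = {
    "text": document,
    "spans": [{"start": start, "end": start + len(target), "label": "RELEVANT"}],
}
fragments = highlighted_fragments(example_task)
```

Because the offsets point into the original text, the same records can feed a training script or any other downstream process without re-tokenizing.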

Some context on why we typically recommend annotating in smaller units: for many machine learning tasks, labelling at the document level can be quite counterproductive. If an annotator needs the entire document context to make a decision, a model is much less likely to learn from the annotations, because it typically has a much narrower context window. Even if you’re doing long text classification, most model implementations average over individual sentences or smaller units. So annotating whole documents at once makes it harder for the annotator and often creates data that’s very difficult to train with. But of course, if you know what you’re doing and you have an existing workflow that makes sense, you can obviously still collect annotations that way using Prodigy :slightly_smiling_face:
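If you do want to try smaller units, one possible approach is to split each document into sentence-sized pieces while remembering where each piece came from, so decisions can be mapped back to the full document later. The sketch below uses a deliberately naive regex splitter (it will mis-split abbreviations like "e.g."), and the `doc_start` / `doc_end` field names are hypothetical; a proper sentence segmenter would be a better choice in practice.

```python
import re

# Naive rule: a unit runs from a non-space character up to the next ., ! or ?
# (or to the end of the text if no terminator follows).
_SENTENCE = re.compile(r"\S[^.!?]*[.!?]+|\S[^.!?]*$")


def sentence_units(doc_text):
    """Split a document into sentence-like units, keeping their document offsets."""
    units = []
    for match in _SENTENCE.finditer(doc_text):
        units.append({
            "text": match.group(),
            # Hypothetical field names for mapping a unit back to the source document.
            "meta": {"doc_start": match.start(), "doc_end": match.end()},
        })
    return units


sample_text = "First point. Second point! Third"
units = sentence_units(sample_text)
```

Each unit can then be annotated on its own with a binary decision, and the stored offsets let you reconstruct contiguous highlighted fragments in the original document.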

  1. So it is not a hassle to stream from a single document in a continuous manner?

  2. The scenario that I have in mind is that a human finds a bunch of interesting websites and PDFs online, reads them, and produces a summary. It would be OK to ask the human to highlight pieces of text that he deems especially relevant, but we don’t want to disturb his main task of producing the summary too much. Ideally, the annotations would be made more or less incidentally. Our lofty goal at this point is to create something like an intelligent document viewer that would learn to automatically highlight important phrases, and to harness the human work to build a dataset for text summarization.

It may be a lot to ask for :slight_smile: But maybe you have some suggestions.

It does sound like what you’re looking to do is outside of the core use-case of Prodigy, which is mostly used to train and evaluate classification models. I’ve sometimes found unexpected uses for Prodigy (e.g. I use it to help me study German), but these are pretty much happy accidents.

It sounds to me like you’ll probably need a custom tool. I think you probably want to create the annotations in a view that keeps close to the original presentation of the document. You can make a custom front-end for Prodigy if you still want to use the backend functionality, and you can even extend the REST API with new methods if your custom front-end requires it. So it’s possible you’d find Prodigy to be a useful starting point, even with a fairly custom app. On the other hand, it can also be helpful to start from a clean slate — it depends on the application.
