Allow annotators to continue annotating later

We want to use Prodigy to label spans in long, complex text documents. For this task, it may be necessary for annotators to save intermediate results and continue working on them later on. Does Prodigy support this?

I have already set up Prodigy locally with multiple sessions, but unfortunately I didn't find a way to save intermediate results...

hi @cherrywoods,

Thanks for your question and welcome to the Prodigy community :wave:

What do you mean by "intermediate results"?

I'm a bit confused by your question because, given the same session, the same input (source), and the same Prodigy dataset, Prodigy should pick up exactly where you ended your last session.

For example, let's say you have 100 documents. You start a Prodigy server, annotate 25 of those documents to a specific dataset, and save those 25 annotations to the database. Then you stop the server to examine those results. If you restart the Prodigy server (using the same session and source, and pointing to the same dataset), then yes, Prodigy should start back at the 26th document.

Perhaps it's worth mentioning that a lot of Prodigy's built-in recipes have a handy --exclude argument, which takes a comma-separated list of dataset IDs whose annotations should be excluded. This can be helpful if you save results into one dataset -- let's say dataset_round1 -- and then want to start a second round using the same input (source) file, saving the new results into a new dataset, say dataset_round2. If you pass the first dataset as --exclude dataset_round1, Prodigy should skip the examples you already annotated.
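For illustration, a second-round command could look something like this -- the recipe, label names and file path here are just placeholders for whatever you're actually running:

```bash
# Hypothetical round 2: same source file, new dataset, skipping anything
# already annotated in dataset_round1
prodigy spans.manual dataset_round2 blank:en ./documents.jsonl \
  --label LABEL_A,LABEL_B \
  --exclude dataset_round1
```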

Can you provide more context (e.g., a reproducible example) if you're seeing something different?

Hi, thanks for the warm welcome.
I definitely should have elaborated more on what I mean by "intermediate results". The point is, the documents we are annotating (or at least some of them) are so long that an annotator may need several hours to go over one document. Since it is suboptimal to require that the annotation of a single document happens in one uninterrupted sitting (no page reload, but also no save), I was wondering whether it is possible to annotate part of one document, save that intermediate result, and come back to it later to complete the annotation.

Of course, splitting the long documents into several shorter parts would solve this problem, but unfortunately, splitting the documents without breaking up spans that we would want to annotate across several pieces is non-trivial...

hi @cherrywoods,

Thanks for that background. That makes a lot of sense.

Even if you broke the documents down by paragraph or sentence -- could you not load them in order, so that the annotator reads them roughly the way they would read the full document? That way they can save their progress after each chunk and pick up where they left off.

For example, I wrote a previous post where I showed an example of going through the Prodigy docs paragraph by paragraph -- you can ignore the part on highlighting by sentence but the same idea holds.
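Just as a sketch of that idea (not an official recipe -- the file names and JSONL layout are assumptions on my part), you could pre-split the long documents into paragraph-sized tasks in reading order and keep a pointer back to the source document in the meta:

```python
# paragraphs.py -- minimal sketch, assuming each input record has a "text"
# field. Paragraphs are emitted in reading order, so annotators effectively
# work through the document top to bottom and can save at any point.
import srsly

def split_into_paragraph_tasks(path):
    for doc_id, record in enumerate(srsly.read_jsonl(path)):
        paragraphs = [p.strip() for p in record["text"].split("\n\n") if p.strip()]
        for i, paragraph in enumerate(paragraphs):
            yield {
                "text": paragraph,
                "meta": {
                    "doc_id": doc_id,
                    "paragraph": i,
                    "total_paragraphs": len(paragraphs),
                },
            }

if __name__ == "__main__":
    # Write out a new source file you can point Prodigy at instead of the
    # original long documents.
    srsly.write_jsonl("paragraphs.jsonl", split_into_paragraph_tasks("documents.jsonl"))
```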

We also have docs with other tips for working with longer documents, like modifying the CSS, or this link.

Can you give an example? If you split on paragraph or sentence, ideally that wouldn't split the spans.

Hi @ryanwesslen,
Thanks for your response. For the moment, I've settled on splitting the documents into sections using a heuristic. However, this will probably mean that the annotators have to split some documents manually when the heuristic doesn't work. I will see how much manual effort that turns out to be.

For the example - unfortunately, I cannot provide any data, and I feel the task description doesn't make much sense without it. Basically, the problem is that spans can definitely contain multiple sentences, but paragraphs are still very long units of text (1000+ words) - potentially too long to work through in a short session (the task is also rather complex).

Probably Prodigy simply isn't designed for this kind of task, but we also couldn't find a better-suited tool for this...

Prodigy is designed to be customizable, but it is indeed built around seeing "one example at a time". This might feel opinionated, but the thinking is that it's easier to annotate a single, smaller thing at a time. That typically leads to more annotations, usually of higher quality.

However, I'm currently working on a personal project that has a similar issue, so I figured that it might be helpful to explain the problem that I'm dealing with, together with the solution that worked for me. It might not perfectly translate to your problem, but hopefully it'll be a source of inspiration for another iteration on your end.

My issue

Many papers on arxiv do not interest me, but there is one kind of paper that I cannot get enough of ... it's papers about :sparkles: new datasets :sparkles:.

These are usually just plain amazing. They're creative, they help expand my understanding of possible use-cases and they're often publicly available too. Just to mention some examples: there's a dataset for text bubbles in comic books, one for text2fabric and a whole bunch that revolve around detecting things related to plants. All these papers were great reads.

But this begs the question: how does one find these articles? There's a public API for arxiv that gives you the abstracts ... but these tend to be relatively long. Not huge, but long enough that sentence-transformer vectors have a somewhat hard time capturing the context. When there are 10 sentences, their context tends to average out. It also doesn't help that there's a class imbalance: most articles aren't about new datasets, and an article can also be about "a new benchmark on a dataset" rather than a new dataset itself.

So my solution was to build a model that would detect if a sentence indicates that the paper might provide a new dataset. This reduces the problem so that it becomes easier to model. But as a happy side effect, it also makes it easier to annotate!

Three techniques

To annotate this data, I tend to rely on three techniques.

#1: Queries

When I download a new set of articles, I add them to a search engine that I've got running locally. That way, when I see a sentence that looks like it might be about a new dataset, say "this paper introduces a new corpus for ...", I can use it as a query to find similar candidates -- and those are usually exactly the sentences I'm interested in.
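Just to make that query step concrete: conceptually it's a nearest-neighbour lookup over sentence embeddings. A minimal sketch (the model name and example sentences below are made up for illustration) might look like this:

```python
# Minimal "find similar sentences" sketch using sentence-transformers.
# The model name and example sentences are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "We introduce a new corpus for multilingual question answering.",
    "Our method improves BLEU by 2 points on WMT14.",
    "This paper presents a dataset of annotated comic book panels.",
]
query = "this paper introduces a new corpus for ..."

# Encode everything once, then rank candidates by cosine similarity to the query.
sentence_embs = model.encode(sentences, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, sentence_embs)[0]

for score, sentence in sorted(zip(scores.tolist(), sentences), reverse=True):
    print(f"{score:.2f}  {sentence}")
```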

#2: Active learning

After a while I'll end up with some positive cases. It's also easy to just go through the sentences randomly to find some negative cases. But once I've got those two sets, I'm more interested in examples where the model is uncertain. For that I train a sentence model and use it to attach confidence scores to the most recent sentences I've seen. Then I build a queue of examples to annotate where the model is the most uncertain.
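To make that "uncertainty queue" concrete: it can be as simple as sorting unlabelled sentences by how close the model's score is to 0.5. The embedding model and classifier below are stand-ins; any sentence model that outputs a probability will do:

```python
# Sketch of uncertainty sampling: score unlabelled sentences with whatever
# sentence model you have, then annotate the ones closest to 0.5 first.
# The embedding model, classifier and example sentences are illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")

# Labelled examples gathered so far: (sentence, 1 = "new dataset", 0 = not)
labelled = [
    ("We release a new benchmark dataset of annotated receipts.", 1),
    ("Our approach outperforms prior work on ImageNet.", 0),
]
unlabelled = [
    "We publish the full corpus alongside this paper.",
    "Training takes roughly four hours on a single GPU.",
]

clf = LogisticRegression().fit(
    model.encode([text for text, _ in labelled]),
    [label for _, label in labelled],
)

# Probability of the positive class, then sort by distance from 0.5.
probs = clf.predict_proba(model.encode(unlabelled))[:, 1]
queue = sorted(zip(unlabelled, probs), key=lambda item: abs(item[1] - 0.5))
for text, prob in queue:
    print(f"{prob:.2f}  {text}")
```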

#3: Second Opinion

At this point I might have a model that does "OK". The only downside is that I'm really just annotating sentences, and it's very possible that I'm missing out by not looking at the full abstracts. That's why, for lack of a better term, I do a round of "second opinion" on my abstracts.

This involves taking my most recent sentence model, attaching scores to all sentences, and then retrieving the abstracts where there's one, and only one, confident sentence about a new dataset. The thinking is that, usually, if there's one sentence that strongly indicates a new dataset in an abstract ... there might be a second one?
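In code, that filter boils down to counting confident sentences per abstract. A rough sketch, where the threshold and the `score_sentence` function are assumptions on my end:

```python
# Sketch of the "second opinion" filter: keep abstracts that contain exactly
# one sentence the model is confident about. `score_sentence` stands in for
# whatever sentence model you've trained; the threshold is arbitrary.
THRESHOLD = 0.8

def second_opinion_candidates(abstracts, score_sentence, threshold=THRESHOLD):
    for abstract in abstracts:
        # Each abstract is assumed to be a dict with a list of sentence strings.
        scores = [score_sentence(sentence) for sentence in abstract["sentences"]]
        confident = [s for s in scores if s >= threshold]
        if len(confident) == 1:
            yield abstract
```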

Here's a screenshot of that interface.

Notice how there's one example at the top that's highlighted? That's something my model did. Notice that last sentence in this example? That's something the sentence model skipped, but it's pretty easy to highlight.

Back to your problem.

It's fair to say that "arxiv abstracts" aren't the same as 1000+ word documents, so my "solution" may not translate perfectly. But I am wondering if you might be able to do something similar.

It might be possible to use a moving window of 10 sentences over the document, or to preprocess each document into one paragraph at a time. But you might also be able to build a sentence-level model to help you highlight the interesting bits -- is that something you might be able to do on your end? Another reason why I like using sentences is that the spaCy Doc object can automatically split sentences for you via doc.sents. You can also do this ahead of time, so that you have an abstracts.jsonl file as well as a sentences.jsonl file on disk.
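A small preprocessing script along those lines might look like this -- the abstracts.jsonl and sentences.jsonl names are the ones mentioned above, the rest is just one way to do it:

```python
# Sketch of pre-splitting abstracts into sentences with spaCy's doc.sents
# (via the sentencizer on a blank pipeline), writing one task per sentence
# while keeping a pointer back to the parent abstract.
import spacy
import srsly

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def sentence_tasks(path):
    for i, record in enumerate(srsly.read_jsonl(path)):
        doc = nlp(record["text"])
        for j, sent in enumerate(doc.sents):
            yield {"text": sent.text, "meta": {"abstract_id": i, "sentence": j}}

if __name__ == "__main__":
    srsly.write_jsonl("sentences.jsonl", sentence_tasks("abstracts.jsonl"))
```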
