Prodigy is designed to be customizable, but it is indeed built around seeing "one example at a time". This might feel opinionated, but the thinking is that it's easier to annotate a single, smaller thing at a time. That typically leads to more annotations, usually of higher quality.
However, I'm currently working on a personal project that has a similar issue, so I figured that it might be helpful to explain the problem that I'm dealing with, together with the solution that worked for me. It might not perfectly translate to your problem, but hopefully it'll be a source of inspiration for another iteration on your end.
My issue
Many papers on arXiv do not interest me, but there is one kind of paper that I cannot get enough of ... it's papers about new datasets.
These are usually just plain amazing. They're creative, they help expand my understanding of possible use-cases and they're often publicly available too. Just to mention some examples: there's a dataset for text bubbles in comic books, one for text2fabric and a whole bunch that revolve around detecting things related to plants. All these papers were great reads.
But this raises the question: how does one find these articles? There's a public API for arXiv that gives you the abstracts ... but these tend to be relatively long. Not huge, but long enough that sentence-transformer vectors have a somewhat hard time representing them. When there are 10 sentences, their context tends to average out. It also doesn't help that there is a class imbalance (most articles aren't about new datasets) and that an article can be about "a new benchmark on a dataset" instead.
So my solution was to build a model that would detect if a sentence indicates that the paper might provide a new dataset. This reduces the problem so that it becomes easier to model. But as a happy side effect, it also makes it easier to annotate!
Three techniques
To annotate this data, I tend to rely on three techniques.
#1: Queries
When I download a new set of articles, I add them to a search engine that I've got running locally. That way, when I see a sentence that looks like it might be about a new dataset, say "this paper introduces a new corpus for ...", I can use it as a query to find similar sentences, which are usually sentences I'm interested in as well.
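I won't claim this is what I run locally, and any search engine would do, but to give an impression: a brute-force nearest-neighbour lookup over sentence embeddings already gets you pretty far. The model name and the `sentences.jsonl` layout below are assumptions on my part.

```python
import numpy as np
import srsly
from sentence_transformers import SentenceTransformer

# Assumed layout: sentences.jsonl contains {"text": ...} records.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [ex["text"] for ex in srsly.read_jsonl("sentences.jsonl")]
embeddings = model.encode(sentences, normalize_embeddings=True)

def query(text, n=10):
    """Return the n stored sentences most similar to the query text."""
    q = model.encode([text], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity, since vectors are normalized
    return [sentences[i] for i in np.argsort(-scores)[:n]]

print(query("this paper introduces a new corpus for"))
```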
#2: Active learning
After a while I'll end up with some positive cases. It's also easy to just go through the sentences randomly to find some negative cases. Once I've got those two sets, I become more interested in examples where the model is uncertain. For that, I train a sentence model and use it to attach confidence scores to the most recent sentences I've seen. Then I build a queue of examples to annotate where the model is most uncertain.
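As a sketch of that queue: the snippet below reuses the `model`, `sentences` and `embeddings` from the previous example, and assumes the labels so far live in a hypothetical `annotations.jsonl` file. Logistic regression over the embeddings is just one reasonable choice of classifier here.

```python
import numpy as np
import srsly
from sklearn.linear_model import LogisticRegression

# Assumed layout: annotations.jsonl contains {"text": ..., "label": 0 or 1}.
labeled = list(srsly.read_jsonl("annotations.jsonl"))
X = model.encode([ex["text"] for ex in labeled], normalize_embeddings=True)
y = [ex["label"] for ex in labeled]
clf = LogisticRegression().fit(X, y)

# Probabilities near 0.5 are where the model is least sure.
probas = clf.predict_proba(embeddings)[:, 1]
queue = np.argsort(np.abs(probas - 0.5))  # most uncertain sentences first

for idx in queue[:20]:
    print(f"{probas[idx]:.2f} {sentences[idx]}")
```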
#3: Second Opinion
At this point I might have a model that does "OK". The only downside is that I'm really just annotating sentences, and it's very possible that I'm missing out by not looking at the abstracts. That's why, for lack of a better term, I do a round of "second opinion" on my abstracts.
This involves taking my most recent sentence model, attaching scores to all sentences, and then retrieving the abstracts where there's one, and only one, confident sentence about a new dataset. The thinking is that if there's one sentence in an abstract that strongly indicates a new dataset ... there might be a second one?
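To sketch that filtering step (reusing `model` and `clf` from above, assuming each `sentences.jsonl` record carries an `abstract_id` so sentences can be traced back to their abstract, and with a 0.8 threshold that I picked arbitrarily):

```python
from collections import Counter

import srsly

records = list(srsly.read_jsonl("sentences.jsonl"))
probas = clf.predict_proba(
    model.encode([r["text"] for r in records], normalize_embeddings=True)
)[:, 1]

# Count the confident "new dataset" sentences per abstract.
confident = Counter(r["abstract_id"] for r, p in zip(records, probas) if p > 0.8)

# Abstracts with exactly one confident sentence get a second opinion.
second_opinion = [abstract_id for abstract_id, n in confident.items() if n == 1]
```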
Here's a screenshot of that interface.
Notice how there's one example at the top that's highlighted? That's something my model did. Notice that last sentence in this example? That's something the sentence model skipped, but it's pretty easy to highlight.
Back to your problem.
It's fair to say that "arxiv abstracts" aren't the same as 1000+ word documents, so my "solution" may not translate perfectly. But I am wondering if you might be able to do something similar.
It might be possible to use a moving window of 10 sentences over the document, or to preprocess the document into paragraphs. But you might also be able to build a sentence-level model to help you highlight interesting bits; is that something you could do on your end? Another reason why I like working with sentences is that the spaCy `Doc` object can automatically split them for you via `doc.sents`. You can also do this ahead of time, so that you have an `abstracts.jsonl` file as well as a `sentences.jsonl` file on disk, as sketched below.
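To give an impression of that preprocessing step (the `id` field and the file layout are my own made-up convention; `srsly` ships alongside spaCy and is handy for `.jsonl` files):

```python
import spacy
import srsly

# A blank pipeline with a rule-based sentencizer is enough for splitting;
# a full model like en_core_web_sm would also work.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def to_sentences(abstracts):
    for abstract in abstracts:
        doc = nlp(abstract["text"])
        for i, sent in enumerate(doc.sents):
            yield {"abstract_id": abstract["id"], "sent_id": i, "text": sent.text}

# Assumed layout: abstracts.jsonl contains {"id": ..., "text": ...} records.
abstracts = list(srsly.read_jsonl("abstracts.jsonl"))
srsly.write_jsonl("sentences.jsonl", to_sentences(abstracts))
```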