Categorise documents for inclusion/exclusion

I have 970 files that I processed into 44,000 lines in a JSONL file, structured as:

{"text": "1.1 text...", "label": "documentname1-date"}
{"text": "1.2 text...", "label": "documentname1-date"}
{"text": "1.1 text...", "label": "documentname2-date"}
{"text": "1.2 text...", "label": "documentname2-date"}

The first piece of work I need to do is to categorise the documents for inclusion/exclusion based on their contents (i.e. at the label level). Ideally this would become a pre-processing model that I can run again in the future as new documents become available.

I was planning to apply the results at the document level in the database I create from Prodigy once I have categorised all 970 files.

I thought I could do this with manual.spans, with labels = x, not-x, y, not-y and a patterns file to help identify x and y. The context in which the pattern appears determines whether it is x or not-x, which is why I thought spans might be a good fit: I can highlight the fuller context.

I've now annotated 200 lines this way and realise that this might be a bad setup. Prodigy keeps feeding me lines in the order they appear. Sometimes I'm clicking accept 50 times after I already have what I need from that document. I think my questions are:

  1. Does it make sense to use spans in this way?
  2. Is there a way I can tell Prodigy I'm done with a certain label while annotating?
  3. Alternatively, can I have Prodigy only show (at least initially) JSONL lines that include a pattern match?

Thanks for any thoughts you have.


I usually like to have a Jupyter notebook with some custom Python code that can create interesting subsets of the original .jsonl file. I then proceed by passing a subset.jsonl file to Prodigy. This is a personal preference, but I like how much control it gives me on what I label. You can certainly also create a custom recipe that filters data from inside of Prodigy as well, but I find it just a bit easier to make a subset in a notebook when I'm just getting started.
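As a rough sketch of what such a notebook cell could look like: the snippet below reads the JSONL file, keeps only the lines whose "text" matches at least one regular expression, and writes them back out. The patterns and the helper names (`read_jsonl`, `write_jsonl`, `matching_examples`) are placeholders I made up for illustration, not part of Prodigy, so adjust them to your data.

```python
import json
import re
from pathlib import Path

# Hypothetical patterns; replace with the phrases that matter for your task.
PATTERNS = [r"minor submission", r"type of analysis"]


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, examples):
    """Write an iterable of dicts to a JSONL file, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


def matching_examples(examples, patterns):
    """Keep only examples whose "text" matches at least one pattern."""
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    return [eg for eg in examples if regex.search(eg["text"])]
```

In a notebook you would then run something like `write_jsonl("subset.jsonl", matching_examples(read_jsonl("data.jsonl"), PATTERNS))` (the file names are assumptions) and pass the resulting subset.jsonl to Prodigy as the source.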

This is general advice, but in order to give advice more specific to your situation, I'd like to understand your task a bit better. What's the end goal you have in mind? Are you interested in extracting spans, or are you using spans to filter out documents that might not be of interest, as a pre-processing step before labelling a classification task? Can you give an example of something you'd want to filter out and an example of something you'd like to keep?

Thanks for the reply and helpful points.

"I usually like to have a Jupyter notebook with some custom Python code that can create interesting subsets of the original .jsonl file."

It probably makes sense to do this for now. I was thinking of working through Span Categorization · Suggesters but this might be unnecessary if I just cut down my list to a reasonable number. But if I can do it in a better way, that'd be ideal.

The end goal for this step is to pre-process out documents that aren't relevant to my research. However, having a smart way to do this pre-processing is desirable as I plan to update the dataset over time as more documents become available. Also, I'll eventually add back in the documents I am currently filtering out, as I expand my research.

I want to keep submissions (which are described in a single document) that are [type of analysis] and are major (not minor).

  • Keep: "The submission presented a [type of analysis]"
  • Filter out 1: "The previous submission presented a [type of analysis]. The current submission presents a [different type of analysis]"
  • Filter out 2: "This minor submission presented a [type of analysis]"

It's usually not as clear-cut as the example I show in "Filter out 1". There are many cases where the document talks about the type of analysis that I'm concerned with but actually uses a different type, so it should be filtered out. For example, "The submission should have used a [type of analysis]" should still be filtered out, since it didn't use that type of analysis.

Sometimes the submission uses multiple types of analysis, including the one I am concerned with, and I want to keep those. For example, "The submission presented a [type of analysis] and a [different type of analysis]" should be kept.

"Filter out 2" is more straightforward. I think in 99% of cases, exclusion will be based on the use of the exact pattern of "minor submission".

Edit after approaching this again
My problem statement is very similar to the one here: Classifying long-documents based on small spans of text - Prodigy Support. I tried altering the code that @ines provided there to write my first custom recipe, but I keep getting "Annotating non-existing argument: dataset".

Sounds good! I'd advise eventually also adding some sentences where the pattern isn't detected, though. We wouldn't want an algorithm trained on this data to "assume" that every sentence contains the pattern we used to create the subset.
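One way to do that, as a minimal sketch: keep every line that matches a pattern and mix in a random sample of lines that don't. The function name `mix_in_negatives`, the `n_negatives` parameter and the fixed seed are all assumptions for illustration; how many non-matching lines to add is a judgment call for your data.

```python
import random
import re


def mix_in_negatives(examples, patterns, n_negatives, seed=0):
    """Return all pattern matches plus a random sample of non-matching examples.

    examples: list of dicts with a "text" key.
    patterns: list of regular expression strings.
    n_negatives: how many non-matching examples to add (capped at what exists).
    """
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    matches = [eg for eg in examples if regex.search(eg["text"])]
    others = [eg for eg in examples if not regex.search(eg["text"])]

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = rng.sample(others, min(n_negatives, len(others)))

    # Shuffle so matches and non-matches are interleaved during annotation.
    combined = matches + sampled
    rng.shuffle(combined)
    return combined
```

Shuffling also means you won't see all lines of one document back to back, which may or may not be what you want when annotating.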

If you're interested in learning how to make a custom recipe, you might appreciate the recent videos that I've been making on YouTube. In each "episode" I try to start from scratch, using the subsetting approach together with a custom recipe. I'd particularly recommend checking out the first video. The use-case in the video isn't exactly what you're doing, but the video does go in depth on what Prodigy expects when you're making a custom recipe.

If you're still having trouble with your custom recipe after watching the episode, feel free to report back here :)