categorise documents for inclusion/exclusion

Thanks for the reply and helpful points.

I usually like to have a Jupyter notebook with some custom Python code that can create interesting subsets of the original .jsonl file.
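In practice the notebook just loads the file, applies a predicate to each record, and writes out the subsets. A minimal sketch of that shape (the file names and the predicate are placeholders, not my actual code), using srsly, which ships with Prodigy/spaCy:

```python
import srsly

# Load the full corpus; "documents.jsonl" is a placeholder file name.
examples = list(srsly.read_jsonl("documents.jsonl"))

def is_relevant(eg):
    # Placeholder predicate; the real inclusion logic is described below.
    return "[type of analysis]" in eg["text"]

keep = [eg for eg in examples if is_relevant(eg)]
filtered = [eg for eg in examples if not is_relevant(eg)]

srsly.write_jsonl("documents_keep.jsonl", keep)
srsly.write_jsonl("documents_filtered.jsonl", filtered)
```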

It probably makes sense to do this for now. I was thinking of working through Span Categorization · Suggesters, but that might be unnecessary if I just cut my list down to a reasonable number. Still, if there's a better way to do it, that'd be ideal.

The end goal for this step is to pre-process out documents that aren't relevant to my research. That said, a smart way to do this pre-processing is desirable, since I plan to update the dataset over time as more documents become available. I'll also eventually add back the documents I'm currently filtering out, as I expand my research.

I want to keep submissions (each described in a single document) that present a [type of analysis] and are major (not minor).

  • Keep: "The submission presented a [type of analysis]"
  • Filter out 1: "The previous submission presented a [type of analysis]. The current submission presents a [different type of analysis]"
  • Filter out 2: "This minor submission presented a [type of analysis]"

It's usually not as clear-cut as the example in "Filter out 1": there are many cases where the document talks about the type of analysis I'm concerned with but actually uses a different type, so it should be filtered out. For example, "The submission should have used a [type of analysis]" should still be filtered out, since the submission didn't actually use that type of analysis.

Sometimes the submission does multiple types of analysis, including the one I'm concerned with, and I want to keep those. For example, "The submission presented a [type of analysis] and a [different type of analysis]" should be kept.

"Filter out 2" is more straightforward. I think in 99% of cases, exclusion will be based on the use of the exact pattern of "minor submission".

Edit after approaching this again
My problem statement is very similar to the one here: Classifying long-documents based on small spans of text - Prodigy Support. I tried altering the code that @ines provided there to make my first custom recipe, but I keep getting "Annotating non-existing argument: dataset".
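For reference, a stripped-down skeleton of the kind of recipe I'm trying to write (names simplified, not the exact code from the linked thread). My current guess is that the error means the decorator is annotating an argument that isn't in the function signature, but I haven't been able to confirm that:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "filter-documents",
    # My understanding is that every argument annotated here has to match a
    # parameter of the function below, otherwise the "Annotating non-existing
    # argument" error is raised.
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the source .jsonl file", "positional", None, str),
)
def filter_documents(dataset, source):
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",
    }
```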