categorise documents for inclusion/exclusion

Thanks for the reply and helpful points.

I usually like to have a Jupyter notebook with some custom Python code that can create interesting subsets of the original .jsonl file.
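In practice the notebook just loads the file, applies a predicate to each record, and writes out the subsets. A minimal sketch of that shape (the file names and the predicate are placeholders, not my actual code), using srsly, which ships with Prodigy/spaCy:

```python
import srsly

# Load the full corpus; "documents.jsonl" is a placeholder file name.
examples = list(srsly.read_jsonl("documents.jsonl"))

def is_relevant(eg):
    # Placeholder predicate; the real inclusion logic is described below.
    return "[type of analysis]" in eg["text"]

keep = [eg for eg in examples if is_relevant(eg)]
filtered = [eg for eg in examples if not is_relevant(eg)]

srsly.write_jsonl("documents_keep.jsonl", keep)
srsly.write_jsonl("documents_filtered.jsonl", filtered)
```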

It probably makes sense to do this for now. I was thinking of working through Span Categorization · Suggesters, but that might be unnecessary if I just cut my list down to a reasonable number. Still, if there's a better way to do it, that'd be ideal.

The end goal for this step is to pre-process out documents that aren't relevant to my research. That said, a smart way to do this pre-processing is desirable, since I plan to update the dataset over time as more documents become available. I'll also eventually add back the documents I'm currently filtering out, as I expand my research.

I want to keep submissions (each described in a single document) that present a [type of analysis] and are major (not minor).

  • Keep: "The submission presented a [type of analysis]"
  • Filter out 1: "The previous submission presented a [type of analysis]. The current submission presents a [different type of analysis]"
  • Filter out 2: "This minor submission presented a [type of analysis]"

It's usually not as clear-cut as the example in "Filter out 1": there are many cases where the document talks about the type of analysis I'm concerned with but actually uses a different type, so it should be filtered out. For example, "The submission should have used a [type of analysis]" should still be filtered out, since the submission didn't actually use that type of analysis.

Sometimes the submission does multiple types of analysis, including the one I'm concerned with, and I want to keep those. For example, "The submission presented a [type of analysis] and a [different type of analysis]" should be kept.

"Filter out 2" is more straightforward. I think in 99% of cases, exclusion will be based on the use of the exact pattern of "minor submission".

Edit after approaching this again
My problem statement is very similar to the one here: Classifying long-documents based on small spans of text - Prodigy Support. I tried altering the code that @ines provided there to make my first custom recipe, but I keep getting "Annotating non-existing argument: dataset".
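For reference, a stripped-down skeleton of the kind of recipe I'm trying to write (names simplified, not the exact code from the linked thread). My current guess is that the error means the decorator is annotating an argument that isn't in the function signature, but I haven't been able to confirm that:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "filter-documents",
    # My understanding is that every argument annotated here has to match a
    # parameter of the function below, otherwise the "Annotating non-existing
    # argument" error is raised.
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the source .jsonl file", "positional", None, str),
)
def filter_documents(dataset, source):
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",
    }
```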