Hello, I'm looking for an opinion on what the best general Prodigy strategy would be for this particular situation:
Problem:
I have up to a page of text (usually a half of a page with 3-4 paragraphs) for each row in a CSV file. Each row in this CSV will comprise of 0 to 6 possible annotations that comprise part of or all of an entire paragraph (usually an entire paragraph is associated with one annotation). For example (each new line is a new paragraph):
Topic 1 โ Words about topic 1
Topic 2 โ Words about topic 2
Topic 3 โ Words about topic 3
This includes the situation:
Topic 1
Topic 2
Topic 1
Topic 2
But a paragraph can also look like this:
Topic 1. Topic 2.
Topic 3
Topic2. Topic 3.
and any permutation you can imagine.
As far as I can tell, there are three possible procedures using prodigy to obtain the best results:
Use traditional NER (ner.manual, ner.correct, etc.) and tag each paragraph that relates to each annotation.
Use the textcat.manual (multiple checkbox option)
Perform binary training for each individual annotation (will take forever, but will allow the model to focus)
My biggest concern is that based on my research, I am unsure SpaCy (and therefore Prodigy) is the strongest tool for classifying long paragraphs with large token spans. My general questions:
Is this a task that Prodigy/SpaCy can perform with relatively high F-scores? (80+%) assuming good input data? I know you don't know what the input data looks like, but a general opinion would be wonderful.
If so, which methodology would you generally recommend?
If not, do you recommend looking elsewhere (e.g., NTLK) that will result in better outputs for paragraphs length annotations?
I'm not sure I fully understand what you're trying to accomplish here but this doesn't sound like a NER task.
I'd split those paragraphs into sentences and classify them individually.
If you need to be more granular than sentences, you may want to look at the new SpanCategorizer (you'll need Prodigy Nightly to create training data for that) https://spacy.io/usage/v3-1#spancategorizer
I don't see why spacy wouldn't do well on this task but this isn't really a question anyone can answer without trying it out
Yeah, I think it really just comes down to finding the right annotation strategy and approach for breaking your problem down into machine learning components that make sense for the problem and let you learn most effectively from your data. If you've solved that, you'll be able to train a model and the choice of library or annotation tool kinda doesn't matter.
NER really isn't a good choice for this type of task. The spans you're looking for here are very arbitrary and don't always have clear boundaries, and they're likely also very uneven in length, and don't follow a consistent internal structure. What you're interested here is the topic expressed in a paragraph, sentence or sentence fragment, not a label that applies to a concept or proper noun given its local context. All of this is very different from what an NER model is designed to do.
If your texts can't easily be segmented according to topics, this is also something you need to factor in when you're designing your pipeline. If the task you come up with is inconsistent and hard, it'll be much more difficult to teach a model to perform it, and you're making your life a lot harder than it should be.
To me, the problem you're trying to solve here sounds more like a multi-label classification task: you can assign one or more labels to a given paragraph or sentence, and the model will be able to use the signals from the text to learn this distinction. This will also let you capture cases where the expressions that indicate the topic are spread across multiple phrases or sentence fragments (which is very common in natural language).
It's also important to keep in mind what you want to do with the predicted annotations in your downstream application. If you're just extracting arbitrary spans of text, what you get back are arbitrary spans of text. This is pretty unsatisfying to work with, because your results will have nothing in common and no consistent structure, there's often nothing meaningful to compute with that information. On the other hand, if you're predicting labels over the whole text, this gives you very consistent information that you can rely on in your downstream application.