Hello, I'm looking for an opinion on what the best general Prodigy strategy would be for this particular situation:
I have up to a page of text (usually half a page, 3-4 paragraphs) for each row in a CSV file. Each row carries 0 to 6 possible annotations, and each annotation covers part or all of a paragraph (usually an entire paragraph maps to one annotation). For example (each new line is a new paragraph):
Topic 1 – Words about topic 1
Topic 2 – Words about topic 2
Topic 3 – Words about topic 3
That is the clean case, but a paragraph can also look like this:
Topic 1. Topic 2.
Topic 2. Topic 3.
and any permutation you can imagine.
As far as I can tell, there are three possible procedures in Prodigy for getting the best results:
- Use traditional NER (ner.manual, ner.correct, etc.) and tag each paragraph span with the label it relates to.
- Use textcat.manual with multiple non-exclusive labels (the checkbox interface).
- Train a separate binary classifier for each individual annotation (slow, but lets each model focus on one label).
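For context, this is roughly how I'd prepare the data for any of those routes: split each CSV row's text into one task per paragraph so the annotation unit matches what I described above. This is just a sketch; the `text` column name and the blank-line paragraph separator are assumptions about my file, not something fixed:

```python
import csv
import io
import json

def csv_to_tasks(csv_text, text_column="text"):
    """Split each CSV row's text into one Prodigy-style task per paragraph.

    Assumes paragraphs are separated by blank lines and the text lives
    in a column named `text` (both illustrative assumptions).
    """
    tasks = []
    for row_id, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        paragraphs = [p.strip() for p in row[text_column].split("\n\n") if p.strip()]
        for para_id, para in enumerate(paragraphs):
            # Keep the row/paragraph position in "meta" so annotations
            # can be mapped back to the original CSV later.
            tasks.append({
                "text": para,
                "meta": {"row": row_id, "paragraph": para_id},
            })
    return tasks

if __name__ == "__main__":
    sample = 'text\n"Topic 1 words.\n\nTopic 2 words."\n'
    # Emit one JSONL line per paragraph, ready to feed to a recipe.
    for task in csv_to_tasks(sample):
        print(json.dumps(task))
```

The idea is that each paragraph becomes its own JSONL task, which keeps the examples short regardless of which recipe I end up using.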
My biggest concern, based on my research, is that I'm not sure spaCy (and therefore Prodigy) is the strongest tool for classifying long paragraphs with large token spans. My general questions:
- Is this a task that Prodigy/spaCy can perform with relatively high F-scores (80%+), assuming good input data? I know you can't see the data, but a general opinion would be wonderful.
- If so, which methodology would you generally recommend?
- If not, do you recommend looking elsewhere (e.g., NLTK) for better results on paragraph-length annotations?