I'm trying to find the best approach to extract requirements (spans) from documents. The requirements typically follow a list format with one item being a requirement, but they vary too much for regular expressions to handle. My spans do not overlap.
I started with spancat and typical spans are one or two sentences long but sometimes they can be multiple sentences. To deal with long spans I made a separate experiment with tagging the start and end of a requirement with NER. Both approaches give quite similar results with about 600 examples: Spancat scores are around 70 and NER scores at 80. Because I need both the start and end label to be correctly detected to extract a span, the performance of spancat is slightly better. The NER model trains a lot faster so that is one benefit. I haven't seen much info on the boundary tagging approach so any tips would be appreciated. I'm wondering if there should be a relation between the start and end labels and how to do that.
I ran debug data for the NER labels, and I got an error about labels crossing sentence boundaries. The start tag is usually an index and the end tag a period. Should I be concerned? And while running debug on the spancat data I got a warning that spans may not be distinct from the rest of the corpus. Does that mean the content of the spans is not actually relevant?
Also, while the scores are high, the performance on new documents is really bad because each document follows a different format (different indices etc.), but I’m hoping the model can generalize with more training data. Currently I'm training on 5 different documents.
Thanks for your thoughtful question and welcome to the Prodigy community!
That's a good question! Just curious - how prevalent are these examples? Is it only one or two times or is it pretty common?
If these warnings are only affecting a few examples, then I wouldn't be too worried. If there are a lot, then you may have some issues. One quick experiment: could you try training once with those examples excluded and a second time with them included? I'd be curious about any performance degradation.
But I also want to see if anyone on the spaCy team has thoughts, as the root of the question is really about spaCy. I'll post back next week if I hear anything.
One open question: how long are your documents and how are you structuring your examples/records for Prodigy? Paragraphs?
You mentioned that "each document follows a different format". Can you provide more clarification? This could help inform us a bit more about your problem and possible options in training.
I have 156 entity span(s) crossing sentence boundaries out of 851 docs. I don't know how to exclude them from training because I don't really know what those are, but I would guess it's because there is a period inside the labeled entity, for example a lot of my start entities look like this "1.10.". Regarding the spancat data warning, I think the takeaway is that boundaries are more important than the content of the spans, and I was asking for confirmation.
My documents are about 20 pages long, and I split them at empty lines, which gives me examples that each contain roughly one span and does not split any spans. However, there are no newlines in my data, and I think I'm losing valuable information this way. I should probably keep all newlines and manually split each document into paragraphs to annotate, so the model can use newlines as boundary information.
The different formats mainly refer to different indices. This is what my spans can look like in different documents:
1.10. This is a span in a document with numbered indices.
1.This is just a header and not a span
This is a span without any index and sometimes the period is left out too
INDEX - Sometimes the index is a word.
Req. 5.1-3, T2 Sometimes it is a combination of letters and numbers.
2.Sometimes they include sublists:
a) Like this
-with more sublists
I only recently ran into a document that only separates the spans by newline. Do you recommend annotating at the paragraph level to keep the newlines, or is it better to stick to relatively short examples?
Sorry for coming at you with more questions! But regarding this question, is there a way to split at newline to get relatively short examples and keep the newlines at the end of the doc?
In general, we tend to recommend shorter examples, but I can understand if this is a bit tricky.
This is somewhat similar -- here's a quick way to split by a token and then create a new file to load as your source file (replace \xa0 with \n).
However, it doesn't do this under the constraint of keeping the newlines at the end of the doc. I'm wondering if there's some logic you could add as an if statement that would skip splitting when it is at the end.
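As a rough illustration of splitting while keeping the newlines, here is a minimal Python sketch. It is not the snippet referred to above, just one possible approach: it assumes the raw document is a plain string, that examples are separated by blank lines, and that the output should be JSONL with a "text" key. The function names split_keep_newlines and to_jsonl are made up for this example.

```python
import json
import re

# Zero-width split point: right after a blank line ("\n\n") and just before
# the next non-newline character. Splitting on empty matches needs Python 3.7+.
_BLANK_LINE_BOUNDARY = re.compile(r"(?<=\n\n)(?=[^\n])")


def split_keep_newlines(text):
    """Split text at blank lines, leaving the newlines on the preceding chunk."""
    chunks = _BLANK_LINE_BOUNDARY.split(text)
    # Drop chunks that contain nothing but whitespace.
    return [chunk for chunk in chunks if chunk.strip()]


def to_jsonl(chunks):
    """Serialize the chunks as one {"text": ...} record per line."""
    return "\n".join(json.dumps({"text": chunk}) for chunk in chunks)


if __name__ == "__main__":
    sample = (
        "1.10. This is a span in a document with numbered indices.\n\n"
        "INDEX - Sometimes the index is a word.\n\n\n"
        "2.Sometimes they include sublists:\na) Like this\n"
    )
    for chunk in split_keep_newlines(sample):
        print(repr(chunk))
    print(to_jsonl(split_keep_newlines(sample)))
```

Because the split points are zero-width, no characters are consumed: each example keeps its trailing newlines, and single newlines inside an example (such as the sublist above) stay in place. Lines that contain only spaces are not treated as blank here, so that would need adjusting if your documents have them.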
Back to your original question: I was able to talk with a member of the spaCy dev team, who suggested that spancat would likely be a better fit. ner doesn't predict entities across sentence boundaries, and you have more than 100 spans where that's the case (nor is it easy to drop them).
Hope this helps and let us know if you have further questions!