When I tried to train a spancat model on annotated data containing long sequences of text, I ran into the following exceptions. What could be causing them, and how can I fix them? Also, what alternatives are there for annotating and segmenting texts that are too long?
Have you seen this post?
You should consider modifying your suggester function.
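To get a feel for why the suggester matters on long texts, here's a back-of-the-envelope sketch. spaCy's built-in n-gram suggesters propose every n-gram of the configured sizes, so the number of candidate spans grows with both document length and the range of sizes. The function name `ngram_candidate_count` is made up for illustration; this is not spaCy's implementation, just the arithmetic.

```python
def ngram_candidate_count(n_tokens, sizes):
    """Estimate how many candidate spans an n-gram suggester proposes for one doc.

    A document with n_tokens tokens contains (n_tokens - size + 1) n-grams
    of a given size, or none if the document is shorter than that size.
    """
    return sum(max(0, n_tokens - size + 1) for size in sizes)


# Compare a short and a long document with the same suggester sizes.
for n_tokens in (10, 1000):
    print(n_tokens, ngram_candidate_count(n_tokens, range(1, 21)))
```

If the count for your longest documents is very large, restricting the sizes to the span lengths you actually annotated is one thing to try.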
Have you tried to identify any outlier text inputs? For example, try training on half of your data. If it still doesn't train, try the first 10%, and so on. You may be able to find a few records that are driving this.
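A rough sketch of that idea, assuming your texts are already loaded into a Python list. Both helper names are hypothetical, and whitespace splitting is only a crude stand-in for spaCy's tokenizer, so treat the counts as approximate.

```python
from statistics import median


def find_length_outliers(texts, factor=5.0):
    """Return (index, token_count) pairs for texts much longer than the median."""
    # Whitespace splitting is only a rough proxy for real tokenization.
    lengths = [len(text.split()) for text in texts]
    if not lengths:
        return []
    threshold = median(lengths) * factor
    return [(i, n) for i, n in enumerate(lengths) if n > threshold]


def first_fraction(records, fraction):
    """Return the leading fraction of the records (e.g. 0.5, then 0.1)."""
    cutoff = max(1, int(len(records) * fraction))
    return records[:cutoff]


texts = ["short text here", "another short one", "word " * 500]
print(find_length_outliers(texts))
print(len(first_fraction(texts, 0.5)))
```

Training on `first_fraction(records, 0.5)`, then on smaller slices, should narrow down which records trigger the exception.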
If you run `data-to-spacy --spancat your_dataset` and then run `spacy debug data`, you can get some stats on your span characteristics.
If your pipeline contains a `spancat` component, then this command will also report span characteristics such as the average span length and the span (or span boundary) distinctiveness. The distinctiveness measure shows how different the tokens are with respect to the rest of the corpus using the KL-divergence of the token distributions. To learn more, you can check out Papay et al.’s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).
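If you want a quick look at span lengths before exporting, something like the sketch below could work. It assumes Prodigy-style records where each span carries `token_start` and `token_end`, with `token_end` treated as inclusive; that is an assumption, so check it against your own data. `span_length_counts` is a hypothetical helper, not part of Prodigy or spaCy.

```python
from collections import Counter


def span_length_counts(records):
    """Count annotated spans by their length in tokens."""
    counts = Counter()
    for record in records:
        for span in record.get("spans", []):
            # Assumes token_end is inclusive, hence the +1.
            counts[span["token_end"] - span["token_start"] + 1] += 1
    return counts


records = [
    {"text": "New York is big", "spans": [{"token_start": 0, "token_end": 1, "label": "LOC"}]},
    {"text": "Paris is nice", "spans": [{"token_start": 0, "token_end": 0, "label": "LOC"}]},
    {"text": "No spans here"},
]
counts = span_length_counts(records)
print(sorted(counts.items()))
print("longest span:", max(counts))
```

The longest annotated span gives a reasonable upper bound for the n-gram sizes your suggester needs to cover.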
Since this problem is really a spaCy issue (not Prodigy), you may want to search and/or post on spaCy GitHub discussions for a solution, for example: