training long sequence on spancat memory problem

kushal_pythonist · March 29, 2023, 5:37pm

When I tried to train a spancat model on annotated data containing long text sequences, I encountered the following exceptions. What could be causing them, and how can I resolve them? Also, what alternatives are there for annotating and segmenting texts that are too long?

ryanwesslen · March 29, 2023, 6:12pm

Hi @kushal_pythonist,

Have you seen this post?

You should consider modifying your suggester function.
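To see why the suggester matters for memory, here is a rough sketch that counts how many candidate spans an n-gram style suggester would propose for a single document. It assumes the suggester proposes every n-gram of each configured size, which is how spaCy's built-in `spacy.ngram_suggester.v1` (configured via `sizes`) and `spacy.ngram_range_suggester.v1` (configured via `min_size` and `max_size`) behave. The function name and the document length below are made up for illustration.

```python
def count_candidate_spans(n_tokens, sizes):
    """Count the n-gram candidates proposed for one doc of n_tokens tokens.

    Assumes every contiguous n-gram of each size in `sizes` is suggested.
    """
    return sum(max(0, n_tokens - size + 1) for size in sizes)


# Hypothetical document length, chosen only to show the growth.
doc_len = 2000

for sizes in ([1, 2, 3], list(range(1, 21))):
    n_spans = count_candidate_spans(doc_len, sizes)
    print(f"sizes 1..{max(sizes)}: {n_spans} candidate spans")
```

Each candidate span gets its own representation and score during training, so a long document combined with a wide range of sizes can multiply the work considerably. Narrowing the sizes to what your annotated spans actually need is usually the first thing to try.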

Have you tried to identify any outlier text inputs? For example, try training on half of your data. If it still doesn't train, try the first 10%, and so on. You may be able to find a few records that are driving the problem.
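As a complement to halving the data, a quick length check can point at suspicious records directly. This is only a sketch: it assumes each record is a dict with a `"text"` key (as in Prodigy's JSONL export), uses whitespace splitting as a stand-in for real tokenization, and the `factor` threshold and function name are arbitrary choices of mine.

```python
import statistics


def flag_long_texts(records, factor=5.0):
    """Return (index, n_tokens) for records much longer than the median.

    Token counts are approximated by whitespace splitting.
    """
    lengths = [len(rec["text"].split()) for rec in records]
    if not lengths:
        return []
    median = statistics.median(lengths)
    return [(i, n) for i, n in enumerate(lengths) if n > factor * median]


# Toy data standing in for an exported dataset.
records = [
    {"text": "short example text"},
    {"text": "another short one"},
    {"text": "word " * 500},
    {"text": "a b c d"},
]

for index, n_tokens in flag_long_texts(records):
    print(f"record {index} has about {n_tokens} tokens")
```

Records flagged this way are good candidates to leave out of a test run, or to split into smaller pieces before annotation.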

If you run prodigy data-to-spacy with --spancat your_dataset and then run spaCy's debug data command on the output, you can get some stats on your span characteristics.

If your pipeline contains a spancat component, then this command will also report span characteristics such as the average span length and the span (or span boundary) distinctiveness. The distinctiveness measure shows how different the tokens are with respect to the rest of the corpus using the KL-divergence of the token distributions. To learn more, you can check out Papay et al.’s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).
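To make the distinctiveness idea concrete, here is a simplified sketch of a KL-divergence between the token distribution inside spans and the token distribution of the whole corpus. It is not spaCy's exact implementation, and the toy tokens and function names are invented for illustration.

```python
import math
from collections import Counter


def distribution(tokens):
    """Relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}


def kl_divergence(p, q):
    """KL(p || q). Assumes every token in p also occurs in q, which holds
    when p is built from spans taken out of the corpus behind q."""
    return sum(p_w * math.log(p_w / q[w]) for w, p_w in p.items())


# Toy corpus and the tokens that fall inside annotated spans.
corpus_tokens = ["the", "drug", "aspirin", "reduces", "the", "pain"]
span_tokens = ["aspirin"]

score = kl_divergence(distribution(span_tokens), distribution(corpus_tokens))
print(f"span distinctiveness: {score:.3f}")
```

A value near zero means the span tokens look like the rest of the corpus, which tends to make spans harder to learn. Larger values mean the spans use more distinctive vocabulary.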

Since this problem is really a spaCy issue (not Prodigy), you may want to search and/or post on spaCy GitHub discussions for a solution, for example:

Topic	Replies	Views	Activity
Low score in spancat training	11	365	February 14, 2023
spancat best annotations practices spancat	9	502	November 17, 2022
Span-Categorization Remove Oversize Span example from Dataset. usage , spacy , spancat	1	400	June 24, 2022
spancat out of memory training , spancat	3	1038	April 24, 2022
Spancat data "Boundary tokens are not distinct from the rest of the corpus" spacy , spancat	1	423	February 7, 2023
