SpanRuler to indicate spans of arbitrary length

I am training Spancat-Singlelabel to extract error messages from human-written reports.
I've annotated around 1200 examples and so far the results are OK.

Many error messages are almost always introduced by the same phrases. For example "... validation error: <error_message>. ..."
<error_message> can vary in length (often multiple sentences) and in tokens. My training examples include a lot of those cases (only the error_message is annotated, not the indicating phrase before it), but spaCy is still missing a lot of error messages.

So my idea was to use SpanRuler to match those indicating phrases in front of the error messages, but I am not sure how to use it to mark a span of arbitrary length after the phrase.

Or is it not the right tool for the job?

Hey @Eric !

The role of SpanRuler is really to identify spans using token-based rules or exact phrase matches so I don't think it's appropriate to use if the spans are of arbitrary length and not easily defined via rules.
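To illustrate what SpanRuler can and cannot do here, this is a minimal sketch, assuming spaCy 3.3.1 or later and a blank English pipeline. The label ERROR_INTRO and the example sentence are made up for illustration. The token pattern matches the fixed phrase "validation error:" but has no way of saying "and everything up to the end of the error message":

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# By default SpanRuler writes its matches to doc.spans["ruler"].
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {
        "label": "ERROR_INTRO",  # hypothetical label for the indicating phrase
        "pattern": [{"LOWER": "validation"}, {"LOWER": "error"}, {"ORTH": ":"}],
    }
])

doc = nlp("Upload failed with validation error: field id is missing. Please retry.")

# Only the fixed phrase is matched; the error message after it is not covered.
intro_spans = [(span.text, span.label_) for span in doc.spans["ruler"]]
print(intro_spans)
```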

However, given that "validation error:" is a good signal that an error message will follow, I would actually include these tokens in the labelled span for your model. This way, the model learns to recognise the combination of the indicative phrase and the error message as the relevant span, and you can always post-process the spans downstream to drop these tokens. You'd need to make sure that your labelled data is representative, so the model doesn't overfit to error messages that are preceded by the indicative phrase.

Good luck!

Thank you @india-kerle, I was worried that was the case. But shouldn't Spancat recognize the context around the span and learn the indicative phrase by itself?

No problems. From what I understand, the spancat classifier depends on the candidate spans that the suggester passes in. So, for example, if you define your suggester to suggest 5-grams, every 5-token window that slides over the text is considered as a candidate. As a result, the classifier also sees the surrounding tokens, because these occur in the neighbouring windows.
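To make the moving window concrete, here is a plain-Python sketch of how 5-gram candidates are enumerated. It is not spaCy's own suggester code, just the same idea, and ngram_windows is a hypothetical helper. It also shows that a span can only be predicted if one of the configured sizes matches its length:

```python
def ngram_windows(n_tokens: int, sizes: list[int]) -> list[tuple[int, int]]:
    """Enumerate (start, end) token offsets for every n-gram of the given sizes."""
    windows = []
    for size in sizes:
        for start in range(n_tokens - size + 1):
            windows.append((start, start + size))
    return windows


# Pre-tokenized example; the error message is the last five tokens.
tokens = "Upload failed with validation error : field id is missing .".split()
candidates = ngram_windows(len(tokens), sizes=[5])

for start, end in candidates:
    print((start, end), " ".join(tokens[start:end]))

# The error message is a candidate only because it is exactly five tokens long.
message = (6, 11)
print(message in candidates)
```

With error messages that run over several sentences, the suggester sizes would need to cover those lengths, otherwise the full message is never offered to the classifier.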

However, the spaCy forum might be a more appropriate place to discuss the nuances of the model!