failure on active learning: how to make it fail gracefully?

I need to relabel some pre-annotated data.
When I import it and work up to, let's say, document 10, I get an error: ValueError: Mismatched tokenization..... I understand where the error is coming from.

My question or request is: how can I make it either skip this document, discard the annotations around those positions, or discard the annotations on that document altogether?

Currently I only find out about the problem in the middle of manual annotation, and it takes significant effort to recover the manual annotations and restart the job. Is there any flag to make it fail gracefully?

Hi! I think the flag you're looking for is the skip argument on the add_tokens preprocessor: https://prodi.gy/docs/api-components#add_tokens With skip=True, it won't raise an error and instead won't add the respective span if the tokenization doesn't match.

If you're using a binary accept/reject workflow, you might want to add another check to ensure that you don't end up with examples that have no suggestions. Btw, if you want to find mismatched tokenization and implement your own logic here, the easiest solution is to use spaCy's Doc.char_span: if that returns None for a given span, the span doesn't match the Doc's tokenization.
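
Here's a rough sketch of what that Doc.char_span check could look like. It assumes your examples are dicts with a "text" and a list of "spans" holding character offsets under "start" and "end". The split_spans helper and the sample example are made up for illustration, and I'm using a blank English pipeline, so swap in whatever tokenizer you actually annotate with.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained model).
nlp = spacy.blank("en")


def split_spans(nlp, example):
    """Hypothetical helper: separate spans that line up with token
    boundaries from spans that don't."""
    doc = nlp.make_doc(example["text"])
    aligned, misaligned = [], []
    for span in example.get("spans", []):
        # Doc.char_span returns None if the character offsets
        # don't fall on token boundaries.
        if doc.char_span(span["start"], span["end"]) is None:
            misaligned.append(span)
        else:
            aligned.append(span)
    return aligned, misaligned


# Made-up example: the second span ends in the middle of "York".
example = {
    "text": "Apple opens a store in New York",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 23, "end": 29, "label": "GPE"},
    ],
}

aligned, misaligned = split_spans(nlp, example)
```

From there you can decide what to do per example: keep only the aligned spans, drop the whole example if anything is misaligned, or write the misaligned ones to a file so you can fix them before you start annotating.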