First off, thank you for the beautiful relation annotation tool -- we're really enjoying its UX and the efficiency with which we can label and visualize relations!
We have a transformer-based model that jointly predicts entities and relations. This model uses byte-pair encoding (BPE) tokenization, which is quite similar to what spaCy does, but not similar enough.
When making predictions, we create a JSON similar to the rel.manual dataset output. It works well a lot of the time when loaded into Prodigy, but there are two recurring issues:
doc.char_span(start_index, stop_index) returns None, e.g. when BPE splits in the middle of a spaCy token; and the front end throws index errors when the spaCy tokenization in Prodigy doesn't split as much as the BPE tokenizer, so our token offsets exceed the max position in the Prodigy spaCy doc.
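To make the two failure modes concrete, here is a minimal sketch that checks a predicted span against a tokenization without needing spaCy itself. The token offsets are hard-coded for illustration (with spaCy they could be read from token.idx and len(token)), diagnose_span is a hypothetical helper name rather than anything in spaCy or Prodigy, and the span keys are my assumption of how the predictions are laid out.

```python
def diagnose_span(token_offsets, span):
    """Classify a predicted span against a tokenization.

    token_offsets: list of (start_char, end_char) pairs, end exclusive.
    span: dict with "start"/"end" character offsets and, optionally,
          "token_start"/"token_end" token indices (end inclusive).
          These key names are an assumption for this sketch.
    """
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}

    # Failure mode 2: token indices computed with another tokenizer
    # point past the last token of this doc.
    last_token = len(token_offsets) - 1
    for key in ("token_start", "token_end"):
        if key in span and not 0 <= span[key] <= last_token:
            return "token_index_out_of_range"

    # Failure mode 1: a character offset falls inside a token
    # instead of on a token boundary.
    if span["start"] not in starts or span["end"] not in ends:
        return "misaligned"
    return "ok"


# "unbelievable results" split on whitespace: two tokens.
tokens = [(0, 12), (13, 20)]

print(diagnose_span(tokens, {"start": 0, "end": 12}))   # ok
print(diagnose_span(tokens, {"start": 2, "end": 12}))   # misaligned
print(diagnose_span(tokens, {"start": 13, "end": 20,
                             "token_start": 3, "token_end": 4}))  # token_index_out_of_range
```

The second call mimics a BPE piece starting in the middle of "unbelievable"; the third mimics token indices counted over BPE pieces rather than over the two whitespace tokens.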
To fix this I've tried using the same spaCy model (en_core_web_lg) as a pretokenizer for BPE, and I've tried using spacy.gold.align on the normal spaCy-tokenized input and the BPE-encoded input to set the span start and stop on the transformer side.
This last effort works, but it's a lot of work to feed into Prodigy, and it causes errors when spaCy merges tokens that shouldn't be merged.
It's also worth noting that I initially supplied tokens, spans and relations from the BPE scheme, but these wouldn't load properly.
So first of all, this is a pain point. I don't think it makes sense to try to couple the tokenization between the predictive model and the Prodigy model. Even using the same spaCy models I'm getting differences in tokenization behavior, we may want to try different tokenization schemes on either end, and there isn't a straightforward way to ensure compatibility.
So I'm thinking the best way forward (because the other approaches don't really work) is to use the span character offset positions to snap to spaCy tokens on the Prodigy side. Is there a good function to just get the spaCy token given a character index? Maybe such an approach could be integrated into Prodigy as a loading option to increase compatibility with non-spaCy-based approaches.
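To illustrate what I mean by snapping, here is a rough sketch in plain Python, not an existing spaCy or Prodigy function. It assumes the token character offsets are already available as (start, end) pairs (with spaCy these could be built from token.idx and len(token)); token_at_char and snap_span are hypothetical names. Depending on the spaCy version, doc.char_span may also accept an alignment_mode argument that does something similar, but I haven't verified that for the version I'm on.

```python
from bisect import bisect_right


def token_at_char(token_offsets, char_index):
    """Return the index of the token containing char_index.

    token_offsets: sorted list of (start_char, end_char), end exclusive.
    Returns None if the index falls in whitespace or outside the text.
    """
    starts = [s for s, _ in token_offsets]
    i = bisect_right(starts, char_index) - 1
    if i < 0:
        return None
    start, end = token_offsets[i]
    return i if start <= char_index < end else None


def snap_span(token_offsets, start, end):
    """Expand the character span [start, end) to whole tokens.

    Returns (token_start, token_end, char_start, char_end), with
    token_end inclusive, or None if the span overlaps no token.
    """
    overlapping = [
        i for i, (s, e) in enumerate(token_offsets) if s < end and e > start
    ]
    if not overlapping:
        return None
    first, last = overlapping[0], overlapping[-1]
    return first, last, token_offsets[first][0], token_offsets[last][1]


# "unbelievable results" split on whitespace: two tokens.
tokens = [(0, 12), (13, 20)]

print(token_at_char(tokens, 5))   # 0
print(snap_span(tokens, 2, 12))   # (0, 0, 0, 12)
print(snap_span(tokens, 8, 15))   # (0, 1, 0, 20)
```

Expanding outward means a BPE piece that starts mid-token gets widened to the full token, so the loaded span may cover slightly more text than the model predicted. Contracting to fully covered tokens would be the alternative, at the risk of dropping short spans entirely.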