Greetings. A colleague of mine started using Prodigy recently. I got interested in it because I need a model to analyse quite chaotic text from the internet. Sometimes there are missing spaces and two or more words get combined into one: "forexample likethis".
I want to be able to select subwords.
In the previous example I should be able to select "for" and "example" separately (or "forex" and "ample"). I already found that spaCy doesn't work with subwords, and I don't want to use any kind of vocabulary to divide the words in advance.
So my question is: can I do something like this with Prodigy? And if yes, can you guide me or point me to any similar examples?
Hi! If you're using Prodigy for manual span annotation, it will pre-tokenize the text so your selection can snap to the token boundaries. For most token-based annotation tasks (NER, POS tags), this is nice, because you don't have to hit the exact boundaries and can annotate much faster. It also lets you spot tokenization issues early because you can't really train token-based models on annotations that don't map to tokens.
However, it does mean that you can't just select half a token. If words with missing spaces appear only occasionally, you could use a separate label for them (e.g. MESSY_SUBWORDS), highlight the tokens in question and annotate the rest, and then filter out all examples containing "spans" with MESSY_SUBWORDS afterwards. You can then add the missing spaces, or add the character offsets of the subwords (depending on what you need).
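As a rough sketch of that post-processing step (assuming you've exported your annotations to a JSONL file – the file name and label here are just placeholders), you could filter like this:

```python
import json

# Hypothetical export path – replace with your own db-out file.
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

def has_messy_spans(eg):
    # True if any annotated span uses the placeholder MESSY_SUBWORDS label.
    return any(span.get("label") == "MESSY_SUBWORDS" for span in eg.get("spans", []))

clean_examples = [eg for eg in examples if not has_messy_spans(eg)]
messy_examples = [eg for eg in examples if has_messy_spans(eg)]
```

You'd then keep `clean_examples` for training and fix up `messy_examples` separately.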
You could also stream in only the messy span texts one by one, add a "tokens" field with one token per character and then highlight the individual subwords. This would show you something like: f o r e x a m p l e l i k e t h i s – and you'd then highlight for, example, like and this.
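For example, a minimal sketch of such a stream (the helper name is made up, and the exact keys you include per token may vary – but Prodigy tokens generally need "text", "start", "end" and "id"):

```python
def char_tokens(text):
    # One token per character so the UI lets you highlight arbitrary
    # character spans within a glued-together word.
    return [
        {"text": ch, "start": i, "end": i + 1, "id": i, "ws": True}
        for i, ch in enumerate(text)
    ]

# Stream of messy span texts, each pre-tokenized into single characters.
stream = (
    {"text": text, "tokens": char_tokens(text)}
    for text in ["forexamplelikethis"]
)
```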
If you want to train a model that predicts token-based tags on annotations that refer to partial tokens, those "subwords" should be individual tokens. Of course, it's always nice to do the segmentation programmatically, but you can also use the retokenizer to split tokens.
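Here's a small sketch of splitting a glued token with spaCy's retokenizer (using a blank pipeline just for illustration – the `heads` values are placeholders since there's no parse here):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("forexample likethis")

# Split the glued token "forexample" into "for" + "example".
# The subtoken texts must concatenate to the original token text.
with doc.retokenize() as retokenizer:
    token = doc[0]
    retokenizer.split(token, ["for", "example"], heads=[(token, 0), (token, 0)])

print([t.text for t in doc])  # ['for', 'example', 'likethis']
```

You could wrap logic like this in a custom pipeline component so the splitting happens consistently at training and runtime.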