model with subword

Georgy · February 5, 2020, 12:25pm

Greetings. A colleague of mine recently started using Prodigy. I got interested in it because I need a model to analyse quite chaotic text from the internet. Sometimes spaces are missing and two or more words are combined into one: "forexample likethis".

I want to be able to select subwords.
In the previous example, I should be able to select "for" and "example" separately (or "forex" and "ample"). I already found that spaCy doesn't work with subwords, and I don't want to use any kind of vocabulary to split words in advance.

So my question is: can I do something like this with Prodigy? If yes, can you guide me or point me to any similar examples?

Thanks in advance.

ines · February 6, 2020, 12:55pm

Hi! If you're using Prodigy for manual span annotation, it will pre-tokenize the text so your selection can snap to the token boundaries. For most token-based annotation tasks (NER, POS tags), this is nice, because you don't have to hit the exact boundaries and can annotate much faster. It also lets you spot tokenization issues early because you can't really train token-based models on annotations that don't map to tokens.

However, it does mean that you can't just select half a token. If words with missing spaces appear only occasionally, you could use a separate label for them (e.g. MESSY_SUBWORDS), highlight the tokens in question and annotate the rest, and then filter out all examples containing "spans" with MESSY_SUBWORDS afterwards. You can then add the missing spaces, or add the character offsets of the subwords (depending on what you need).
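The filtering step could look roughly like the sketch below. It assumes the annotations are available as plain dicts with a "spans" list whose entries carry a "label" key (for instance after exporting a dataset to JSONL). The helper name `partition_by_label` and the sample data are made up for illustration.

```python
# Hypothetical helper: separate examples that contain a span with the
# placeholder label from the ones that can be used as they are.
MESSY_LABEL = "MESSY_SUBWORDS"


def partition_by_label(examples, label=MESSY_LABEL):
    """Return (clean, messy) lists of annotation dicts."""
    clean, messy = [], []
    for eg in examples:
        spans = eg.get("spans") or []  # examples without spans count as clean
        if any(span.get("label") == label for span in spans):
            messy.append(eg)
        else:
            clean.append(eg)
    return clean, messy


# Made-up sample data in the assumed format
examples = [
    {"text": "see forexample here",
     "spans": [{"start": 4, "end": 14, "label": "MESSY_SUBWORDS"}]},
    {"text": "Berlin is nice",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
    {"text": "no spans here"},
]

clean, messy = partition_by_label(examples)
```

The `messy` examples are the ones to fix up by hand or re-annotate at the subword level, while `clean` can go straight into training.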

You could also stream in only the messy span texts one by one, add a "tokens" field with one token per character and then highlight the individual subwords. This would show you something like: f o r e x a m p l e l i k e t h i s – and you'd then highlight for, example, like and this.
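As a rough sketch of the one-token-per-character idea: the function below builds a task dict with a "tokens" list, using the "text", "start", "end", "id" and "ws" keys as I understand Prodigy's token format (worth double-checking against the docs). The name `make_char_task` is hypothetical. Whitespace characters are not turned into tokens; they are recorded as trailing whitespace on the preceding token instead.

```python
def make_char_task(text):
    """Build an annotation task where every non-space character is a token."""
    tokens = []
    for i, ch in enumerate(text):
        if ch.isspace():
            # Record the space on the previous token rather than
            # creating a token for it.
            if tokens:
                tokens[-1]["ws"] = True
            continue
        tokens.append({
            "text": ch,
            "start": i,          # character offset into the original text
            "end": i + 1,
            "id": len(tokens),   # running token index
            "ws": False,
        })
    return {"text": text, "tokens": tokens}


task = make_char_task("forexample likethis")
```

A stream of such tasks could then be passed to a manual span interface, where highlighting the characters of "for" yields a span with character offsets 0 to 3.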

If you want to train a model that predicts token-based tags on annotations that refer to partial tokens, those "subwords" should be individual tokens. Of course, it's always nice to do the segmentation programmatically, but you can also use the retokenizer to split tokens.
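Here is a minimal sketch of splitting tokens with spaCy's retokenizer, assuming you already know where the splits should go (for example from your annotations). It uses a blank English pipeline, so no trained model is needed. The helper `split_token` is made up for illustration, and each subtoken is attached to itself as its head, which should be fine as long as you don't rely on the dependency parse.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("forexample likethis")


def split_token(doc, index, pieces):
    """Split doc[index] into the given pieces, modifying the doc in place.

    The pieces must concatenate to the original token text.
    """
    token = doc[index]
    # Attach every new subtoken to itself, since no parse is needed here.
    heads = [(token, i) for i in range(len(pieces))]
    with doc.retokenize() as retokenizer:
        retokenizer.split(token, pieces, heads=heads)


# Split from right to left so earlier token indices stay valid.
split_token(doc, 1, ["like", "this"])
split_token(doc, 0, ["for", "example"])

tokens = [t.text for t in doc]
```

After this, annotations that refer to "for" or "example" map onto real token boundaries, so token-based models can be trained on them.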
