Working at the character level

Hi.

I want to use Prodigy to annotate named entities at the character-as-token (as opposed to word-as-token) level, using ner.manual and possibly also ner.teach. This is for use with a very specific dataset that often runs “words” together and heavily abbreviates them. For example, the phrase “Two Green Lamborghini Gallardos and Four Red Ferrari Enzos” might be written as “2LAMBGDOSRD, 4FARRENZGRN”.

I assume I can do the basic annotation (i.e. manual) part by replacing the spaCy tokeniser in the pipeline with something that just splits on characters?

Also what would it take for the teach part to work? I’m assuming I need the above, plus vectors trained on my dataset at the character (plus character-gram?) level - plus something to replace the dependency parser (maybe giving every token a fixed POS like “CHAR” - and making each character the child of the last)?

Does this sound remotely plausible?

Yes, you could replace the tokenizer, although the characters would then be displayed spaced out, so this might not be ideal for usability.
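For reference, here is a minimal sketch of what such a character-level tokenizer could look like. The names split_chars and CharTokenizer are made up for illustration and are not part of spaCy or Prodigy. It assumes you are happy to fold a single space into the preceding character as trailing whitespace, so the original text still round-trips.

```python
def split_chars(text):
    """Split text into one token per character.

    A single space is stored as trailing whitespace on the previous
    token; any other whitespace (or a repeated/leading space) becomes
    its own token so that the original text can be reconstructed.
    """
    words, spaces = [], []
    for ch in text:
        if ch == " " and words and not spaces[-1]:
            spaces[-1] = True
        else:
            words.append(ch)
            spaces.append(False)
    return words, spaces


class CharTokenizer:
    """Hypothetical drop-in tokenizer: a callable mapping text to a Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Imported lazily so split_chars can be tried without spaCy installed.
        from spacy.tokens import Doc

        words, spaces = split_chars(text)
        return Doc(self.vocab, words=words, spaces=spaces)


# Usage (assumes spaCy is installed):
#   import spacy
#   nlp = spacy.blank("en")
#   nlp.tokenizer = CharTokenizer(nlp.vocab)
#   doc = nlp("2LAMBGDOSRD, 4FARRENZGRN")
```

You would still need to make sure the same tokenizer is used both when annotating and when training, otherwise the span boundaries will not line up with the tokens.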

Well, it depends what standard you mean for “work”. The NER doesn’t require pre-trained vectors, and it doesn’t use POS features. So you don’t have to do anything but change the tokenization.

However, I’m really not sure you’d get good results. The NER makes a lot of structural assumptions about things like the window size it’s looking at, the fact that it expects most words to be rare, etc. Your data would break all of these assumptions. If you do want to try it, I would advise installing PyTorch and adding the setting hyper_params["bilstm_depth"] = 2 to the ner.batch-train recipe. Also, definitely focus on ner.batch-train first, not ner.teach: you need to get an initial model trained, otherwise the ner.teach recipe won’t work.

I think you’ll probably be better off with a rule-based approach to segment your tokens, or perhaps a custom machine learning model. But I suppose it’s easy to try spaCy’s NER and see how it goes, so you may as well give it a try. I think the bilstm_depth setting should give it a slightly better chance, as a BiLSTM layer is probably better suited to the task than the CNN we use by default.
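To give a rough idea of what the rule-based route could look like, here is a small sketch using greedy longest-match against an abbreviation lexicon. Everything in it is an assumption for illustration: the LEXICON entries and labels are guessed from the example string in the question, and segment is a made-up helper, not something from spaCy or Prodigy.

```python
import re

# Hypothetical lexicon mapping abbreviations to labels. The entries are
# guessed from the example in the question; real data needs its own list.
LEXICON = {
    "LAMB": "MAKE",
    "FARR": "MAKE",
    "GDOS": "MODEL",
    "ENZ": "MODEL",
    "RD": "COLOUR",
    "GRN": "COLOUR",
}

DIGITS = re.compile(r"\d+")


def segment(text):
    """Return (start, end, label) character spans found in text.

    Runs of digits are labelled QUANTITY. Otherwise the longest lexicon
    entry starting at the current position wins. Characters that match
    nothing are skipped.
    """
    keys = sorted(LEXICON, key=len, reverse=True)  # try longest first
    spans = []
    i = 0
    while i < len(text):
        m = DIGITS.match(text, i)
        if m:
            spans.append((m.start(), m.end(), "QUANTITY"))
            i = m.end()
            continue
        for key in keys:
            if text.startswith(key, i):
                spans.append((i, i + len(key), LEXICON[key]))
                i += len(key)
                break
        else:
            i += 1  # unknown character, e.g. punctuation or a space
    return spans


for start, end, label in segment("2LAMBGDOSRD, 4FARRENZGRN"):
    print(start, end, label)
```

A greedy matcher like this will go wrong when one abbreviation is a prefix of another or when abbreviations are ambiguous, so treat it as a baseline to compare a trained model against rather than a finished solution.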


Maybe this is a question for another topic, but if we tune the ["bilstm_depth"] hyperparameter, would that also increase the window size on which the representation is conditioned?

Yes, definitely. With the BiLSTM, the decisions will be conditioned on the whole input.