Working at the character level


I want to use Prodigy to annotate named entities with ner.manual (and possibly also ner.teach) at the character-as-token (as opposed to word-as-token) level. This is for use with a very specific dataset that often runs “words” together and heavily abbreviates them. For example, the phrase “Two Green Lamborghini Gallardos and Four Red Ferrari Enzos” might be written as “2LAMBGDOSRD, 4FARRENZGRN”.

I assume I can do the basic annotation part (i.e. manual) by replacing the spaCy tokeniser in the pipeline with something that just splits at characters?

Also what would it take for the teach part to work? I’m assuming I need the above, plus vectors trained on my dataset at the character (plus character-gram?) level - plus something to replace the dependency parser (maybe giving every token a fixed POS like “CHAR” - and making each character the child of the last)?

Does this sound remotely plausible?

Yes, you could replace the tokenizer, although the characters would then be displayed spaced out, so this might not be ideal for annotation.

Well, it depends on what standard you mean by "work" :) The NER doesn't require pre-trained vectors, and it doesn't use POS features. So you don't have to do anything but change the tokenization.
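As a rough illustration of "just change the tokenization", here is a minimal sketch of a character-level tokenizer. The names char_words_and_spaces and CharTokenizer are made up for this example, and it assumes spaCy's Doc(vocab, words=..., spaces=...) constructor. Whitespace is folded into the preceding token's trailing space rather than kept as a token, so leading whitespace and runs of spaces are not preserved exactly.

```python
def char_words_and_spaces(text):
    """Split text into one token per non-whitespace character.

    Returns (words, spaces), where spaces[i] says whether token i is
    followed by whitespace. Simplification: leading whitespace is dropped
    and runs of whitespace collapse to a single space.
    """
    words, spaces = [], []
    for ch in text:
        if ch.isspace():
            if spaces:
                spaces[-1] = True  # mark the previous character as followed by a space
            continue
        words.append(ch)
        spaces.append(False)
    return words, spaces


class CharTokenizer:
    """Hypothetical drop-in tokenizer: every character becomes a token."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Imported lazily so the helper above can be used without spaCy installed.
        from spacy.tokens import Doc

        words, spaces = char_words_and_spaces(text)
        return Doc(self.vocab, words=words, spaces=spaces)


words, spaces = char_words_and_spaces("2LAMBGDOSRD, 4FARRENZGRN")
print(words[:5], len(words))
```

To try it in a pipeline, something like nlp.tokenizer = CharTokenizer(nlp.vocab) should be enough. Whether the resulting interface is pleasant to annotate in is a separate question.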

However, I'm really not sure you'd get good results. The NER makes a lot of structural assumptions: the window size it's looking at, the expectation that most words are rare, and so on. Your data would really be breaking all of these assumptions. If you do want to try it, I would advise installing PyTorch and adding the setting hyper_params["bilstm_depth"] = 2 to the ner.batch-train recipe. Also, definitely focus on ner.batch-train first, not ner.teach: you need to get an initial model trained, otherwise the ner.teach recipe won't work.

I think you'll probably be better off with a rule-based approach to segment your tokens, or perhaps a custom machine learning model. But it's easy to try spaCy's NER and see how it goes, so you may as well give it a try. I think the bilstm_depth setting should give it a slightly better chance, as a BiLSTM layer is probably better suited to the task than the CNN we use by default.
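To make the rule-based option concrete, here is a minimal sketch of a greedy longest-match segmenter. The lexicon, the labels and the segment function are all invented for illustration, based only on the example string in the question. Real data would need a much larger abbreviation list and some handling for ambiguous matches.

```python
import re

# Hypothetical abbreviation lexicon: abbreviation -> label.
LEXICON = {
    "LAMB": "MAKE",
    "FARR": "MAKE",
    "GDOS": "MODEL",
    "ENZ": "MODEL",
    "RD": "COLOUR",
    "GRN": "COLOUR",
}


def segment(code, lexicon=LEXICON):
    """Split a run-together string into (text, label) pieces.

    Digits become QUANTITY, known abbreviations are matched longest-first,
    and any other character is emitted on its own as UNKNOWN.
    """
    max_len = max(len(key) for key in lexicon)
    segments = []
    i = 0
    while i < len(code):
        digits = re.match(r"\d+", code[i:])
        if digits:
            segments.append((digits.group(), "QUANTITY"))
            i += digits.end()
            continue
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(code) - i), 0, -1):
            piece = code[i:i + size]
            if piece in lexicon:
                segments.append((piece, lexicon[piece]))
                i += size
                break
        else:
            segments.append((code[i], "UNKNOWN"))
            i += 1
    return segments


print(segment("2LAMBGDOSRD"))
```

Pieces labelled UNKNOWN are a cheap way to find abbreviations the lexicon is missing, which could then be added by hand or reviewed in an annotation pass.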


Maybe this is a question for another topic, but if we tune the ["bilstm_depth"] hyperparameter, would that also increase the window size on which the representation is conditioned?

Yes, definitely. With the BiLSTM, the decisions will be conditioned on the whole input.
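For comparison, a back-of-the-envelope sketch of how far a stacked CNN can see. It assumes each convolutional layer looks one token to either side (window of 1), so every extra layer widens the view by one token per side. The function name is made up, and the default depth of 4 is an assumption about the default settings.

```python
def cnn_receptive_field(conv_depth, window=1):
    """Tokens visible to a single position after `conv_depth` stacked layers,
    where each layer looks `window` tokens to the left and to the right."""
    return 2 * conv_depth * window + 1


# Assumed default depth of 4: each position sees 9 tokens.
print(cnn_receptive_field(4))
# With conv_depth=7, each position sees 15 tokens.
print(cnn_receptive_field(7))
```

With one character per token, 9 tokens is only 9 characters, which is shorter than a single item such as "2LAMBGDOSRD" (11 characters). A BiLSTM has no such fixed limit.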

@honnibal Hello. I have tried using ["bilstm_depth"] = 2. In our use case we need to consider long-range dependencies, but I have also been pretraining the CNN layers over domain-specific embeddings.

I saw that in spaCy we could also pretrain the BiLSTM from scratch, but that would mean pretraining again, which would take some days given the amount of data, and we would like to use what we already have.

My question is whether it’s possible to stack the BiLSTM layer on top of the already pretrained CNNs, so that the CNN is used as the embed step (with the language-modelling pretraining for the domain-specific vectors) and the BiLSTM as the encoder of the surrounding context. I have tried the following:

optimizer = nlp.begin_training(component_cfg={"ner": {'embed_rows': 5000, 'require_vectors': False, 'cnn_maxout_pieces': 3, 'token_vector_width': 128, 'conv_depth': 7, "bilstm_depth":2}})

    with open(token2vec_dir +"/model100.bin", "rb") as file_:


File "/home/janzz11/anaconda3/envs/Spacy_New_2.1/lib/python3.6/site-packages/thinc/extra/", line 95, in from_bytes
filelike = BytesIO(data)
TypeError: a bytes-like object is required, not 'dict'

Without bilstm_depth, or without reading the pretrained weights file, it works.

I think this architecture could be useful: the models would be smaller (we are leaving the vectors out), so we could have multiple languages at the same time, model long dependencies with the BiLSTMs, and resume the pretraining of the CNNs. On top of that, it’s computationally very efficient and fast to train with spaCy.

@AlejandroJCR I think what you’re saying could work, although there can always be some fiddly details in what you fine-tune in such an architecture. We’ve also been experimenting with self-attention layers instead of BiLSTM.

Overall, I would caution that you should try to check whether your architecture works first, without the pretraining. If it’s already working well maybe the pretraining can help additionally, but if the model isn’t working at all, pretraining might not be the answer.

Leaving the pretraining to run for a few days does feel like a barrier, but if you’re confident the results will be good, it’s not actually much of a problem to start it running Thursday, do something else Friday and Monday, and pick it back up Tuesday…


Thank you for your opinion on this. I will let you know about our experiment results in the future. Kind regards