sense2vec training questions


My team and I are looking to create a NER model that will be used to detect 'government speak' in a large corpus of text data. Some labels are fairly standard for NER, such as PERSON, LOCATION and ORGANISATION, and others are a bit more specific, like FORM (e.g. p45, passenger locator form, p60) or SCHEME (e.g. winter fuel allowance, help to buy, shared parental leave).

We think we would benefit from using sense2vec with some seed terms like those above, in order to find synonyms or contextually similar terms, which would then form match patterns.
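For illustration, here's a minimal stdlib-only sketch of how a list of phrases (seed terms, or nearest neighbours returned by sense2vec) could be turned into token-level match patterns in the JSONL-style format used by Prodigy and spaCy's EntityRuler. The phrase list and `phrase_to_pattern` helper here are hypothetical, just to show the shape of the output:

```python
# Sketch: turn phrases (seed terms or sense2vec neighbours) into
# spaCy-style token match patterns. The phrases below are illustrative.
def phrase_to_pattern(phrase):
    """One {"LOWER": ...} token per word, so matching is case-insensitive."""
    return [{"LOWER": token} for token in phrase.lower().split()]

scheme_terms = ["winter fuel allowance", "help to buy", "shared parental leave"]
patterns = [
    {"label": "SCHEME", "pattern": phrase_to_pattern(term)}
    for term in scheme_terms
]
print(patterns[0])
# {'label': 'SCHEME', 'pattern': [{'LOWER': 'winter'}, {'LOWER': 'fuel'}, {'LOWER': 'allowance'}]}
```

Each pattern dict can then be written out as one line of JSONL and loaded as match patterns.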

When training sense2vec, do you extend training that has been done before, such as the pretrained Reddit vectors, or do you start from scratch?
If the former, do you recommend 1bn words on top of the Reddit corpus, or 1bn words from scratch?

Can you use the same corpus that you'll later be performing NER on to train your sense2vec model?

Is it very resource-intensive? We will be able to spin up some GPUs in the cloud, but are there any rough estimates for a dataset of 1bn words?

Many thanks for any advice.


We'd recommend ~1bn words in total when training from scratch.

Yes, that's generally fine, especially since you'd probably only be using a subset of the data for annotation if you have a lot of raw text available.

The part that's somewhat resource-intensive is parsing the raw data with spaCy so you can merge noun phrases, entities etc. This is the step that saves out the serialized parsed Doc objects as .spacy files. But the nice thing is that you can totally run this in parallel, so we just used a cluster with multiple workers for this. The step of actually training the vectors isn't very resource-intensive, so you can run it on a basic server with just a CPU.
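As a rough stdlib-only illustration of what that merging step produces (this is a stand-in, not the actual sense2vec preprocessing script): in the real pipeline spaCy's parser and NER identify noun phrases and entities, which are merged into single tokens keyed as "phrase|SENSE" strings before the vectors are trained. Here the spans and sense labels are hard-coded just to show the output format:

```python
# Stand-in for the sense2vec merging step: in the real pipeline spaCy's
# parser/NER finds the spans; here they're supplied by hand to show how
# multi-word phrases become single "phrase|SENSE" tokens.
def merge_phrases(tokens, spans):
    """spans: list of (start, end, sense) over token indices, end exclusive."""
    starts = {start: (end, sense) for start, end, sense in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, sense = starts[i]
            out.append("_".join(tokens[i:end]) + "|" + sense)
            i = end
        else:
            # The real pipeline tags remaining tokens with their
            # part-of-speech; "POS" here is just a placeholder.
            out.append(tokens[i] + "|POS")
            i += 1
    return out

tokens = "you can claim the winter fuel allowance online".split()
spans = [(4, 7, "SCHEME")]  # hypothetical entity span
print(merge_phrases(tokens, spans))
# ['you|POS', 'can|POS', 'claim|POS', 'the|POS',
#  'winter_fuel_allowance|SCHEME', 'online|POS']
```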

Thanks so much for getting back to me on this - much appreciated