sense2vec training questions

Hello!

My team and I are looking to create an NER model that will be used to detect 'government speak' in a large corpus of text data. Some labels are fairly standard for NER, such as PERSON, LOCATION and ORGANISATION, while others are a bit more specific, like FORM (e.g. P45, passenger locator form, P60) or SCHEME (e.g. winter fuel allowance, help to buy, shared parental leave).

We think we would benefit from using sense2vec with some seed terms like those above to find synonyms or contextually similar terms, which we could then turn into match patterns.
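For reference, this is roughly the kind of lookup we have in mind, using the pretrained Reddit vectors (the vectors path and the exact sense keys are assumptions on our part, and our seed terms may well not exist in the Reddit vocabulary):

```python
from sense2vec import Sense2Vec

# Assumed path to a downloaded pretrained vectors package, e.g. s2v_reddit_2015_md
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

# Look up a seed term with its sense tag and find contextually similar phrases
query = "winter_fuel_allowance|NOUN"
if query in s2v:
    for phrase, score in s2v.most_similar(query, n=10):
        print(phrase, score)
```

The returned phrases would then be reviewed and turned into match patterns.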

When training sense2vec, do you extend training that has already been done (for example, on the Reddit comments), or do you start from scratch?
If the former, do you recommend 1bn words on top of the Reddit corpus, or 1bn words from scratch?

Can you use the corpus that you will later be performing NER on to train your sense2vec model?

Is it very resource intensive? We will be able to spin up some GPUs in the cloud, but are there any rough estimates we can get for a dataset of 1bn words?

Many thanks for any advice.

Rory

We'd recommend ~1bn words in total when training from scratch.

Yes, that's generally fine, especially since you'd probably only be using a subset of the data for annotation if you have a lot of raw text available.

The part that's somewhat resource intensive is parsing the raw data with spaCy so you can merge noun phrases, entities etc. This is the step that saves out the serialized parsed Doc objects as .spacy files. The nice thing is that you can totally run this in parallel, so we just used a cluster with multiple workers for it. The step of actually training the vectors isn't very resource intensive, so you can run it on a basic server with just a CPU.
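To illustrate the parsing step, here's a minimal sketch using spaCy's DocBin to save out serialized Doc objects (the file paths, model name and worker count are placeholders, and the actual sense2vec preprocessing scripts do more, such as merging noun phrases and entities into single tokens with sense tags):

```python
import spacy
from spacy.tokens import DocBin

# Assumed model; a larger pipeline would normally be used for better parses
nlp = spacy.load("en_core_web_sm")

# Assumed input: one raw text per line
with open("raw_corpus.txt", encoding="utf8") as f:
    texts = [line.strip() for line in f if line.strip()]

# Parse in parallel across processes and collect the Doc objects
doc_bin = DocBin(store_user_data=True)
for doc in nlp.pipe(texts, n_process=4, batch_size=100):
    doc_bin.add(doc)

# Serialized parsed Docs, ready for the sense2vec preprocessing step
doc_bin.to_disk("parsed_corpus.spacy")
```

Each worker in a cluster would run something like this over its own shard of the raw text, which is what makes the parsing step easy to parallelize.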

Thanks so much for getting back to me on this - much appreciated