Hello!
My team and I are looking to create a NER model that will be used to detect 'government speak' in a large corpus of text data. Some labels are fairly standard for NER, such as PERSON, LOCATION, and ORGANISATION, while others are a bit more specific, like FORM (e.g. P45, passenger locator form, P60) or SCHEME (e.g. winter fuel allowance, help to buy, shared parental leave).
We think we would benefit from using sense2vec with some seed terms like those above, in order to find synonyms or contextually similar terms that would then form match patterns (see the rough sketch below).
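To make the idea concrete, here is a minimal sketch of what we have in mind, assuming the standalone sense2vec API and the pretrained Reddit vectors on disk; the seed keys, labels, and file path are purely illustrative and may not all exist in the vectors:

```python
from sense2vec import Sense2Vec

# Load pretrained vectors (illustrative path to the downloaded Reddit model).
s2v = Sense2Vec().from_disk("./s2v_reddit_2015_md")

# Hypothetical seed keys for our custom labels.
seeds = {
    "FORM": ["passenger_locator_form|NOUN"],
    "SCHEME": ["winter_fuel_allowance|NOUN", "help_to_buy|NOUN"],
}

patterns = []
for label, keys in seeds.items():
    for key in keys:
        if key not in s2v:
            continue  # a seed phrase may not be present in the vector table
        # Expand each seed with its nearest neighbours in the vector space.
        for similar_key, score in s2v.most_similar(key, n=10):
            term, _sense = s2v.split_key(similar_key)
            # Turn the candidate phrase into a token-based match pattern.
            tokens = term.replace("_", " ").split()
            patterns.append({
                "label": label,
                "pattern": [{"LOWER": t.lower()} for t in tokens],
            })

print(patterns[:5])
```

The resulting patterns would then be reviewed manually and loaded into something like spaCy's EntityRuler to bootstrap annotation.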
When training sense2vec, do you extend existing training, such as that done on the Reddit comments, or do you start from scratch?
If the former, do you recommend 1bn words on top of the Reddit corpus, or 1bn words from scratch?
Can you use the corpus on which you will later be performing NER to train your sense2vec model?
Is it very resource-intensive? We will be able to spin up some GPUs in the cloud, but are there any rough estimates for a dataset of 1bn words?
Many thanks for any advice.
Rory