Any suggestions on the crazy tagging example I showed above, where nearly every word in the sentence gets labeled? Is that just a lack of training examples, or did I do something wrong along the way?
I think it comes down to starting from a model that has the ORG entity, and then forcing it to adjust to your COMPANY annotations. That makes the learning task quite difficult, and you'll get better results if you start with a vectors-only or blank model.
@honnibal @ines We have a similar requirement in our project. As part of it, I created a model using a corpus of words from about 1,000 documents (all of which have a similar structure, as they are machine-generated and come from the same document source), following the steps below. However, when running terms.teach with a specified set of seed terms in the same context, the web UI shows wayward suggestions, most of which have to be rejected. Below are the steps I followed to generate the initial model from scratch.
So I need guidance on:
- Whether the steps below are the right approach for creating the initial model.
- I used the --merge-nps option when creating the initial model, since some of the named entities will be multi-word. So passing multi-word seed terms should be possible, right?
- Since 60-70% of the documents in production will be machine-generated by one system or another, and will therefore follow one consistent format or another, should the same model generated below be trained further with a patterns file?
- Is there any documentation on what patterns to create for different situations? For example, one frequent requirement: a document contains the line below, and I need to extract "ABC Pte Ltd" (the number of words in the company name can vary) into my dataset.
Company: Person @ ABC Pte Ltd
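For the line above, a patterns file might look something like the sketch below. This is only an illustration under assumptions: the COMPANY label and the token attributes are my guesses, not something from an existing setup. Prodigy patterns files are JSONL, one object per line, where "pattern" uses spaCy's token-matcher syntax; a variable-length company name ending in "Pte Ltd" could be roughly approximated with an "op": "+" quantifier on a title-cased token:

```
{"label": "COMPANY", "pattern": [{"lower": "abc"}, {"lower": "pte"}, {"lower": "ltd"}]}
{"label": "COMPANY", "pattern": [{"is_title": true, "op": "+"}, {"lower": "pte"}, {"lower": "ltd"}]}
```

The second pattern will over-match (any run of title-cased tokens before "Pte Ltd"), so it is best treated as a starting point for recipes that take a --patterns argument, where you still accept or reject each suggestion.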
Steps followed to create my own model:
python -m prodigy terms.train-vectors abc_model texts_stemmed.txt --spacy-model en_core_web_sm --size 300 --window 5 --min-count 2 --n-workers 2 --merge-nps
Contents of texts_stemmed.txt (one list per line, containing all words of one document, for the 1,000-odd documents):
["doc1_word1", "doc1_word2", "doc1_word3", "doc1_word4", "doc1_word5", "doc1_word6"]
["doc2_word1", "doc2_word2", "doc2_word3"]
["doc3_word1", "doc3_word2", "doc3_word3", "doc3_word4", "doc3_word5"]
["docn_word1", "docn_word2", "docn_word3", ...................., "docn_wordm"]
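For what it's worth, here is a minimal sketch of how a file in that shape could be produced and read back. The docs list and the filename are placeholders, and this says nothing about what terms.train-vectors itself expects; it only shows that each line is an independently parseable JSON array of tokens:

```python
import json

# Hypothetical tokenized documents; in practice these would come from
# the ~1,000 machine-generated source documents.
docs = [
    ["doc1_word1", "doc1_word2", "doc1_word3"],
    ["doc2_word1", "doc2_word2"],
]

# Write one JSON-encoded word list per line, matching the format above.
with open("texts_stemmed.txt", "w", encoding="utf-8") as f:
    for words in docs:
        f.write(json.dumps(words) + "\n")

# Each line parses independently with json.loads.
with open("texts_stemmed.txt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
```

Writing real JSON per line (double quotes, no trailing commas) avoids parse errors that a hand-rolled str(list) dump with single quotes would cause.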