Any suggestions on the crazy tagging example I showed above, where nearly every word in the sentence gets labeled? Is that just a lack of training examples, or did I do something wrong along the way?
I think it comes down to starting from a model that has the ORG entity, and then forcing it to adjust to your COMPANY annotations. That makes the learning task quite difficult, and you'll get better results if you start with a vectors-only or blank model.
@honnibal @ines We have a similar requirement in our project. As part of it, I created a model using a corpus of words from about 1,000 documents (all of which have a similar structure, as they are machine-generated and come from the same document source), following the steps below. However, when running terms.teach with a specified set of seed terms in the same context, the web UI shows wayward suggestions, most of which have to be rejected. Below are the steps I followed to generate the initial model from scratch.
So I need guidance on:
- Whether the steps below are the right approach for creating the initial model.
- I used the --merge-nps option when creating the initial model, since some of the named entities will be multi-word. So passing multi-word seed terms should be possible, right?
- Since 60-70% of the documents in production will be machine-generated by one system or another, and will therefore follow one consistent format or another, should the same model generated below be trained further with a patterns file?
- Is there any documentation on what patterns to create for different situations? For example, one frequent requirement: a document contains the line below, and I need to extract "ABC Pte Ltd" (the number of words in the company name can vary) into my dataset.
Company: Person @ ABC Pte Ltd
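For the line above, a patterns file might look something like the sketch below. This is only an illustration under assumptions: the COMPANY label and the token attributes are my guesses, not something from an existing setup. Prodigy patterns files are JSONL, one object per line, where "pattern" uses spaCy's token-matcher syntax; a variable-length company name ending in "Pte Ltd" could be roughly approximated with an "op": "+" quantifier on a title-cased token:

```
{"label": "COMPANY", "pattern": [{"lower": "abc"}, {"lower": "pte"}, {"lower": "ltd"}]}
{"label": "COMPANY", "pattern": [{"is_title": true, "op": "+"}, {"lower": "pte"}, {"lower": "ltd"}]}
```

The second pattern will over-match (any run of title-cased tokens before "Pte Ltd"), so it is best treated as a starting point for recipes that take a --patterns argument, where you still accept or reject each suggestion.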
Steps followed to create my own model:
python -m prodigy terms.train-vectors abc_model texts_stemmed.txt --spacy-model en_core_web_sm --size 300 --window 5 --min-count 2 --n-workers 2 --merge-nps
Contents of texts_stemmed.txt (one list per line, containing all words of one document, for the 1,000-odd documents):
["doc1_word1", "doc1_word2", "doc1_word3", "doc1_word4", "doc1_word5", "doc1_word6"]
["doc2_word1", "doc2_word2", "doc2_word3"]
["doc3_word1", "doc3_word2", "doc3_word3", "doc3_word4", "doc3_word5"]
["docn_word1", "docn_word2", "docn_word3", ...................., "docn_wordm"]
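For what it's worth, here is a minimal sketch of how a file in that shape could be produced and read back. The docs list and the filename are placeholders, and this says nothing about what terms.train-vectors itself expects; it only shows that each line is an independently parseable JSON array of tokens:

```python
import json

# Hypothetical tokenized documents; in practice these would come from
# the ~1,000 machine-generated source documents.
docs = [
    ["doc1_word1", "doc1_word2", "doc1_word3"],
    ["doc2_word1", "doc2_word2"],
]

# Write one JSON-encoded word list per line, matching the format above.
with open("texts_stemmed.txt", "w", encoding="utf-8") as f:
    for words in docs:
        f.write(json.dumps(words) + "\n")

# Each line parses independently with json.loads.
with open("texts_stemmed.txt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
```

Writing real JSON per line (double quotes, no trailing commas) avoids parse errors that a hand-rolled str(list) dump with single quotes would cause.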