NER Training for Corporate Names

Any suggestions on the crazy tagging example I showed above, where nearly every word in a sentence gets labeled? Is that just a lack of training examples, or did I do something wrong along the way?

I think it comes down to starting from a model that already has the ORG entity and then forcing it to adjust to your COMPANY annotations. That makes the learning task quite difficult, and you'll get better results if you start with a vectors-only or blank model.

@honnibal @ines We have a similar requirement in our project. As part of it, I created a model from a corpus of words drawn from about 1000 documents (all of which have a similar structure, as they are machine-generated and come from the same document source), following the steps below. However, when running terms.teach with a set of seed words in the same context, the web UI shows wayward suggestions, most of which have to be rejected.

So I need guidance on:

  1. Are the steps below the right approach to create the initial model?
  2. I used the --merge-nps option when creating the initial model, since some of the named entities will be multi-word. So passing multi-word seeds should be possible, right?
  3. Since 60-70% of the documents in production will be machine-generated from one system or another, and will therefore follow one consistent format or another, should the same model generated below be trained further with a patterns file?
  4. Is there any documentation on what patterns to create for different situations? For example, one frequent requirement is: a document contains the line below, and I need to extract "ABC Pte Ltd" (the number of words in the company name can vary) into my dataset.
    Company: Person @ ABC Pte Ltd
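For the "Company: Person @ ABC Pte Ltd" case above, one option is a patterns file in the token-pattern style spaCy's Matcher uses, which Prodigy's --patterns option accepts. The sketch below is only an illustration under my assumptions: the COMPANY label, the "company_patterns.jsonl" filename, and the idea of matching one or more title-cased tokens before a "Pte Ltd" suffix are all choices I'm making up for the example, not something from the thread.

```python
import json

# Hypothetical sketch: build a patterns.jsonl for companies ending in "Pte Ltd".
# Token attributes (LOWER, IS_TITLE, OP) follow spaCy's Matcher syntax.
patterns = [
    # Exact multi-word match for one known company name.
    {"label": "COMPANY",
     "pattern": [{"LOWER": "abc"}, {"LOWER": "pte"}, {"LOWER": "ltd"}]},
    # One or more title-cased tokens followed by the "Pte Ltd" suffix,
    # so the number of words in the company name can vary.
    {"label": "COMPANY",
     "pattern": [{"IS_TITLE": True, "OP": "+"},
                 {"LOWER": "pte"}, {"LOWER": "ltd"}]},
]

# Patterns files are JSONL: one JSON object per line.
with open("company_patterns.jsonl", "w", encoding="utf-8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

The second pattern will over-match anything title-cased before "Pte Ltd", so you'd still review the suggestions in the annotation UI rather than trust it blindly.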

Steps followed to create own model:

python -m prodigy terms.train-vectors abc_model texts_stemmed.txt --spacy-model en_core_web_sm --size 300 --window 5 --min-count 2 --n-workers 2 --merge-nps

Contents of texts_stemmed.txt (one list per line, containing all words of each of the 1000-odd documents):

["doc1_word1", "doc1_word2", "doc1_word3", "doc1_word4", "doc1_word5", "doc1_word6"]
["doc2_word1", "doc2_word2", "doc2_word3"]
["doc3_word1", "doc3_word2", "doc3_word3", "doc3_word4", "doc3_word5"]
["docn_word1", "docn_word2", "docn_word3", ...................., "docn_wordm"]
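For what it's worth, a file in that shape can be produced with a few lines of Python. This is only a minimal sketch of the format shown above, assuming a naive whitespace split; the sample document strings are invented, and you would substitute your own stemmer/tokeniser for the real corpus.

```python
import json

# Invented sample documents standing in for the 1000-odd real ones.
docs = [
    "Company: Person @ ABC Pte Ltd",
    "Invoice issued by XYZ Holdings",
]

# One JSON list of tokens per line, matching the texts_stemmed.txt format.
# split() is a naive whitespace tokeniser; replace with real stemming.
with open("texts_stemmed.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc.split()) + "\n")
```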