Hi, I was wondering if there is a way to manually load a dictionary of company names into the Prodigy database and label them, for example as ORG. If not, is it possible to create my own custom recipe for this?
Thanks
Example:
{"text": "WMT"}
{"text": "Walmart Inc."}
{"text": "Walmart"}
{"text": "PTR"}
{"text": "PetroChina Company Limited"}
{"text": "VWAGY"}
{"text": "Volkswagen AG ADR Repstg 1/10th Sh"}
{"text": "Volkswagen"}
{"text": "AMZN"}
{"text": "Amazon.com Inc."}
{"text": "Amazon"}
{"text": "Amazon.com"}
{"text": "KELYB"}
{"text": "Kelly Services Inc. Class B Common Stock"}
{"text": "Kellyservices"}
{"text": "Kelly Services"}
{"text": "KELYA"}
{"text": "Kelly Services Inc. Class A Common Stock"}
{"text": "CHL"}
{"text": "China Mobile Limited"}
{"text": "Chinamobileltd"}
{"text": "China Mobile"}
Hi! In that case, you could just load the patterns with spaCy directly to label all matches automatically, and then use that data to pre-train your model. My comment here explains how to do this:
Using the EntityRuler has the advantage that it takes patterns in the same format as Prodigy and takes care of filtering out overlaps (which can theoretically occur with multiple patterns).
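As a rough illustration of the EntityRuler approach, here is a minimal sketch. It assumes spaCy v3 (where the ruler is added by its string name) and a blank English pipeline; the `to_task` helper and the sample sentence are made up for this example and are not part of spaCy or Prodigy.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline only has a tokenizer,
# so no trained model needs to be downloaded.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Patterns in the same format Prodigy uses: a label plus either a string
# (exact phrase match) or a list of token descriptions.
patterns = [
    {"label": "ORG", "pattern": "China Mobile"},
    {"label": "ORG", "pattern": "China Mobile Limited"},
    {"label": "ORG", "pattern": [{"lower": "walmart"}]},
]
ruler.add_patterns(patterns)


def to_task(text):
    """Hypothetical helper: turn a text into a Prodigy-style task dict."""
    doc = nlp(text)
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    return {"text": text, "spans": spans}


task = to_task("China Mobile Limited and Walmart were mentioned.")
for span in task["spans"]:
    print(task["text"][span["start"]:span["end"]], span["label"])
```

"China Mobile" and "China Mobile Limited" overlap in this sentence, and the ruler should keep only the longer match. Writing one such dict per line to a JSONL file should give you something you can import into a dataset, for example with `prodigy db-in`.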
Sorry, another question: would Prodigy be able to pick up misspellings from commenters in social media groups? For example, someone might write Volksvagen instead of Volkswagen. Is there a way to compare an actual entity in the Prodigy database with the misspelled 'entity', so that Prodigy identifies it as an entity if the similarity is high enough? Or would I have to manually curate all the misspellings and slang people use for companies?
Yes, I already have a set of curated labels in a Prodigy model. What I forgot to ask in my question is: when I upload the file directly to the Prodigy database, is there a way I can assign an ORG label to all of the companies in that JSON file?
This is something that a trained model would be able to do, and one of the advantages of training a model to predict similar entities in similar contexts (as opposed to just exact pattern matching). So if your training data is good and representative, your model will also be able to pick up on similar entities, including misspellings.
The approach I linked above lets you create Prodigy annotations based on a patterns file, so if you include patterns with the label ORG, the matches will be labelled as ORG in the data.
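To get from a list like the one in your example to a patterns file, something along these lines might work. It's only a sketch: `terms_to_patterns` is a made-up helper name, and it assumes every input line is a JSON object with a `"text"` key, as in your sample. String patterns are treated as exact matches.

```python
import json


def terms_to_patterns(lines, label="ORG"):
    """Hypothetical helper: convert {"text": ...} lines into pattern dicts."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term = json.loads(line)["text"]
        # A string pattern matches the exact phrase; the label is attached here.
        patterns.append({"label": label, "pattern": term})
    return patterns


# A few lines taken from the example above
terms = [
    '{"text": "WMT"}',
    '{"text": "Walmart Inc."}',
    '{"text": "Walmart"}',
]
patterns = terms_to_patterns(terms)

# One JSON object per line, ready to be written to e.g. patterns.jsonl
patterns_jsonl = "\n".join(json.dumps(p) for p in patterns)
print(patterns_jsonl)
```

Every match produced from these patterns will then carry the ORG label. If you need case-insensitive matching, you could use token-based patterns such as `[{"lower": "walmart"}]` instead of plain strings.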