I am working on a project whose end goal is to annotate the entire English Wikipedia corpus for a new entity type, 'SOCIAL_ENTERPRISE', which overlaps with but is distinct from the 'ORG' entity type recognized by spaCy's en_core_web_lg model. Since I am new to Prodigy and to machine learning in general, I was wondering what the best approach would be to end up with the most accurate NER model for the Wiki corpus.
More specifically, I have the following questions:
- Do you suggest that I use fully manual annotation or annotation with suggestions from en_core_web_lg?
- How should I go about creating the small gold-standard dataset? Should it be based on texts from Wikipedia itself? How many sentences should it ideally contain, and how do I pick this sample of sentences? For example, should the dataset contain 50% sentences with at least one SOCIAL_ENTERPRISE entity and 50% with no entity of this type?
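
To make the second question more concrete, here is a rough sketch of the 50/50 sampling idea I have in mind. Everything here is hypothetical: the seed term list, the `balanced_sample` helper, and matching on surface strings (which is only a crude stand-in for real entity candidates) are just my illustration, not something I have validated.

```python
import random

# Hypothetical seed terms that might signal a SOCIAL_ENTERPRISE mention;
# in practice these would come from a curated terminology list.
SEED_TERMS = {"social enterprise", "benefit corporation", "cooperative"}

def balanced_sample(sentences, n, seed=0):
    """Sample up to n sentences: ~50% containing a seed term, ~50% without.

    Only a sketch of the 50/50 idea from the question above; string
    matching is a placeholder for a real candidate-detection step.
    """
    with_term = [s for s in sentences
                 if any(t in s.lower() for t in SEED_TERMS)]
    without_term = [s for s in sentences if s not in with_term]
    rng = random.Random(seed)
    half = n // 2
    pos = rng.sample(with_term, min(half, len(with_term)))
    neg = rng.sample(without_term, min(n - len(pos), len(without_term)))
    sample = pos + neg
    rng.shuffle(sample)
    return sample

sents = [
    "Grameen Bank is often described as a social enterprise.",
    "Paris is the capital of France.",
    "The cooperative was founded in 1956.",
    "Mount Everest is the highest mountain on Earth.",
]
print(len(balanced_sample(sents, 4)))  # → 4
```

Is this kind of balanced sampling a reasonable way to build the evaluation set, or does it bias the evaluation compared with the natural entity density of Wikipedia?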
Please advise, as I am not aware of the best practices.