I am working on a project whose end objective is to annotate the entire English Wikipedia corpus for a new entity type, 'SOCIAL_ENTERPRISE', which overlaps with but is distinct from the 'ORG' entity type recognized by spaCy's en_core_web_lg model. Since I am new to Prodigy and to machine learning in general, I was wondering what the best approach would be to get the most accurate NER model for the Wiki corpus.
More specifically, I have the following questions:
- Do you suggest that I use fully manual annotation or annotation with suggestions from en_core_web_lg?
- How should I go about creating the gold standard small dataset? Should this dataset be based on texts from Wikipedia itself? How many sentences should this dataset ideally have? How do I pick this sample of sentences? Should the dataset contain 50% sentences with at least one SOCIAL_ENTERPRISE entity in it and 50% with no entity of this type?
Please advise as I am not aware of the best practices.
Annotating Wikipedia is a bit special, because there's so much metadata to take advantage of. Using that metadata does complicate the project quite a lot, but if you really want to get annotations onto all of the Wikipedia text, it's a good option.
This paper is a good starting point into the literature on using Wikipedia markup for NER: "Learning multilingual named entity recognition from Wikipedia" (on Semantic Scholar). You can look at its citations to see what has been published since, or at the related papers.
The idea, briefly, is that working with Wikipedia lets you work at the page level, rather than the sentence level. If you can classify a page as being about a social enterprise, then any link to that page from another article will be a reference to a social enterprise as well. It's also against Wikipedia editing guidelines to mention an entity that hasn't been linked anywhere in the page yet. So when you're working with text in Wikipedia, all of the other entity mentions that aren't links are going to co-refer to one of the entities that has been linked. This can help you bootstrap the annotations very effectively.
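To make the bootstrapping idea a bit more concrete, here is a minimal sketch, not a finished pipeline. It assumes you already have a set of page titles you've classified as social enterprises (the hypothetical `social_titles`), and that the article text still contains simple `[[Target]]` or `[[Target|anchor]]` links. The function name `links_to_spans` and the sample text are made up for illustration, and real Wikipedia markup (templates, nested links, redirects) would need a proper parser. The output is a dict with `text` and `spans` (`start`, `end`, `label`), which is the shape Prodigy's NER recipes read from JSONL.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links (simple cases only).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def links_to_spans(wikitext, social_titles, label="SOCIAL_ENTERPRISE"):
    """Strip wiki links and label mentions of pages listed in social_titles.

    Hypothetical helper: linked mentions are labelled from the link target,
    then unlinked repeats of the same surface form are labelled as well.
    """
    parts, spans, known_names = [], [], set()
    pos = 0      # read position in the raw wikitext
    out_len = 0  # length of the plain text produced so far
    for m in WIKILINK.finditer(wikitext):
        before = wikitext[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        target = m.group(1).strip()
        anchor = m.group(2) or m.group(1)
        if target in social_titles:
            spans.append({"start": out_len, "end": out_len + len(anchor), "label": label})
            known_names.add(anchor)
        parts.append(anchor)
        out_len += len(anchor)
        pos = m.end()
    parts.append(wikitext[pos:])
    text = "".join(parts)

    # Propagate to unlinked mentions: exact string matches of anchors seen above,
    # skipping anything that overlaps a span we already have.
    taken = [(s["start"], s["end"]) for s in spans]
    for name in known_names:
        for m in re.finditer(re.escape(name), text):
            if not any(m.start() < end and start < m.end() for start, end in taken):
                spans.append({"start": m.start(), "end": m.end(), "label": label})
                taken.append((m.start(), m.end()))
    spans.sort(key=lambda s: s["start"])
    return {"text": text, "spans": spans}


# Made-up example input; the title set would come from your page classifier.
example = links_to_spans(
    "[[Grameen Bank]] was founded by [[Muhammad Yunus]]. Grameen Bank lends to the poor.",
    social_titles={"Grameen Bank"},
)
```

Exact string matching for the unlinked mentions is deliberately crude. It will miss abbreviations and partial names, so treat the result as pre-annotations to correct in Prodigy rather than as gold data.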
Many thanks @honnibal. I will be perusing the PDF and related literature.