I am working on a project whose end goal is to annotate the entire English Wikipedia corpus for a new entity type, 'SOCIAL_ENTERPRISE', which overlaps with but is distinct from the 'ORG' entity type recognized by spaCy's en_core_web_lg model. Since I am new to Prodigy and to machine learning in general, I was wondering what the best approach would be to end up with the most accurate NER model for the Wiki corpus.
More specifically, I have the following questions:
- Do you suggest that I use fully manual annotation or annotation with suggestions from en_core_web_lg?
- How should I go about creating the small gold-standard dataset? Should it be based on texts from Wikipedia itself? How many sentences should it ideally contain, and how do I pick this sample of sentences? For example, should the dataset contain 50% sentences with at least one SOCIAL_ENTERPRISE entity and 50% with no entity of this type?
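
To make the second question more concrete, here is a rough sketch of the 50/50 sampling idea I have in mind. Everything here is hypothetical: the seed term list, the `balanced_sample` helper, and matching on surface strings (which is only a crude stand-in for real entity candidates) are just my illustration, not something I have validated.

```python
import random

# Hypothetical seed terms that might signal a SOCIAL_ENTERPRISE mention;
# in practice these would come from a curated terminology list.
SEED_TERMS = {"social enterprise", "benefit corporation", "cooperative"}

def balanced_sample(sentences, n, seed=0):
    """Sample up to n sentences: ~50% containing a seed term, ~50% without.

    Only a sketch of the 50/50 idea from the question above; string
    matching is a placeholder for a real candidate-detection step.
    """
    with_term = [s for s in sentences
                 if any(t in s.lower() for t in SEED_TERMS)]
    without_term = [s for s in sentences if s not in with_term]
    rng = random.Random(seed)
    half = n // 2
    pos = rng.sample(with_term, min(half, len(with_term)))
    neg = rng.sample(without_term, min(n - len(pos), len(without_term)))
    sample = pos + neg
    rng.shuffle(sample)
    return sample

sents = [
    "Grameen Bank is often described as a social enterprise.",
    "Paris is the capital of France.",
    "The cooperative was founded in 1956.",
    "Mount Everest is the highest mountain on Earth.",
]
print(len(balanced_sample(sents, 4)))  # → 4
```

Is this kind of balanced sampling a reasonable way to build the evaluation set, or does it bias the evaluation compared with the natural entity density of Wikipedia?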
Please advise, as I am not aware of the best practices.