Wikipedia entities

MBSanchez · May 24, 2019, 8:28am

Hi! I am planning to train a model that identifies entities from a variety of subjects, such as economy, energy, health…

I don’t have a list of terms to match with the Matcher, because such a list would be too broad. Instead, I thought of getting entities from Wikipedia, together with the categories for those entities. These would give me a first set of entity types to correct in Prodigy and use to train my model (I don’t know if this is a correct approach).

To do this, I thought of adding a pipeline component that checks whether a Wikipedia page exists for each token and looks up the category for that page, which I would then add to my entities (I am planning to use pywikibot to get those categories).

As you can see, I am a bit lost. I have searched for a model that would do this but was not able to find one, so I came up with this solution, which may be too extravagant.

Could you help me by telling me what your approach to this problem would be?

Thanks!

ines · May 24, 2019, 10:05am

Yes, this sounds like a good plan

This is an interesting idea! You might also check the noun chunks or sequences of proper nouns (one or more PROPN tokens etc.) to capture more options. Otherwise, you'll only be checking for single tokens, which is pretty limiting.
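As a rough sketch of the proper-noun idea: the helper below groups consecutive PROPN tokens into candidate phrases that you could then look up on Wikipedia. The function name `propn_sequences` and the sample sentence are made up for illustration, and the input is a plain list of `(text, pos)` pairs so the snippet runs without a model; with spaCy you would build that list from a processed `Doc`.

```python
from itertools import groupby


def propn_sequences(tagged_tokens):
    """Return (start, end, text) for each run of consecutive PROPN tokens.

    tagged_tokens is a list of (text, pos) pairs. start and end are token
    indices, with end exclusive.
    """
    sequences = []
    index = 0
    # groupby yields runs of adjacent tokens that share the same key value.
    for is_propn, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        tokens = list(group)
        if is_propn:
            text = " ".join(token_text for token_text, _ in tokens)
            sequences.append((index, index + len(tokens), text))
        index += len(tokens)
    return sequences


# With spaCy, the pairs could come from: [(t.text, t.pos_) for t in nlp(text)]
tagged = [
    ("Pedro", "PROPN"),
    ("Sánchez", "PROPN"),
    ("visited", "VERB"),
    ("Madrid", "PROPN"),
    (".", "PUNCT"),
]
candidates = propn_sequences(tagged)
```

Here "Pedro Sánchez" becomes a single candidate instead of two separate tokens, which is usually what you want to look up.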

Also make sure that you have some logic that maps the Wikipedia categories to your label scheme. You don't want to be using whatever Wikipedia suggests as a category as your entity label – instead, you want to make sure that a category like "Politician" is mapped to PERSON, and so on.
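A minimal sketch of such a mapping, assuming simple keyword rules are enough for a first pass. The rules in `CATEGORY_RULES` and the function name are hypothetical placeholders; you would replace them with rules for the categories you actually see and for your own label scheme.

```python
# Hypothetical keyword -> label rules. The first matching rule wins.
CATEGORY_RULES = [
    ("politician", "PERSON"),
    ("political part", "ORG"),  # matches "political party" and "political parties"
    ("compan", "ORG"),          # matches "company" and "companies"
]


def label_for_categories(categories, rules=CATEGORY_RULES):
    """Return the first label whose keyword occurs in any category, else None."""
    for category in categories:
        lowered = category.lower()
        for keyword, label in rules:
            if keyword in lowered:
                return label
    # No rule matched: better to skip the entity than to guess a label.
    return None


person_label = label_for_categories(["Spanish politicians", "Living people"])
party_label = label_for_categories(["Political parties in Spain"])
unknown_label = label_for_categories(["1879 establishments"])
```

Returning `None` for unmapped categories keeps unexpected Wikipedia categories out of your annotations, so you only pre-label what you have explicitly decided on.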

Finally, making many requests to Wikipedia could potentially be quite slow, so you probably want to use your component for pre-processing the texts, export the result as a JSONL file and load that into Prodigy (instead of doing it all in the recipe script).
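The export step could look roughly like this. Each line of the file is one JSON object with a `"text"` and a list of `"spans"` given as character offsets plus a label. The sample text, the labels and the `write_jsonl` helper are made up for illustration, and the snippet writes to an in-memory buffer so it runs anywhere; in practice you would pass an open file instead.

```python
import io
import json


def write_jsonl(tasks, file_obj):
    """Write one JSON object per line to an open text file object."""
    for task in tasks:
        # ensure_ascii=False keeps characters like "á" readable in the file.
        file_obj.write(json.dumps(task, ensure_ascii=False) + "\n")


text = "Pedro Sánchez visited Madrid."
tasks = [
    {
        "text": text,
        "spans": [
            {"start": 0, "end": 13, "label": "PERSON"},
            {"start": 22, "end": 28, "label": "GPE"},
        ],
    }
]

# For a real export: with open("tasks.jsonl", "w", encoding="utf8") as f: ...
buffer = io.StringIO()
write_jsonl(tasks, buffer)
```

Because the slow Wikipedia lookups happen once during this pre-processing step, the annotation session itself only has to read the finished file.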

Instead of going from your text → Wikipedia, you could also try the opposite approach and extract the top Wikipedia pages for your subjects and then create matcher patterns for them. For instance, let's say one of your subjects is politics in Spanish. You could then start by querying Wikipedia lists for politicians, political parties etc. from Spanish Wikipedia and use the titles for your match patterns.
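Here is a small sketch of the second half of that idea, turning page titles into token-based match patterns. It assumes you already have the titles (the two below are made-up samples, and fetching them from Wikipedia is left out), and it assumes that splitting on whitespace is close enough to the tokenizer for typical titles. `title_to_pattern` is a hypothetical helper name.

```python
import re


def title_to_pattern(title, label):
    """Convert a page title into a case-insensitive token pattern entry."""
    # Drop a trailing disambiguation suffix such as " (España)".
    cleaned = re.sub(r"\s*\([^)]*\)\s*$", "", title).strip()
    # Assumption: one whitespace-separated word corresponds to one token.
    token_patterns = [{"lower": word.lower()} for word in cleaned.split()]
    return {"label": label, "pattern": token_patterns}


# Made-up sample of (title, label) pairs; real ones would come from your queries.
titled_entities = [
    ("Pedro Sánchez", "PERSON"),
    ("Partido Popular (España)", "ORG"),
]
patterns = [title_to_pattern(title, label) for title, label in titled_entities]
```

Written out one entry per line as JSONL, a list like this can serve as a patterns file. Titles with punctuation or hyphens may tokenize differently, so it is worth spot-checking a few against the real tokenizer.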
