Wikipedia entities

Hi! I am planning to train a model that identifies entities from a variety of subjects, such as economy, energy, health…

I don’t have a list of terms to match with the Matcher, because such a list would be too wide. Instead, I have thought of getting entities from Wikipedia, together with the categories of those entities. That would give me initial entity types that I can correct in Prodigy and then use to train my model (I don’t know if this is the right approach).

To do this, I have thought of adding a pipeline component that checks whether a Wikipedia page exists for each token and, if so, adds the category of that page to my entities (I am planning to use pywikibot to get those categories).

As you can see, I am a bit lost :slight_smile:. I have searched for a model that would do this but was not able to find one, so I came up with this solution, and I don’t know if it is too extravagant.

Could you tell me what your approach to this problem would be?


Yes, this sounds like a good plan :+1:

This is an interesting idea! You might also check the noun chunks or sequences of proper nouns (one or more consecutive PROPN tokens) to capture more candidates. Otherwise, you'll only be checking for single tokens, which is pretty limiting.
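To make the proper-noun idea a bit more concrete, here is a minimal sketch that groups consecutive PROPN tokens into multi-word candidates. It works on plain `(text, pos)` pairs so it runs without a model; the function name `propn_sequences` and the sample sentence are just illustrations. With spaCy, you could build the pairs with something like `[(t.text, t.pos_) for t in nlp(text)]`.

```python
from itertools import groupby


def propn_sequences(tagged_tokens):
    """Return maximal runs of consecutive PROPN tokens, joined by spaces."""
    candidates = []
    # groupby splits the token list into alternating PROPN / non-PROPN runs.
    for is_propn, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            candidates.append(" ".join(text for text, _ in group))
    return candidates


# Hand-tagged sample standing in for a parsed spaCy Doc.
tagged = [
    ("Angela", "PROPN"),
    ("Merkel", "PROPN"),
    ("visited", "VERB"),
    ("Madrid", "PROPN"),
    (".", "PUNCT"),
]
print(propn_sequences(tagged))  # ['Angela Merkel', 'Madrid']
```

Each candidate string is what you would then look up on Wikipedia, instead of looking up every single token.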

Also make sure that you have some logic that maps the Wikipedia categories to your label scheme. You don't want to be using whatever Wikipedia suggests as a category as your entity label – instead, you want to make sure that a category like "Politician" is mapped to PERSON, and so on.
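The mapping logic could be as simple as a keyword lookup. This is only a sketch: apart from "Politician" → PERSON, the entries in `CATEGORY_TO_LABEL` are made-up examples, and `label_for_categories` is a hypothetical helper name. You would fill the table in for your own label scheme.

```python
# Keyword -> entity label. Only the "politician" entry comes from the
# discussion above; the other entries are illustrative assumptions.
CATEGORY_TO_LABEL = {
    "politician": "PERSON",
    "political party": "ORG",
    "power station": "FACILITY",
}


def label_for_categories(categories, mapping=CATEGORY_TO_LABEL):
    """Return the label of the first category that contains a known keyword."""
    for category in categories:
        normalized = category.lower()
        for keyword, label in mapping.items():
            if keyword in normalized:
                return label
    # No keyword matched: better to skip the candidate than to guess a label.
    return None


print(label_for_categories(["Category:21st-century Spanish politicians"]))  # PERSON
print(label_for_categories(["Category:Rivers of Spain"]))  # None
```

Returning `None` for unmapped categories keeps arbitrary Wikipedia categories out of your label set.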

Finally, making many requests to Wikipedia could potentially be quite slow, so you probably want to use your component for pre-processing the texts, export the result as a JSONL file and load that into Prodigy (instead of doing it all in the recipe script).
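For the export step, something along these lines should work. It assumes the pre-processing already produced character offsets and labels; the `"text"` and `"spans"` keys with `"start"`, `"end"` and `"label"` follow Prodigy's task format for NER, while `write_jsonl` and the file name are placeholders for this example.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, examples):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


text = "Angela Merkel visited Madrid."
examples = [
    {
        "text": text,
        # Character offsets into the text; the end offset is exclusive.
        "spans": [{"start": 0, "end": 13, "label": "PERSON"}],
    }
]

# Written to the temp directory here only to keep the sketch free of side effects.
output_path = Path(tempfile.gettempdir()) / "preannotated.jsonl"
write_jsonl(output_path, examples)
```

A manual recipe such as `ner.manual` should then show the spans pre-highlighted, so you only need to correct them.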

Instead of going from your text → Wikipedia, you could also try the opposite approach and extract the top Wikipedia pages for your subjects and then create matcher patterns for them. For instance, let's say one of your subjects is politics in Spanish. You could then start by querying Wikipedia lists for politicians, political parties etc. from Spanish Wikipedia and use the titles for your match patterns.
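If you go the Wikipedia → patterns route, converting page titles into match patterns might look roughly like this. The titles below are hard-coded stand-ins for whatever your query returns, `title_to_pattern` is a hypothetical helper, and splitting on whitespace only approximates spaCy's tokenization, so titles with punctuation may need more care.

```python
import json
import re


def title_to_pattern(title, label):
    """Turn a page title into a case-insensitive token pattern entry."""
    # Drop a trailing disambiguation suffix such as " (politician)".
    clean = re.sub(r"\s*\([^)]*\)$", "", title).strip()
    # One {"lower": ...} dict per whitespace-separated word.
    return {"label": label, "pattern": [{"lower": word.lower()} for word in clean.split()]}


# Stand-in titles; in practice these would come from your Wikipedia query.
titles = [
    ("Pedro Sánchez", "PERSON"),
    ("Podemos (Spanish political party)", "ORG"),
]

patterns = [title_to_pattern(title, label) for title, label in titles]
for pattern in patterns:
    # One JSON object per line, ready to be saved as a patterns JSONL file.
    print(json.dumps(pattern, ensure_ascii=False))
```

The resulting lines can be saved to a patterns file and passed to a recipe that accepts patterns, or loaded into a spaCy `Matcher`.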