Determining most salient geographic entities in news text

I'm writing a module that, given a news abstract, determines the most probable location at which the described events are occurring. Examples: "Trade talks between China and USA were held in Singapore", "Singapore hosted trade talks between China and USA". In both cases the correct GPE is Singapore.

I experimented with the dependency parse in spaCy and found that I could probably write very complicated rules based on the dep_ properties of tokens, and possibly on parse tree morphology. But it seems like a task for training a model, probably a seq2seq model that would output the doc's GPE entities in order of probability.
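For what it's worth, the rule-based direction can be sketched without much machinery. The snippet below uses plain dicts standing in for spaCy tokens (the `dep` values follow spaCy's dependency scheme, but the scoring heuristic and the small word lists are my own assumptions, not anything spaCy provides):

```python
# Heuristic sketch: prefer a GPE that is the object of a locative
# preposition ("held in Singapore") over GPEs that are merely
# participants ("between China and USA").
LOCATIVE_PREPS = {"in", "at"}      # assumption: tiny allowlist for illustration
HOSTING_VERBS = {"host", "hold"}   # assumption: verb lemmas that imply a venue

def most_salient_gpe(tokens):
    """Return the text of the GPE most likely to be the event location."""
    best, best_score = None, 0
    for tok in tokens:
        if tok["ent_type"] != "GPE":
            continue
        score = 1  # any GPE is at least a weak candidate
        if tok["dep"] == "pobj" and tok["head_text"].lower() in LOCATIVE_PREPS:
            score = 3  # object of a locative preposition
        elif tok["dep"] == "nsubj" and tok["head_lemma"] in HOSTING_VERBS:
            score = 2  # subject of a hosting verb ("Singapore hosted ...")
        if score > best_score:
            best, best_score = tok["text"], score
    return best

# "Trade talks between China and USA were held in Singapore"
tokens = [
    {"text": "China", "ent_type": "GPE", "dep": "pobj",
     "head_text": "between", "head_lemma": "between"},
    {"text": "USA", "ent_type": "GPE", "dep": "conj",
     "head_text": "China", "head_lemma": "China"},
    {"text": "Singapore", "ent_type": "GPE", "dep": "pobj",
     "head_text": "in", "head_lemma": "in"},
]
print(most_salient_gpe(tokens))  # → Singapore
```

Even this toy version shows why the rules get complicated fast: passives, appositives and coordination all change which dependency path signals "venue".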

I'm not sure that Prodigy is the right tool for building an end-to-end solution (annotation, model definition, training, ...), but if it is, how would I even start? Or would I be better off feeding token data from the doc into a custom model?


I can see how an ML solution would be helpful here, but does it need to be sequence-to-sequence? You could have a model that makes one binary prediction per GPE entity, using features from a transformer encoder or BiLSTM. You'd probably want to write the custom model in PyTorch, but the annotation should be easy to do in Prodigy.
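To make the per-entity framing concrete, here's a minimal sketch. The feature indicators and weights are hypothetical stand-ins; in a real system the features would come from a transformer encoder or BiLSTM, and the weights would be learned from your Prodigy annotations:

```python
import math

def entity_features(sent, gpe):
    """Hand-written stand-ins for encoder features (in practice these
    would be contextual embeddings, not indicator functions)."""
    words = sent.lower().split()
    i = words.index(gpe.lower())
    return [
        1.0 if i > 0 and words[i - 1] == "in" else 0.0,                # locative prep before
        1.0 if i == 0 else 0.0,                                        # sentence-initial
        1.0 if i + 1 < len(words) and words[i + 1] == "hosted" else 0.0,  # hosting verb after
    ]

def score(features, weights, bias):
    """One binary logistic prediction: P(this GPE is the event location)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights; a trained model would learn these.
weights, bias = [2.5, 0.5, 2.0], -1.0

sent = "Singapore hosted trade talks between China and USA"
for gpe in ["Singapore", "China", "USA"]:
    print(gpe, round(score(entity_features(sent, gpe), weights, bias), 2))
```

The nice property of this framing is that ranking falls out for free: score every GPE candidate independently and sort by probability, instead of asking a seq2seq decoder to emit entities in order.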

I do think having standard GPE detection as a preprocessing step will be useful to you, but you can also try predicting the location directly, without that step.
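The two-stage flow described above (standard GPE detection first, then a salience model over the candidates) could be wired together like this. Both `detect_gpes` and `salience_score` are hypothetical stand-ins: the first would normally be spaCy's NER, the second your trained classifier.

```python
def detect_gpes(text):
    """Stand-in for the NER preprocess (e.g. spaCy's GPE entities).
    Here: a toy gazetteer lookup, purely for illustration."""
    known = {"Singapore", "China", "USA"}
    return [w.strip(".,") for w in text.split() if w.strip(".,") in known]

def salience_score(text, gpe):
    """Stand-in for the trained per-entity model. Toy rule:
    reward a GPE directly preceded by 'in'."""
    words = [w.strip(".,") for w in text.split()]
    i = words.index(gpe)
    return 1.0 if i > 0 and words[i - 1] == "in" else 0.1

def rank_locations(text):
    """Detect candidates, then return them most-salient first."""
    gpes = detect_gpes(text)
    return sorted(gpes, key=lambda g: salience_score(text, g), reverse=True)

print(rank_locations("Trade talks between China and USA were held in Singapore"))
```

Skipping the preprocess would mean predicting the location span directly from the text, which needs more training data but avoids compounding NER errors.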