Relating entities + resolving coreferences in Russian texts

Hi! I need to create a set of entities and interrelate them. The documentation on ner.manual and rel.manual is pretty clear, but I would like to clarify a few details related to coreference resolution, if possible.

Say we have a corpus on different types of vehicles. We want to recognize a few types of those, as well as some of their attributes (e.g. speed, size), and then recognize which attribute is related to which entity. The corpus is in Russian, but I'll use English for the examples.

I think my general steps will be:

  • Annotating separate entities for the vehicle types (say, a car, a ship, or a plane) and for the attributes;

  • Training a model and (if it performs well) plugging it, as well as all the pre-annotated data I have, into rel.manual and then annotating the relations.
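On the command line, those steps might look roughly like this. The dataset names, file names, and labels below are mine, not from the thread, and the exact arguments vary by Prodigy version, so check `prodigy <recipe> --help`:

```shell
# 1. Annotate vehicle and attribute spans manually.
prodigy ner.manual vehicles_ner blank:ru corpus.jsonl \
    --label VEHICLE,ATTRIBUTE

# 2. Train an NER model from those annotations (v1.11+ syntax).
prodigy train ./ner_model --ner vehicles_ner

# 3. Use the trained pipeline in rel.manual and annotate relations
#    between the spans; --add-ents pre-populates the model's entities.
prodigy rel.manual vehicles_rel ./ner_model/model-best corpus.jsonl \
    --label HAS_ATTRIBUTE --span-label VEHICLE,ATTRIBUTE --add-ents
```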

Now to what I'm least certain about. First, the entities and their attributes may appear a sentence or two apart. I guess we could potentially improve the relation extraction by also annotating with a COREF tag and then applying neuralcoref. So, e.g., instead of:

I could attach a COREF label in rel.manual and annotate like this:
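To sketch the second variant in data rather than screenshots: coreference becomes just one more relation label alongside the "real" one. The field names below are my assumption of the relations task shape, not copied from the docs:

```python
# Sketch: a rel.manual-style record where "It" is linked to "car" via
# COREF, and the attribute attaches to the nearby pronoun instead of
# the far-away noun. Labels HAS_ATTRIBUTE/COREF are illustrative.
text = "The car is fast. It reaches 105 km/h."

record = {
    "text": text,
    "spans": [
        {"start": 4, "end": 7, "label": "VEHICLE"},     # "car"
        {"start": 17, "end": 19, "label": "VEHICLE"},   # "It"
        {"start": 28, "end": 36, "label": "ATTRIBUTE"}, # "105 km/h"
    ],
    "relations": [
        # "It" corefers with "car" ...
        {"head": 1, "child": 0, "label": "COREF"},
        # ... and the attribute relates to the pronoun in its sentence.
        {"head": 2, "child": 1, "label": "HAS_ATTRIBUTE"},
    ],
}

# Character offsets should slice back to the annotated surface forms.
for span in record["spans"]:
    print(text[span["start"]:span["end"]], span["label"])
```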

So I'd like to ask:

  • First and most important: since neuralcoref has no support for Russian, we could try to train a model on our own corpora. But would that be feasible at all, given our very limited bandwidth and the modest size of the corpus (a few thousand records)?

  • If so, could this custom Russian model come in handy? It has a tagger and a parser, and these seem to be the prerequisites.

  • Is it OK to annotate coreference not with a separate recipe but as one of the labels in rel.manual? And if it is, should I annotate the linking nouns/pronouns/phrases in any specific way?

Pardon the wall of text; I did my best to word everything as concisely as I could :slight_smile:



Linguistically speaking, I follow your reasoning entirely: it does make more sense to relate "105 km/h" to "It" in the second sentence, as the relation extraction model can then exploit the grammatical relation between the two. Of course, this also means you'll have another ML model in the pipeline, and errors may propagate. To make matters worse, coreference resolution is quite hard, and training a model from scratch on a limited dataset will be a challenge, I'm afraid.

I find it difficult to advise one approach over the other; I think it ultimately comes down to the data and time you're able to invest in this. You may just need to experiment a little to find out what works best for your use case.

I do want to note that yes, you can annotate coreference with the rel.manual recipe. The coref.manual recipe specifies some useful defaults for dealing with coreference annotations, like disabling all words except nouns, proper nouns and pronouns, as these are the usual targets. In general, we'd also advise doing your annotations in separate steps, i.e. do the coref first, then run through the data again to annotate the relations. But if you'd prefer to do it all at once, you could just add a label "COREF" (or so) to your label set for rel.manual and use that. In the end, coref can be seen as a specific type of relation extraction, after all.
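For reference, the two routes could be invoked something like this, assuming a trained pipeline at ./ner_model/model-best; names and labels are illustrative, so check the recipe docs for the exact flags:

```shell
# Separate pass: coref.manual, which by default disables all tokens
# except nouns, proper nouns and pronouns.
prodigy coref.manual vehicles_coref ./ner_model/model-best corpus.jsonl \
    --label COREF

# All-in-one pass: add COREF as one more relation label in rel.manual.
prodigy rel.manual vehicles_rel ./ner_model/model-best corpus.jsonl \
    --label HAS_ATTRIBUTE,COREF --span-label VEHICLE,ATTRIBUTE
```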

Hope that helps at least a little, sorry I can't be more decisive! :wink:

Hi Sofie!

I suspected these things couldn't be answered with much certainty, but your answer is still very helpful. We're quite short on resources indeed, so we'll probably try going without coreference.

Thanks a lot!