Information extraction from legislative text - Doubts and questions

Hi!

It's a good question to bring up, because there are often several different approaches to any given NLP challenge. What isn't clear to me for this use-case is exactly what type of information you want to retrieve. I understand you want to distinguish between "delegation" and "constraint". But do you also want to annotate exactly which authority is given or constrained? In

"the ENTITY may not do this or that"

do you want to extract the span for "this or that"? And how will you process that span in a downstream application? Or would it be sufficient just to know that the ENTITY is being constrained, without further specifics?

If you want to annotate the span, I'd advise you to look into spaCy's spancategorizer, which is designed for spans that are not necessarily named entities like cities or persons. It might work better than the ner on the type of entities you've described, though the proof is always in the pudding :wink:
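
To make that a bit more concrete, here is a minimal sketch of how span annotations for the spancategorizer are stored. The example sentence and the labels ENTITY, CONSTRAINT and OBJECT are made up for illustration, and "sc" is just the default key the component reads from. Unlike doc.ents, the spans in doc.spans are allowed to overlap or nest:

```python
import spacy
from spacy.tokens import Span

# Blank English pipeline: only a tokenizer, no trained components.
nlp = spacy.blank("en")

# Add an (untrained) spancategorizer and register a hypothetical label.
spancat = nlp.add_pipe("spancat")
spancat.add_label("CONSTRAINT")

# Made-up example sentence, tokenized only.
doc = nlp.make_doc("The agency may not disclose personal data to third parties.")

# Gold annotations live in doc.spans under a key ("sc" is the default).
# Token offsets are end-exclusive; the last two spans overlap on purpose.
doc.spans["sc"] = [
    Span(doc, 1, 2, label="ENTITY"),       # "agency"
    Span(doc, 4, 10, label="CONSTRAINT"),  # "disclose ... third parties"
    Span(doc, 5, 7, label="OBJECT"),       # "personal data"
]

for span in doc.spans["sc"]:
    print(span.label_, "->", span.text)
```

One thing to keep in mind: the component only scores the candidate spans that its suggester function proposes, so for long spans like the CONSTRAINT one above you would probably have to adjust the suggester settings in your config.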

If you don't necessarily care about the actual span, but just want to know whether or not there is a constraint, you might also consider using the textcat. This would be applicable particularly when you have one entity in a sentence that you care about, and you just want to label whether that sentence describes a delegation or a constraint. The textcat will be able to take more of the context of the sentence into consideration.
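
As a rough sketch of what the textcat setup could look like, assuming two mutually exclusive, hypothetical labels DELEGATION and CONSTRAINT and two made-up sentences (in practice you'd train from a config with many more examples):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# "textcat" assumes mutually exclusive classes; "textcat_multilabel"
# would be the alternative if a sentence can be both at once.
textcat = nlp.add_pipe("textcat")
textcat.add_label("DELEGATION")
textcat.add_label("CONSTRAINT")

# Made-up sentences with one category score per label.
train_data = [
    (
        "The agency may issue regulations to implement this section.",
        {"cats": {"DELEGATION": 1.0, "CONSTRAINT": 0.0}},
    ),
    (
        "The agency may not disclose personal data to third parties.",
        {"cats": {"DELEGATION": 0.0, "CONSTRAINT": 1.0}},
    ),
]

# Each Example pairs an unannotated doc with its reference annotations.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in train_data
]

for example in examples:
    print(example.reference.cats, "<-", example.reference.text)
```

After training, the predicted scores end up in doc.cats, one per label.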

I'd advise you to also go through this recent Prodigy thread where a similar trade-off between different approaches was discussed:

You might have considered this already, but a rule-based approach could potentially also work for your use-case. In particular, check out spaCy's dependency matcher, which might help you hit the ground running, or at least bootstrap some quick samples for further annotation/curation!
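
Here's a small sketch of what such a rule could look like. To keep it self-contained, the dependency parse below is written out by hand for a made-up sentence; in practice it would come from a trained pipeline with a parser. The pattern and the CONSTRAINT rule name are only an illustration of the idea "a verb with a subject, a modal and a negation":

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse for illustration; heads are absolute token indices.
words = ["The", "agency", "may", "not", "disclose", "personal", "data", "."]
heads = [1, 4, 4, 4, 4, 6, 4, 4]
deps = ["det", "nsubj", "aux", "neg", "ROOT", "amod", "dobj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Anchor on the root verb, then require three direct children:
# a subject, a modal auxiliary and a negation.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"DEP": "ROOT"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "modal",
     "RIGHT_ATTRS": {"DEP": "aux", "LOWER": {"IN": ["may", "shall", "must"]}}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "negation",
     "RIGHT_ATTRS": {"DEP": "neg"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("CONSTRAINT", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
matches = matcher(doc)
for match_id, token_ids in matches:
    rule = nlp.vocab.strings[match_id]
    print(rule, [doc[i].text for i in token_ids])
```

Dropping the negation node would give you a first rough rule for the "delegation" case, and the matched sentences could then be fed into annotation for curation.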

Should we train NER and relation extraction sequentially or at the same time?

I'm not convinced this will be the right approach, as NER+REL is mostly meant for cases like "PERSON lives-in CITY" where the named entities are clear and can be trained independently of the REL model in a first step. In your use-case, it sounds like the relation and the definition of the new "entity" (delegation/constraint) are kind of entangled, and I'm not sure you could predict the entities without predicting the relation, so it feels like a bad fit. I might be wrong though, and I'm happy to discuss more. In that case it would be good to see some more concrete data examples, and to hear how you want to process them in your downstream pipeline :slight_smile:
