Information extraction from legislative text - Doubts and questions

Hi everybody,

My team and I are currently working on a domain-specific project and I am writing to ask for your advice on how to proceed, since I am relatively new to NLP and spaCy.

So far I have used spaCy and Prodigy to perform NER tasks on a text corpus of legislative documents. The basic idea was to identify and extract different kinds of entities that are mentioned in our corpus. I have been able to annotate documents in various ways, to import custom word vectors into spaCy and to pre-train tok2vec weights to boost model performance. So far so good.

However, I have recently been asked by my team to investigate whether we could add a new element to the analysis. More specifically, we want to understand the extent to which the entities are involved in two different kinds of situations: we want to highlight either "delegation" (i.e., when an entity is given authority to do something) or "constraint" (i.e., when an entity's prerogatives are limited and/or it is required to do something) situations.

As a starting point, we looked at the texts and identified a number of key elements (mostly verbs), we grouped those into 'families', and we came up with a sort of grid that helps identify each situation depending on how the elements are used. For example, a permissive modal verb often indicates a "delegation" situation ("the ENTITY may designate..."), but when the same modal verb is used in negative form it indicates a "constraint" ("the ENTITY may not do this or that").
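As a first approximation, the grid could be sketched as a plain function mapping a modal cue to a situation label (the cue inventory and labels below are simplified for illustration, not our actual grid):

```python
# Toy sketch of the cue grid: a permissive modal signals delegation,
# its negated form signals constraint, obligation modals signal constraint.
def classify_cue(modal: str, negated: bool) -> str:
    """Map a modal-verb cue to a situation label."""
    if modal == "may":
        # permissive modal: delegation, unless negated
        return "constraint" if negated else "delegation"
    if modal in ("shall", "must"):
        # obligation-style modals
        return "constraint"
    return "unknown"

print(classify_cue("may", False))  # delegation
print(classify_cue("may", True))   # constraint
```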

The point is I am a bit lost now on how to proceed as far as the choice of the appropriate model goes.

At first I thought that something along the lines of the "relation extraction" method would be the best fit for our needs.

On the other hand I wonder whether a custom POS model could be a solution, so as to identify the families of verbs we are interested in, as well as the way they are used in each sentence (negative vs. standard form, active or passive, with or without a modal verb, and so on). Eventually we would combine the results of the NER with the POS tags.

Any advice you could give me would be much appreciated. What would be the best course of action in your opinion?

Moreover, I have trouble figuring out how the NER part of the analysis would fit into the whole picture, in practice. Should we train NER and relation extraction sequentially or at the same time? And what about the annotation?

Thanks, I'm looking forward to your input.


It's a good question to bring up, because in general there are often a few different approaches to any given NLP challenge. What isn't clear to me for this use case is exactly what type of information you want to retrieve. I understand you want to find the difference between "delegation" and "constraint". But do you want to annotate what authority exactly is given/constrained? In

"the ENTITY may not do this or that"

do you want to extract the span for "this or that"? And how will you process that span in a downstream application? Or would it be sufficient just to know that the ENTITY is being constrained, without further specifics?

If you want to annotate the span, I'd advise you to look into spaCy's span categorizer, which is designed for spans that are not necessarily named entities like cities or persons. It might work better than the NER on the type of entities you've described, though the proof is always in the pudding :wink:

If you don't necessarily care about the actual span, but just want to know whether or not there is a constraint, you might also consider using the textcat. This would be applicable particularly when you have one entity in a sentence that you care about, and you just want to label whether that sentence describes a delegation or a constraint. The textcat will be able to take more of the context of the sentence into consideration.
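To sketch what that would look like as training data: a textcat example pairs a sentence with category scores in spaCy's `"cats"` dict format (the label names DELEGATION/CONSTRAINT are made up for this use case):

```python
# Hypothetical textcat training example: sentence-level labels,
# no span annotation needed. Label names are illustrative.
example = {
    "text": "The Member States shall inform the Commission without delay.",
    "cats": {"DELEGATION": 0.0, "CONSTRAINT": 1.0},
}

# With mutually exclusive labels, the scores should sum to 1.0.
assert sum(example["cats"].values()) == 1.0
```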

I'd advise you to also go through this recent Prodigy thread where a similar trade-off between different approaches was discussed:

You might have considered this already, but a rule-based approach could potentially also work for your use case. In particular, check out spaCy's DependencyMatcher, which might help you hit the ground running, or at least bootstrap some quick samples for further annotation/curation!
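As an illustration, a DependencyMatcher pattern for the "ENTITY shall VERB" construct could look roughly like this. The pattern itself is plain data; actually running it assumes a pipeline with a dependency parser (e.g. en_core_web_sm), so the usage part is shown as comments:

```python
# Sketch of a DependencyMatcher pattern: anchor on the main verb,
# then require a "shall" auxiliary child and a nominal subject child.
pattern = [
    {"RIGHT_ID": "action", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "action", "REL_OP": ">", "RIGHT_ID": "modal",
     "RIGHT_ATTRS": {"DEP": "aux", "LOWER": "shall"}},
    {"LEFT_ID": "action", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]

# Usage (requires spaCy and a parsing pipeline):
# import spacy
# from spacy.matcher import DependencyMatcher
# nlp = spacy.load("en_core_web_sm")
# matcher = DependencyMatcher(nlp.vocab)
# matcher.add("SHALL_CONSTRUCT", [pattern])
# matches = matcher(nlp("The Member States shall inform the Commission."))
```

Once a match is found, the "subject" token can be compared against the NER output to check whether the construct involves one of your entities.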

Should we train NER and relation extraction sequentially or at the same time?

I'm not convinced this will be the right approach, as NER+REL is mostly meant for cases like "PERSON lives-in CITY" where the named entities are clear and can be trained independently of the REL model in a first step. In your use case, it sounds like the relation and the definition of the new "entity" (delegation/constraint) are kind of entangled, and I'm not sure you could predict the entities without predicting the relation, so it feels like a bad fit. I might be wrong though, and I'm happy to discuss more, but then it would also be good to get some more concrete data examples and to know how you want to process them in your downstream pipeline :slight_smile:


Hi Sofie,
thanks a lot for your answer and your advice.

I am now realizing that the description I gave of our use case was maybe too condensed. I'm going to try to explain in more detail what exactly our needs and expectations are.

  1. The first, preliminary task that we want to accomplish is to identify and label different kinds of institutions, agencies and authorities. We are interested in 5 macro categories and we are currently in the process of improving the accuracy of our NER model.

This is an example of a typical sentence from a piece of legislation that we are interested in annotating.


  2. The second step in our analysis would be to try and understand whether the named entities that we extracted via NER are in some way limited in their action by the provision of the law (a constraint situation) or on the contrary they are given power over something or to do something (a delegation situation).

For example, in the sentence attached above we would consider the member states to be subject to a constraint. In the language of the legislators, in fact, the words "shall inform" establish an obligation for the member states.

I understand you want to find the difference between "delegation" and "constraint". But do you want to annotate what authority exactly is given/constrained?

No, we are not interested in further specifying how the entities are constrained/delegated.

Nonetheless, we face two sets of problems.

a) We want to highlight constraint/delegation situations only if they involve one of the named entities detected by our NER model. Moreover, in sentences with more than one entity, we want to know which one is involved in the relationship (i.e., in the example above, we want the model to understand that the constraint is imposed on the member states, ignoring the other entity).

b) Second problem: sometimes the same syntactic cues need to be interpreted differently (in terms of delegation vs. constraint) depending on which type of institution is involved. Take a sentence such as


In this case the same structure (modal verb 'shall' + verb), which constituted a constraint for the member states in the first example, here constitutes a delegation of power to the Commission! This is entirely due to theoretical reasons related to what the legislators mean to say when they talk about different types of entities, but at the end of the day we have a situation in which the same elements could lead to different outcomes in terms of delegation/constraint depending on who the story is about.

In light of all this - and I apologize for the long explainer - I am not sure that the span categorizer (or a classification of the whole sentence via textcat) would be a viable option.

My first impression is that our best shot would be to train a POS tagger to recognize a set of syntactic units and forms (such as different 'families' of verbs), and to combine the predictions of such a model with the results of the NER model that identifies the different institutions. To this end, we drew up a grid that associates different combinations of named entities + syntactic elements with an outcome in terms of delegation or constraint.

Going back to the previous example, we would have:

  • a NER model that identifies "the commission" as a relevant entity;
  • a POS tagger that identifies the subject ("the commission") and the modal verb "shall" followed by a verb in base form.

We would then be able to associate the right outcome to this specific combination.
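To illustrate, the grid could be implemented as a simple lookup that combines the two model outputs (all category names and cue strings below are hypothetical, just to show the shape of the rule table):

```python
# Hypothetical decision grid combining NER output (entity category)
# with the syntactic cue found by the tagger/matcher. "ANY" rules
# apply regardless of the entity category.
GRID = {
    ("MEMBER_STATE", "shall + VERB"): "constraint",
    ("COMMISSION", "shall + VERB"): "delegation",
    ("ANY", "may + VERB"): "delegation",
    ("ANY", "may not + VERB"): "constraint",
}

def classify(entity_category, cue):
    """Look up the outcome for an (entity, cue) pair, falling back
    to an entity-agnostic rule when no specific rule exists."""
    return GRID.get((entity_category, cue), GRID.get(("ANY", cue)))

print(classify("MEMBER_STATE", "shall + VERB"))  # constraint
print(classify("COMMISSION", "shall + VERB"))    # delegation
```

This keeps the entity-dependent logic of problem (b) in one explicit, auditable table rather than hidden inside a model.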

Does it make sense to you? (Is it even feasible? I don't know whether a POS model could even be customised in this way.) Or would you proceed differently?

For example, would a "relation extraction" approach be able to overcome the problem outlined in (b)?

Thanks in advance =)

Considering your further explanations, I do think this approach may work, though it depends on the variability of these kinds of constructs in your texts. If you're able to identify the entities well, you should be able to find those grammatical constructs as well, either with some kind of POS tagger as you mentioned, or with the DependencyMatcher I mentioned before.

But if I understand your case correctly, you'd then still need to decide on the final label of the full construct - whether "ENTITY shall VERB" (to put it simply) is a constraint or delegation. For this, you might still be able to use the spancat. Basically you'd have a complex "suggester" function that finds the entities and the verbs, and labels the full phrase as a candidate entity, and then the spancat can learn the difference between constraint or delegation.
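A real spancat suggester has to return spaCy's Ragged span candidates per doc, but the core idea can be sketched in plain Python (the pairing heuristic below is made up purely for illustration):

```python
def suggest_phrases(ent_spans, verb_idxs, max_gap=5):
    """Toy stand-in for a spancat suggester: pair each detected
    entity span (start, end token indices) with a verb that follows
    within max_gap tokens, and propose the covering span as a
    candidate phrase for the spancat to label."""
    candidates = []
    for start, end in ent_spans:
        for v in verb_idxs:
            if 0 <= v - end < max_gap:  # verb shortly after the entity
                candidates.append((start, v + 1))
    return candidates

# "The Member States shall inform ..." -> entity covers tokens 0-2,
# main verb at index 4: the candidate span covers tokens 0-4.
print(suggest_phrases([(0, 3)], [4]))  # [(0, 5)]
```

The spancat would then assign "constraint" or "delegation" to each candidate, using the surrounding context, which is exactly where the entity-dependent ambiguity from problem (b) could be learned.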