Hi everybody,
My team and I are working on a domain-specific project, and I am writing to ask for your advice on how to proceed, since I am relatively new to NLP and spaCy.
So far I have used spaCy and Prodigy to perform NER tasks on a text corpus of legislative documents. The basic idea was to identify and extract the different kinds of entities mentioned in our corpus. I have been able to annotate documents in various ways, import custom word vectors into spaCy, and pre-train tok2vec weights to boost model performance. So far so good.
However, my team recently asked me to investigate whether we could add a new element to the analysis. More specifically, we want to understand the extent to which the entities are involved in two kinds of situations: "delegation" (i.e., when an entity is given authority to do something) or "constraint" (i.e., when an entity's prerogatives are limited and/or it is required to do something).
As a starting point, we looked at the texts and identified a number of key elements (mostly verbs), grouped them into 'families', and came up with a sort of grid that helps identify each situation depending on how the elements are used. For example, a permissive modal verb often indicates a "delegation" situation ("the ENTITY may designate..."), but when the same modal verb is used in the negative form it indicates a "constraint" ("the ENTITY may not... do this or that").
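To make the grid more concrete, here is a minimal sketch in plain Python of how I imagine it working. The family names, the word lists, the `classify` helper, and the obligation-modal rows are all made up for illustration; only the "may" / "may not" rule comes from our actual grid, and negation is detected very naively (a negation word directly after the verb).

```python
# Hypothetical sketch of the "grid": (verb family, negated?) -> situation.
# Family names and word lists are illustrative, not our real grid.
VERB_FAMILIES = {
    "permissive_modal": {"may", "can"},
    "obligation_modal": {"shall", "must"},  # assumption: not confirmed by our grid
}

GRID = {
    ("permissive_modal", False): "delegation",
    ("permissive_modal", True): "constraint",
    ("obligation_modal", False): "constraint",  # assumption
    ("obligation_modal", True): "constraint",   # assumption
}

NEGATIONS = {"not", "never"}


def classify(tokens):
    """Return the situation for the first key verb found, or None."""
    lowered = [t.lower() for t in tokens]
    for i, tok in enumerate(lowered):
        for family, words in VERB_FAMILIES.items():
            if tok in words:
                # Naive negation check: a negation word directly after the verb.
                negated = i + 1 < len(lowered) and lowered[i + 1] in NEGATIONS
                return GRID.get((family, negated))
    return None


print(classify("The ENTITY may designate a representative".split()))      # delegation
print(classify("The ENTITY may not designate a representative".split()))  # constraint
```

A lookup table like this is easy to inspect and extend, but it obviously ignores sentence structure, which is why I am asking about a proper model.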
The problem is that I am now a bit lost as to which kind of model would be appropriate.
At first I thought that something along the lines of the "relation extraction" method would be the best fit for our needs.
On the other hand, I wonder whether a custom POS model could be a solution, so as to identify the families of verbs we are interested in, as well as the way they are used in each sentence (negative vs. affirmative form, active or passive, with or without a modal verb, and so on). Ultimately we would combine the results of the NER with the POS tags.
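As a rough illustration of that last combination step, the sketch below pairs each entity with the verb group that follows it. It uses plain tuples as a stand-in for the token attributes the NER and tagger would produce; the example sentence, the `INSTITUTION` label, and the `entity_verb_pairs` helper are hypothetical, and multi-token entities are not handled.

```python
# Each token is (text, pos, ent_label). In practice these values would come
# from the tagger and NER components; here they are written by hand.
sentence = [
    ("The", "DET", ""),
    ("Commission", "PROPN", "INSTITUTION"),  # hypothetical entity label
    ("may", "AUX", ""),
    ("not", "PART", ""),
    ("adopt", "VERB", ""),
    ("measures", "NOUN", ""),
]


def entity_verb_pairs(tokens):
    """Pair each entity token with the modal, negation and verb after it."""
    pairs = []
    for i, (text, _pos, ent) in enumerate(tokens):
        if not ent:
            continue
        modal, negated, verb = None, False, None
        # Scan forward until the first main verb.
        for next_text, next_pos, _ in tokens[i + 1:]:
            if next_pos == "AUX" and modal is None:
                modal = next_text.lower()
            elif next_text.lower() == "not":
                negated = True
            elif next_pos == "VERB":
                verb = next_text.lower()
                break
        pairs.append({
            "entity": text,
            "label": ent,
            "modal": modal,
            "negated": negated,
            "verb": verb,
        })
    return pairs


for pair in entity_verb_pairs(sentence):
    print(pair)
```

This relies purely on word order, so I assume it would break on passives or long sentences, which is part of why I am unsure whether POS tags alone are enough.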
Any advice you could give me would be much appreciated. What would be the best course of action in your opinion?
Moreover, I have trouble figuring out how the NER part of the analysis would fit into the whole picture, in practice. Should we train NER and relation extraction sequentially or at the same time? And what about the annotation?
Thanks, I'm looking forward to your input.
Best,
g.