identify legal terms

Hello,

I am trying to identify phrases that refer to laws, for example:

  • …de acuerdo con lo dispuesto en el artículo 38.4 de la Ley 30/1992, de 26 de noviembre, de Régimen Jurídico...
  • …de conformidad con el artículo 6.2.a).1.º, párrafo segundo, del Real Decreto…
  • ...Infringir la Ley 1098 de 2006 en lo relativo a la prestación de servicios...
  • ...No dar aplicación a los mandatos de la Ley 1751 de 2015, en lo
    correspondiente a la prestación de los servicios de salud .

Can you tell me the best practice to achieve this?

Thanks...

Hi! There's no easy answer and it really depends on your data, what you're trying to extract, and so on. In some cases like a citation or case name, you might be able to predict the span directly as a named entity recognition task. In other cases, this is going to be very difficult to learn and it makes a lot more sense to predict a category over the whole sentence. And then there are things that can be extracted using token-based rules or a combination of rules and more general linguistic features. Ultimately, you want to try out different approaches and evaluate them on a representative set of annotated examples to find out what works best.
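As a rough illustration of the rule-based option, here is a minimal sketch that uses plain regular expressions from Python's standard library rather than spaCy's token-based matcher. The patterns, the `LAW` and `ARTICLE` labels and the `find_legal_refs` helper are assumptions made up for this example, based only on the sample phrases in the question. They are not a complete grammar for Spanish legal citations.

```python
import re

# Hypothetical patterns derived from the sample phrases in this thread:
#   law:     "Ley 30/1992", "Ley 1098 de 2006", "Real Decreto 123/2004"
#   article: "artículo 38.4"
LEGAL_REF = re.compile(
    r"\b(?P<law>(?:Ley|Real Decreto)\s+\d+(?:/\d{4}|\s+de\s+\d{4}))"
    r"|\b(?P<article>art[íi]culo\s+\d+(?:\.\d+)*)",
    flags=re.IGNORECASE,
)


def find_legal_refs(text):
    """Return a list of (start, end, label, matched_text) tuples."""
    spans = []
    for match in LEGAL_REF.finditer(text):
        # Exactly one of the two named groups is set for each match.
        label = "LAW" if match.group("law") else "ARTICLE"
        spans.append((match.start(), match.end(), label, match.group(0)))
    return spans


examples = [
    "de acuerdo con lo dispuesto en el artículo 38.4 de la Ley 30/1992, de 26 de noviembre",
    "Infringir la Ley 1098 de 2006 en lo relativo a la prestación de servicios",
    "de conformidad con el artículo 6.2.a).1.º, párrafo segundo, del Real Decreto…",
]

for text in examples:
    for start, end, label, matched in find_legal_refs(text):
        print(label, repr(matched), (start, end))
```

Note that the third example is only partly covered: the article pattern stops at "artículo 6.2" and the unnumbered "Real Decreto…" is not matched at all. Gaps like these are the reason to check any rule set against annotated examples before relying on it.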

I'd highly recommend checking out Daniel Hoadley's work on Blackstone, a spaCy pipeline and model for processing legal texts (in English). The README features a bunch of examples and there are also blog posts that discuss some of the considerations, like when to model a task as a named entity recognition problem, when to do text classification, etc. And some of the components make very clever use of rules to detect abbreviations and improve sentence boundary detection.


Hi Ines,
We would like to train (via transfer learning) a custom model like Blackstone on our own text and labels. Should we load the trained 'pipe' or use 'replace_pipe'? Also, do we use begin_training or resume_training? The spaCy documentation is not very clear. We want to make use of the Blackstone model as much as possible. However, we don't want to use the Blackstone labels, and we want to update/train the weights on domain-specific legal text.

Hi! If you don't want to re-use any of the labels, you should probably train from scratch, because you won't really be able to take advantage of the existing component weights. We typically also recommend the same if your goal is to add more labels than you want to keep. In that case, it's often more efficient to use the existing model to help you create annotations for the labels you want to keep, add your custom labels on top and then train from scratch. Otherwise, it'll make the results much harder to reason about, because you're constantly "fighting" the existing weights, dealing with forgetting effects, etc.

nlp.begin_training will reset and randomly initialize the weights. So you typically want to run that before you start training a new component. nlp.resume_training will resume training of an existing component and its weights, so you'd use that if you don't want to reset the weights. In that case, you definitely want to load the model with the pretrained component.


Thanks for the response. When we retrain a trained model from scratch, are the vectors/embeddings from the base model still retained? Or do we need to load the vectors explicitly?

If you start with a model that has word vectors (e.g. en_vectors_web_lg, en_core_web_lg), those will be used during training. Resetting the weights won't affect the vectors.
