identify legal terms

Hi! There's no easy answer and it really depends on your data, what you're trying to extract, and so on. In some cases, like a citation or case name, you might be able to predict the span directly as a named entity recognition task. In other cases, the exact span boundaries are going to be very difficult to learn and it makes a lot more sense to predict a category over the whole sentence. And then there are things that can be extracted using token-based rules or a combination of rules and more general linguistic features. Ultimately, you want to try out different approaches and evaluate them on a representative set of annotated examples to find out what works best.
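To give a rough idea of what the rule-based option can look like, here's a minimal sketch using only Python's standard `re` module. The citation format (a year in square brackets, an uppercase court code, a number), the `find_citations` helper and the example sentence are all assumptions made up for illustration, so you'd need to adapt the pattern to the citation styles in your own data. In spaCy, a similar rule could be written as token patterns instead of a regular expression.

```python
import re

# Assumed format: a neutral-citation-like string such as "[2019] UKSC 12".
# This is an illustrative pattern, not a complete grammar of legal citations.
CITATION_PATTERN = re.compile(
    r"\[(?P<year>\d{4})\]\s+(?P<court>[A-Z]{2,})\s+(?P<number>\d+)"
)


def find_citations(text):
    """Return (start, end, matched_text) for each citation-like span."""
    return [
        (match.start(), match.end(), match.group(0))
        for match in CITATION_PATTERN.finditer(text)
    ]


# Made-up example sentence with hypothetical case names.
example = "The court followed Smith v Jones [2019] UKSC 12 and Doe v Roe [2005] UKHL 41."
citations = find_citations(example)

for start, end, matched in citations:
    print(start, end, matched)
```

The character offsets are what you'd typically need if you want to turn the matches into pre-annotated spans and then correct them by hand, which is also a quick way to build the evaluation set mentioned above.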

I'd highly recommend checking out Daniel Hoadley's work on Blackstone, a spaCy pipeline and model for processing legal texts (in English). The README features a bunch of examples and there are also blog posts that discuss some of the considerations, like when to model a task as a named entity recognition problem and when to do text classification. Some of the components also make very clever use of rules to detect abbreviations and improve sentence boundary detection.
