Spanish NER by context

Prodigy is an annotation tool that lets you create training data for machine learning models. Whether the models will learn what you want them to learn, and whether your problem can be solved at all, depends on how you decide to break down your problem, the data you're labelling and how you're training the model.

Prodigy can help you label data more efficiently, and if you want to use spaCy (which Prodigy integrates with out-of-the-box), it can also help you run training experiments faster. But legal NLP isn't trivial, and you'll likely have to run a lot of experiments and try out different approaches until you end up with a system that works for you. It's also totally possible that after your experiments, you'll find out that a machine learning system currently isn't able to beat your regular expressions :wink:

If you have a set of regular expressions that's working well, you can use those to bootstrap training data and create suggestions. So instead of labelling everything by hand, you'll only need to correct what your rules got wrong.
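For example, here's a minimal sketch of what that bootstrapping step could look like. The regexes and labels are just placeholders for whatever your existing rules match, and the output is written in Prodigy's JSONL task format (one object per example with a `"text"` and pre-filled `"spans"`), so the suggested entities show up in the annotation interface and you only have to fix the mistakes:

```python
import json
import re

# Placeholder regexes for Spanish legal references – swap in your own rules.
PATTERNS = {
    "LEY": re.compile(r"Ley\s+\d+/\d{4}"),
    "ARTICULO": re.compile(r"[Aa]rt[íi]culo\s+\d+"),
}

def bootstrap_tasks(texts):
    """Turn raw texts into Prodigy-style tasks with pre-filled spans."""
    for text in texts:
        spans = []
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                spans.append({
                    "start": match.start(),
                    "end": match.end(),
                    "label": label,
                })
        yield {"text": text, "spans": spans}

if __name__ == "__main__":
    examples = ["El contrato se rige por la Ley 34/2002 y el artículo 10."]
    with open("bootstrap.jsonl", "w", encoding="utf8") as f:
        for task in bootstrap_tasks(examples):
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

You can then load a file like this into a manual NER interface and correct the pre-highlighted spans, instead of labelling every entity from scratch.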

You also want to keep an eye on the local context around the entities you're interested in, especially the surrounding tokens on each side, since those are what the model mostly relies on to make its decision. If the local context doesn't have enough clues, the model may struggle to learn the distinction. For cases like this, a mix of rules and a statistical model might be a better fit. This thread has more details and examples:
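If you do end up combining rules with a statistical model, one way to do it in spaCy is to add a rule-based `EntityRuler` before the statistical NER component, so the unambiguous patterns are always assigned by your rules and the model only has to handle the cases that genuinely need context. A rough sketch, assuming spaCy v3, the `es_core_news_sm` pipeline and a placeholder pattern and label:

```python
import spacy

# Assumes the Spanish pipeline is installed:
#   python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

# Add the rule-based component *before* the statistical NER so that
# entities assigned by the rules aren't overwritten by the model.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # Placeholder pattern and label: "artículo" followed by a number.
    {"label": "ARTICULO", "pattern": [{"LOWER": "artículo"}, {"IS_DIGIT": True}]},
])

doc = nlp("El acuerdo se firmó en Madrid conforme al artículo 10.")
print([(ent.text, ent.label_) for ent in doc.ents])
# The ruler catches "artículo 10", while the statistical model still
# handles context-dependent entities like "Madrid".
```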

You might also find @honnibal's talk on structuring NLP projects helpful, which also shows some examples of spaCy and Prodigy: