NER and Coref/Rel advice

Hello.

I have trained a NER model from scratch for legal entities using Prodigy with ner.manual and ner.correct. I have annotated over 150 legal documents, which break down to about 4000 paragraphs.
The model performs quite well at this point (around 80%), but I have two questions moving on from here.

First, I noticed that the quality of the model has recently stopped improving as I annotate additional documents. This is also corroborated by train-curve, which stops increasing after 75%. One of my 3 categories (the least common one) is underperforming, so I wonder whether the only way to improve my results is to fiddle with the hyperparameters, or whether you have any other suggestions or advice?

Then, I want to do the following on top of my NER model. It is common for my entities to refer or relate to others. For example, in the image below (using rel.manual), the paragraphs must be linked to the article above, because by themselves they have no value as entities:

I think that this falls into coreference resolution territory, but I tried using coref.manual with my NER model and got an error saying: "The provided model should have a tagger component to allow pattern matching on the POS tags for efficient coreference annotation". I see that the documentation notes that we need to bring our own model implementation and recommends neuralcoref, but how would that work with my NER model?

Alternatively, I am considering the rel.manual recipe (which starts fine), but even though I feed it my NER model, it does not find the named entities (for the same model and text, ner.correct identifies the entities correctly).

Am I doing something wrong here? How would you advise me to proceed? Sorry for the long post. I hope it is clear what I am trying to say, and thank you for your consistent help and support!

Hi!

First, I noticed that the quality of the model has recently stopped improving as I annotate additional documents. This is also corroborated by train-curve, which stops increasing after 75%. One of my 3 categories (the least common one) is underperforming, so I wonder whether the only way to improve my results is to fiddle with the hyperparameters, or whether you have any other suggestions or advice?

Some NER challenges are just hard, and it will not always be possible to get >90% F-score. It depends on what kind of entities they are, the lexical variation in the text, and the consistency of the annotations. If you have an underrepresented entity, you could try to boost its performance by annotating more examples of that specific entity. The danger is that by doing so, the distribution in your annotated data will differ significantly from realistic data, which might hurt the final performance of your system on unseen text. Note that if your test data also follows the same "artificially skewed" distribution, you won't notice the performance drop in your numbers right away; it will only become apparent when you start using the system in a realistic use case.

In short: I would reconsider the annotation/guidelines/inherent complexity of the one category that seems more difficult to predict.
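To see how the difficult category behaves on its own, it can help to score each label separately rather than relying on one overall number. The following is a minimal sketch in plain Python, assuming exact-match scoring on (start, end, label) spans. The label names and the example spans are hypothetical, since the thread does not name the three categories.

```python
from collections import Counter


def per_label_prf(gold_docs, pred_docs):
    """Exact-match precision/recall/F per label.

    gold_docs and pred_docs are parallel lists; each item is a set of
    (start, end, label) tuples for one document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        for span in pred:
            # A predicted span counts as correct only if offsets and label match.
            if span in gold:
                tp[span[2]] += 1
            else:
                fp[span[2]] += 1
        for span in gold - pred:
            fn[span[2]] += 1

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        # "support" is the number of gold spans with this label.
        scores[label] = {"p": p, "r": r, "f": f, "support": tp[label] + fn[label]}
    return scores


# Hypothetical labels and offsets, for illustration only.
gold = [
    {(0, 2, "ARTICLE"), (5, 7, "PARAGRAPH")},
    {(1, 3, "ARTICLE"), (4, 6, "LAW")},
]
pred = [
    {(0, 2, "ARTICLE")},
    {(1, 3, "ARTICLE"), (4, 5, "LAW")},
]
scores = per_label_prf(gold, pred)
for label, s in scores.items():
    print(label, s)
```

Looking at the support column next to the F-score makes it easier to tell whether a label scores low because it is rare in the evaluation data or because its annotations are inherently harder to predict.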

It is common for my entities to refer or relate to others. For example, in the image below (using rel.manual), the paragraphs must be linked to the article above, because by themselves they have no value as entities:

I totally get what you're trying to do. I think it's important to be clear about the definitions of the various tasks involved here though, because the example annotation in your figure will be difficult to learn as such. If we're talking about "coreference resolution", you would link "Article 490" to "that Article", as those two noun phrases refer to the same thing. In a subsequent step, you then want to link "paragraphs 3 and 5" to "that Article". Taking both steps together, you'll be able to understand that "paragraph 3" belongs to Article 490.
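The way the two steps combine can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: mentions are represented as strings rather than token offsets, the first mention in each cluster is treated as the canonical antecedent, and the function name `resolve_parts` is made up for this example.

```python
def resolve_parts(coref_clusters, part_of):
    """Replace the 'whole' side of each part_of pair by its coref antecedent.

    coref_clusters: list of lists of mentions; the first mention of each
                    cluster is assumed to be the canonical one.
    part_of:        list of (part, whole) pairs.
    """
    canonical = {}
    for cluster in coref_clusters:
        for mention in cluster:
            canonical[mention] = cluster[0]
    # Mentions that are in no cluster are kept unchanged.
    return [(part, canonical.get(whole, whole)) for part, whole in part_of]


# Step 1 (coref): "that Article" refers to "Article 490".
clusters = [["Article 490", "that Article"]]
# Step 2 (relation): each paragraph is part of "that Article".
relations = [("paragraph 3", "that Article"), ("paragraph 5", "that Article")]

resolved = resolve_parts(clusters, relations)
print(resolved)
```

In a real pipeline the mentions would be spans with character or token offsets, so that two occurrences of "that Article" in different places are not confused with each other.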

For the first step, annotating "Article 490" as being coreferential to "that Article", you can definitely use coref.manual for the annotation. This recipe works by disabling all tokens except for nouns, proper nouns and pronouns, which should make your annotation much more efficient. I understand that your custom trained NER model doesn't have a tagger component. What I recommend is that you compile a new pipeline with your NER model in it, and source the tagger component from one of our pretrained pipelines, like en_core_web_lg. Once you've created the config file for this, you can run spacy assemble to create the full pipeline, which you can then use as input for coref.manual.
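As an alternative to writing a config and running spacy assemble, the same idea can be sketched directly in Python with `nlp.add_pipe(..., source=...)`. This is a rough sketch, not a tested recipe for your model: the helper name `source_components` is made up, and the two blank pipelines below are stand-ins so the snippet runs without any downloads. With real models you would load your own NER pipeline and en_core_web_lg with `spacy.load`. As far as I know, the tagger in the pretrained English pipelines listens to a shared tok2vec component and the coarse POS tags are set by the attribute_ruler, so you would probably need to source those as well, in that order.

```python
import spacy


def source_components(target_nlp, source_nlp, names):
    """Copy the named components from source_nlp to the end of target_nlp."""
    for name in names:
        if name not in target_nlp.pipe_names:
            target_nlp.add_pipe(name, source=source_nlp)
    return target_nlp


# Stand-in for the custom NER pipeline (assumption: in practice this would be
# spacy.load("/path/to/your/ner-model")).
ner_nlp = spacy.blank("en")
ner_nlp.add_pipe("ner")

# Stand-in for the pretrained pipeline (assumption: in practice this would be
# spacy.load("en_core_web_lg")).
source_nlp = spacy.blank("en")
source_nlp.add_pipe("tagger")

# With en_core_web_lg the list would likely be
# ["tok2vec", "tagger", "attribute_ruler"] instead of just ["tagger"].
combined = source_components(ner_nlp, source_nlp, ["tagger"])
print(combined.pipe_names)
```

With real, trained components the combined pipeline could then be saved with `combined.to_disk(...)` and that directory passed as the model argument of the recipe.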

Easier yet, you could also just use en_core_web_lg as the input model for coref.manual. Ideally, you'd create the coref annotations independently of your NER annotations, as that helps ensure the model can generalize sufficiently.

Once you have the coref annotations, you'll need to bring in your own architecture to train a model on them. Ideally, the NER and coref components are independently trained on their respective tasks, and only in downstream processing would you combine the results of the two to deduce the information you need.

Alternatively, have a look into coreferee and try out its trained English model - it might help you hit the ground running.

Assuming you have NER and coref annotation, the next step would be to connect the correct snippets in text together. In your example sentence, "paragraphs 3 and 5" could be in a "part_of" relation with "that Article". You could train a REL model for this, but it depends on the grammatical/syntactic variance in your sentences. If most of your cases are pretty straightforward (as they are in this example), you might consider using something like spaCy's Dependency Matcher.
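To give an idea of what the Dependency Matcher route could look like, here is a small sketch for the "paragraphs ... of that Article" case. The dependency parse is written by hand so the snippet runs without a trained parser. The heads, labels and the "PART_OF" pattern name are assumptions for this one example, and a real pipeline would need a parser for the language in question.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse (assumption), heads are absolute token indices.
words = ["paragraphs", "3", "and", "5", "of", "that", "Article"]
heads = [0, 0, 1, 1, 0, 6, 4]
deps = ["ROOT", "nummod", "cc", "conj", "prep", "det", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Anchor on "paragraph(s)", follow its "of" preposition, then the object.
pattern = [
    {
        "RIGHT_ID": "part",
        "RIGHT_ATTRS": {"LOWER": {"IN": ["paragraph", "paragraphs"]}},
    },
    {
        "LEFT_ID": "part",
        "REL_OP": ">",
        "RIGHT_ID": "prep",
        "RIGHT_ATTRS": {"DEP": "prep", "LOWER": "of"},
    },
    {
        "LEFT_ID": "prep",
        "REL_OP": ">",
        "RIGHT_ID": "whole",
        "RIGHT_ATTRS": {"DEP": "pobj", "LOWER": "article"},
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PART_OF", [pattern])

relations = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern: part, prep, whole.
    part, _, whole = (doc[i] for i in token_ids)
    relations.append((part.text, nlp.vocab.strings[match_id], whole.text))
print(relations)
```

Rules like this only cover the constructions you write patterns for, so they make most sense when the sentences are as regular as the example. For more varied phrasing, a trained REL model is likely the better fit.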

I hope this helps, let me know if you run into issues with any of this!


Hi, @SofieVL!

First of all, my sincerest gratitude for your detailed answer. It helps a lot!

OK, I think I am covered for the NER part. I recognize that my problem is quite challenging, and I think that my score of about 75% should be considered successful given the circumstances.

You indeed totally got what I am trying to do, and this made it a bit clearer to me as well! So I should break my problem down into 3 different steps. First, the NER, which is already done. Then two separate tasks: coreference resolution, for expressions like "that Article" or "the above Article", and the dependency/relation step, where "paragraph 1" belongs or refers to "that Article".
Finally, I make a pipeline of those 3 components to solve my problem as a whole?

As for the implementation of these tasks, I first have to do some more research and experimentation before I can ask the correct questions. I didn't mention that my data is in Greek (I just used an equivalent example in English), so coreferee and many other models and resources available for English unfortunately won't work.

If you have any other suggestions or clarifications, I would appreciate them.
Thanks again for your help and your time!

Happy to hear the reply was useful!

So I should break my problem down into 3 different steps. First, the NER, which is already done. Then two separate tasks: coreference resolution, for expressions like "that Article" or "the above Article", and the dependency/relation step, where "paragraph 1" belongs or refers to "that Article".
Finally, I make a pipeline of those 3 components to solve my problem as a whole?

Yes - I do think that will be the ideal approach. I definitely agree that this will be more challenging for Greek. If you want to discuss any of the remaining challenges in more detail, you're also very welcome on the spaCy discussions forum, in case you need help with specific implementations or issues you run into.

I also want to mention that we are working on a built-in coref solution in spaCy, but it might still take a few more months to get this finished. We'll likely focus on English first, but we hope to arrive at a more generic solution in the longer term that would cover most languages. That said, the main issue with coref is getting annotated data, and I'm not aware of a Greek dataset (which is not to say that none exists). Either way, you can follow the progress here: Native coref component by svlandeg · Pull Request #7264 · explosion/spaCy · GitHub
