Hi!
First, I noticed that the quality of the model has stopped increasing recently with the annotation of additional documents. This is also corroborated by the train-curve, which stops increasing after 75%. One of my 3 categories (the least common one) is underperforming, so I wonder whether the only way to improve my results is by fiddling with the hyper-parameters, or if you have any other suggestions/advice?
Some NER challenges are just hard, and it will not always be possible to get a >90% F-score. It depends on what kind of entities they are, the lexical variation in the text, and the consistency of the annotations. If you have an underrepresented entity, you could try to boost its performance by annotating more examples of that specific entity, but the danger is that by doing so, the distribution in your annotated data will deviate significantly from realistic data, which might hurt the final performance of your system on unseen text. Note that if your test data also follows the same "artificially skewed" distribution, you won't notice the performance drop in your numbers right away; it will only become apparent when you start using the system in a realistic use case.
In short: I would reconsider the annotation/guidelines/inherent complexity of the one category that seems more difficult to predict.
It is common for my entities to refer/relate to others. For example, in the image below (using rel.manual), the paragraphs must be linked to the article above, because by themselves they have no value as entities:
I totally get what you're trying to do. I think it's important to be clear about the definitions of the various tasks involved here though, because the example annotation in your figure will be difficult to learn as such. If we're talking about "coreference resolution", you would link "Article 490" to "that Article", as those two noun phrases refer to the same thing. In a subsequent step, you then want to link "paragraphs 3 and 5" to "that Article". Taking both steps together, you'll be able to understand that "paragraph 3" belongs to Article 490.
For the first step, annotating "Article 490" as being coreferential to "that Article", you can definitely use `coref.manual` for the annotation. The way this recipe works is that it disables all tokens except for nouns, proper nouns and pronouns, which should make your annotation much more efficient. I understand that your custom-trained NER model doesn't have a tagger component. What I recommend you do is compile a new pipeline with your NER model in it, and source the tagger component from one of our pretrained pipelines, like `en_core_web_lg`. Once you've created the config file for this, you can run `spacy assemble` to create the full pipeline, which you can then use as input for `coref.manual`.
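To make that concrete, here's a minimal sketch of such a sourcing config. The path `./my_ner_model` and the output directory are placeholder names, and depending on how your components were trained you may need to fill in further defaults (e.g. with `spacy init fill-config`) or add `replace_listeners`:

```ini
# config.cfg, a minimal sketch; "./my_ner_model" is a placeholder for your trained NER pipeline
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger", "ner"]

[components]

[components.tok2vec]
source = "en_core_web_lg"

[components.tagger]
source = "en_core_web_lg"

[components.ner]
source = "./my_ner_model"
# if your NER component was trained with a tok2vec listener, you may also need
# replace_listeners = ["model.tok2vec"] here so the sourced component is self-contained
```

```bash
python -m spacy assemble config.cfg ./ner_with_tagger
```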
Easier yet, you could also just use `en_core_web_lg` as the input model for `coref.manual`. Ideally you'd create your coref annotations independently of your NER annotations anyway, as that will ensure the model can generalize sufficiently.
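For instance, something along these lines, where `coref_dataset` and `./texts.jsonl` are placeholder names and the exact arguments are best double-checked against the recipe docs for your Prodigy version:

```bash
prodigy coref.manual coref_dataset en_core_web_lg ./texts.jsonl --label COREF
```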
Once you have the coref annotations, you'll need to bring in your own architecture to train a model on them. Ideally, the NER and coref components are independently trained on their respective tasks, and only in downstream processing would you combine the results of the two to deduce the information you need.
Alternatively, have a look into coreferee and try out its trained English model - it might help you hit the ground running.
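As a rough sketch of what trying it out could look like, assuming you've installed coreferee and its English model (e.g. `python -m coreferee install en`), and keeping in mind that coreferee only supports specific spaCy model versions:

```python
import spacy
import coreferee  # importing registers the "coreferee" pipeline component

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("coreferee")

doc = nlp(
    "Member States shall apply Article 490. "
    "Paragraphs 3 and 5 of that Article shall not apply before 2025."
)

# print whatever coreference chains were found (if any),
# and resolve individual anaphors back to their antecedents
doc._.coref_chains.print()
for token in doc:
    resolved = doc._.coref_chains.resolve(token)
    if resolved:
        print(token, "->", resolved)
```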
Assuming you have NER and coref annotations, the next step would be to connect the right snippets of text together. In your example sentence, "paragraphs 3 and 5" could be in a "part_of" relation with "that Article". You could train a REL model for this, but it depends on the grammatical/syntactic variance in your sentences. If most of your cases are pretty straightforward (as they are in this example), you might consider using something like spaCy's `DependencyMatcher`.
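As a sketch of that last option: the pattern below assumes a parse where "of" attaches to "paragraphs" as a preposition with "Article" as its object, so you'd want to verify it against the actual parses of your legal texts:

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_lg")
matcher = DependencyMatcher(nlp.vocab)

# "paragraphs ... of ... Article": anchor on "paragraph(s)", follow the "of"
# preposition down to its object "Article"
pattern = [
    {"RIGHT_ID": "paragraph", "RIGHT_ATTRS": {"LOWER": {"IN": ["paragraph", "paragraphs"]}}},
    {
        "LEFT_ID": "paragraph",
        "REL_OP": ">",
        "RIGHT_ID": "prep_of",
        "RIGHT_ATTRS": {"LOWER": "of", "DEP": "prep"},
    },
    {
        "LEFT_ID": "prep_of",
        "REL_OP": ">",
        "RIGHT_ID": "article",
        "RIGHT_ATTRS": {"LOWER": {"IN": ["article", "articles"]}, "DEP": "pobj"},
    },
]
matcher.add("PART_OF", [pattern])

doc = nlp("Paragraphs 3 and 5 of that Article shall not apply.")
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern nodes: paragraph, of, article
    paragraph, _, article = (doc[i] for i in token_ids)
    print(f"part_of({paragraph.text}, {article.text})")
```

Combined with the coref step above, you could then map "that Article" back to "Article 490" in downstream processing.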
I hope this helps, let me know if you run into issues with any of this!