NER for long string

hi @jiebei!

Thanks for your message!

I'm a little confused by what you mean by "about 1,130 manual citations in the pattern file".

Are these individual examples of citations? Would these not be annotated examples?

How did you obtain these annotations? Did you use the ner.manual recipe or some other way?

I'll assume that these 1,130 annotations were created by ner.manual for my response below. If I'm not right in assuming that, please let me know.

I can see the challenge with whole citations being too long. I would add that it's not just because they're long, but also because they can be complex (e.g., a lot of punctuation, numbers, etc.).

Typically, patterns are helpful for starting without any annotations. @koaning has a great PyData talk where he shows a workflow for this:

In this case, the matcher (pattern) rules help to provide initial annotations on an unlabeled dataset, which then could be used to train a model.

I suspect that "nothing can be highlighted" because you may have errors in your pattern file. Are you able to test an individual pattern by running it through spaCy to confirm it works?
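Before even reaching spaCy, a quick sanity check is to confirm every line of the pattern file is valid JSON with the keys Prodigy expects. Here's a minimal stdlib sketch (the file name `patterns_sample.jsonl` and the `CITATION` label are placeholders for your own):

```python
import json
from pathlib import Path

def check_patterns(path):
    """Report lines in a Prodigy patterns file that are not valid JSON
    or are missing the "label"/"pattern" keys Prodigy expects."""
    problems = []
    lines = Path(path).read_text(encoding="utf8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((lineno, f"invalid JSON: {err}"))
            continue
        for key in ("label", "pattern"):
            if key not in entry:
                problems.append((lineno, f"missing key: {key!r}"))
    return problems

# Demo with a small sample file; the second line is truncated JSON.
sample = Path("patterns_sample.jsonl")
sample.write_text(
    '{"label": "CITATION", "pattern": [{"LOWER": "v."}]}\n'
    '{"label": "CITATION", "pattern": \n',
    encoding="utf8",
)
for lineno, msg in check_patterns(sample):
    print(f"line {lineno}: {msg}")  # flags line 2 as invalid JSON
```

If this passes, the next step would be loading one pattern into spaCy's `Matcher` to confirm it actually matches your example text.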

If your 1,130 manual annotations were created with ner.manual, I think you may benefit from ignoring patterns, building an initial model, and then using "model-in-the-loop" training to add new annotations while improving the model.

Step 1: create a dedicated evaluation dataset

I would recommend you partition your 1,130 manual annotations into dedicated training and evaluation datasets. You can see the recent post below where I describe why creating a dedicated evaluation set is good practice when running experiments to improve your model. That post includes a snippet of code that takes an existing Prodigy dataset (let's say it's named dataset) and creates two new datasets: train_dataset and eval_dataset. As that post describes, this is important because it keeps your evaluation dataset fixed, instead of allowing Prodigy to create a new holdout every time you run prodigy train.
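The core of that idea is just a seeded shuffle over your annotated examples; the seed keeps the evaluation set stable across runs. A rough stdlib sketch (the names and the 80/20 split are illustrative; the linked post has the full snippet using Prodigy's database API):

```python
import random

def split_examples(examples, eval_fraction=0.2, seed=42):
    """Shuffle annotated examples with a fixed seed, then split them into
    a training set and a held-out evaluation set. The fixed seed means
    the same examples land in the evaluation set every time."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_fraction)
    return examples[n_eval:], examples[:n_eval]

# Demo with stand-in annotations (yours come from your Prodigy dataset)
annotations = [{"text": f"example {i}"} for i in range(1130)]
train, evaluation = split_examples(annotations)
print(len(train), len(evaluation))  # → 904 226
```

You would then load the two splits back into Prodigy as train_dataset and eval_dataset so every training run evaluates against the same holdout.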

Step 2: train an initial ner model

I would then recommend training an ner model and saving it. I know that you have multiple hierarchies -- which makes it even more challenging -- but I would recommend starting with your top level first.

When you train this model, as that post recommends, you will need to specify both your training data (let's call it train_dataset) and your evaluation data (eval_dataset):

```
python -m prodigy train model_folder --ner train_dataset,eval:eval_dataset
```

This will save your model to the model_folder folder.

Step 3: use ner.correct for model predictions, not patterns

Then use the ner.correct recipe without patterns on additional unlabeled data. ner.correct uses your ML model, not the patterns, to provide the initial labels. You will need to provide the location of your model (model_folder).

Once you get your new corrected data, you'll likely want to combine it with your initial training data (train_dataset) using the db-merge command to create one new "combined" training dataset (initial annotations + newly corrected ones).
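Conceptually, the merge just concatenates the datasets while dropping duplicate tasks. A small sketch of that idea (the dataset contents are made up, and using "_task_hash" as the duplicate key is my simplification, not Prodigy's actual internals):

```python
def merge_datasets(*datasets):
    """Combine several lists of annotation tasks, keeping the first
    occurrence of each task (identified here by its "_task_hash")."""
    seen = set()
    merged = []
    for dataset in datasets:
        for task in dataset:
            key = task.get("_task_hash", id(task))
            if key in seen:
                continue  # skip tasks already present in an earlier dataset
            seen.add(key)
            merged.append(task)
    return merged

# Demo: one overlapping task between the initial and corrected data
train_dataset = [{"_task_hash": 1, "text": "a"}, {"_task_hash": 2, "text": "b"}]
corrected = [{"_task_hash": 2, "text": "b"}, {"_task_hash": 3, "text": "c"}]
combined = merge_datasets(train_dataset, corrected)
print(len(combined))  # → 3
```

In practice you would just run db-merge and let Prodigy handle this for you; the sketch is only to show why the combined dataset isn't a simple sum of the two counts.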

Step 4: retrain your model

With your new combined dataset, try to retrain your full model.

I hope this helps! Let us know if you're able to make any progress.