hi @jiebei!
Thanks for your message!
I'm a little confused by what you mean by "about 1,130 manual citations in the pattern file".
Are these individual examples of citations? Would these not be annotated examples?
How did you obtain these annotations? Did you use the `ner.manual` recipe or some other way?
For my response below, I'll assume that these 1,130 annotations were created with `ner.manual`. If that assumption isn't right, please let me know.
I can see the challenge with whole citations being too long to match. I would add that it's not just that they're long, but also that they can be complex (e.g., a lot of punctuation, numbers, etc.).
Typically, patterns are helpful for starting without any annotations. @koaning has a great PyData talk where he shows a workflow for this.
In this case, the matcher (pattern) rules help to provide initial annotations on an unlabeled dataset, which can then be used to train a model.
I suspect that "nothing can be highlighted" because you may have errors in your pattern files. Are you able to take an individual pattern and run it through spaCy to confirm it works?
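Here's a rough sketch of what that check could look like (the label, pattern, and example sentence below are made up -- swap in one entry from your own patterns file). Prodigy's token patterns use the same format as spaCy's `Matcher`, so if the `Matcher` loads the pattern and finds a span, the pattern itself is fine:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # a blank pipeline is enough to test tokenization + matching
matcher = Matcher(nlp.vocab)

# One token pattern, e.g. from a patterns.jsonl line like:
# {"label": "CITATION", "pattern": [{"LOWER": "section"}, {"IS_DIGIT": true}]}
pattern = [{"LOWER": "section"}, {"IS_DIGIT": True}]
matcher.add("CITATION", [pattern])

doc = nlp("The court relied on Section 230 in its ruling.")
for match_id, start, end in matcher(doc):
    print(doc[start:end])  # prints "Section 230" if the pattern matches as expected
```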
If your 1,130 manual annotations were created with `ner.manual`, I think you may benefit from ignoring patterns, building an initial model, and then using "model-in-the-loop" training to add and correct annotations while improving the model.
Step 1: create a dedicated evaluation dataset
I would recommend you partition your 1,130 manual annotations into dedicated training and evaluation datasets. You can see this recent post below where I describe why creating a dedicated evaluation set is good practice when trying to run experiments to improve your model. That post includes a snippet of code that can take an existing Prodigy dataset (let's say it's named `dataset`) and create two new datasets: `train_dataset` and `eval_dataset`. As that post describes, this is important so that you keep your evaluation dataset fixed instead of allowing Prodigy to create a new holdout every time you run `prodigy train`.
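The linked post has the full snippet, but the idea is roughly the following (the dataset names are placeholders, and you should double-check the database calls against the Prodigy version you're running):

```python
import random
from prodigy.components.db import connect

db = connect()  # connect to the Prodigy database
examples = db.get_dataset("dataset")  # your 1,130 manual annotations

random.seed(0)
random.shuffle(examples)
split = int(len(examples) * 0.8)  # e.g., an 80/20 train/eval split

db.add_dataset("train_dataset")
db.add_dataset("eval_dataset")
db.add_examples(examples[:split], ["train_dataset"])
db.add_examples(examples[split:], ["eval_dataset"])
```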
Step 2: train an initial `ner` model
I would then recommend training a `ner` model and saving it. I know that you have multiple hierarchies -- which makes it even more challenging -- but I would recommend starting with your top level first.
When you train this model, as that post recommends, you will need to specify both your training data (let's call it `train_dataset`) and your evaluation data (`eval_dataset`):
```
python -m prodigy train model_folder --ner train_dataset,eval:eval_dataset
```
This will save your model to the `model_folder` folder.
Step 3: use the `ner.correct` recipe for model predictions, not patterns
Then use the `ner.correct` recipe with your model, without patterns, on additional unlabeled data. `ner.correct` uses your ML model, not the patterns, to suggest the initial labels. You will need to provide the location of your model (`model_folder`).
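As a rough sketch (the new dataset name, source file, and label here are placeholders for your own):

```
python -m prodigy ner.correct corrected_dataset model_folder ./unlabeled_data.jsonl --label CITATION
```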
Once you get your new corrected data, you'll likely want to combine it with your initial training data (`train_dataset`) by using the `db-merge` command to create one new "combined" training dataset (initial annotations + newly corrected ones).
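For example (again, the dataset names are just placeholders):

```
python -m prodigy db-merge train_dataset,corrected_dataset combined_dataset
```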
Step 4: Retrain your model
With your new combined dataset, try to retrain your full model.
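That retraining step could look just like the earlier train command, pointing at the combined dataset while keeping the same fixed evaluation set (the output folder name is a placeholder):

```
python -m prodigy train model_folder_v2 --ner combined_dataset,eval:eval_dataset
```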
I hope this helps and let us know if you are able to make any progress!