NER for long string

We are currently working on our citation analysis project and have questions about how to move forward with Prodigy. Basically, we want to create a model that can automatically identify different citation categories. We have a schema with two levels. So far, we have annotated the top level (Primary, Secondary) for the citations from 3 articles using ner.manual (about 1,130 manually annotated citations, which are now in the pattern file).

Yet after I generated the pattern file from the 1,130 manual annotations and used it with a new article, nothing was highlighted as a suggestion. I assume this means the pattern file is not effective at recognizing our citations, even at the top level, correct? To give you more concrete information, I am including a few example citations for Books and for Law reviews and journals below (they are all annotated as Secondary at this point). As you can see, the citations we are trying to annotate are pretty long, and they resemble each other in structure rather than in meaning.

At this point, I am not sure whether ner is the right recipe to continue with. Could you give us some suggestions for moving forward? Should we annotate more, or should we try another recipe? Thank you for your thoughts!

Top level: Primary
Second level (6 sub-categories), e.g.

  • Agency regulation
  • Newspaper

Top level: Secondary

Second level (10 sub-categories), e.g.

  • Book, e.g.


Pollman, Social and Asocial Enterprise, in THE CAMBRIDGE HANDBOOK OF SOCIAL ENTERPRISE LAW 1, 15 (Benjamin Means & Joseph W. Yockey eds., 2018

  • Law reviews and journals, e.g.

Alicia E. Plerhoples, Social Enterprise as Commitment: A Roadmap, 48 WASH. U. J.L. & POL’Y 89, 104 (2015) [hereinafter Social Enterprise as Commitment]

Anup Malani & Eric A. Posner, The Case for For-Profit Charities, 93 VA. L. REV. 2017, 2064-67 (2007).

Brian Galle, Keep Charity Charitable, 88 TEX. L. REV. 1213, 1214- 15 (2010)

hi @jiebei!

Thanks for your message!

I'm a little confused about what you mean by "about 1,130 manual citations in the pattern file".

Are these individual examples of citations? Would these not be annotated examples?

How did you obtain these annotations? Did you use the ner.manual recipe or some other way?

I'll assume that these 1,130 annotations were created by ner.manual for my response below. If I'm not right in assuming that, please let me know.

I can see the challenge of whole citations being so long. I would add that the difficulty is not just their length but also their complexity (e.g., a lot of punctuation, numbers, etc.).

Typically, patterns are helpful for getting started when you don't have any annotations. @koaning has a great PyData talk where he shows a workflow for this.

In this case, the matcher (pattern) rules help to provide initial annotations on an unlabeled dataset, which then could be used to train a model.

I suspect that nothing is highlighted because there are errors in your pattern file. Are you able to test an individual pattern by running it through spaCy to confirm it works?
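Before involving spaCy at all, a quick sanity check is to confirm that every line of the pattern file is valid JSON with a label and a pattern that is either a string (matched as an exact phrase) or a list of token dictionaries. The sketch below uses only the standard library. The helper name `check_pattern_lines` and the sample lines are made up for illustration, and the check does not verify that the token attributes are ones spaCy's Matcher accepts.

```python
import json


def check_pattern_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed entries."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append((number, f"not valid JSON: {err.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((number, "entry is not a JSON object"))
            continue
        if not isinstance(entry.get("label"), str):
            problems.append((number, "missing or non-string 'label'"))
        pattern = entry.get("pattern")
        if isinstance(pattern, str):
            continue  # string pattern: treated as an exact phrase
        if isinstance(pattern, list) and pattern and all(
            isinstance(token, dict) for token in pattern
        ):
            continue  # token-based pattern: one dict per token
        problems.append(
            (number, "'pattern' must be a string or a non-empty list of token dicts")
        )
    return problems


# Illustrative lines only; the last two mimic raw text stored instead of a pattern.
sample = [
    '{"label": "FRUIT", "pattern": [{"lower": "apple"}]}',
    '{"label": "Secondary", "pattern": "L. REV."}',
    '{"label": "Secondary", "text": "Brian Galle, Keep Charity Charitable"}',
    "Brian Galle, Keep Charity Charitable",
]
for line_number, problem in check_pattern_lines(sample):
    print(line_number, problem)
```

To run this on a real file, pass an open file handle, for example `check_pattern_lines(open(path, encoding="utf-8"))`. Opening the file explicitly as UTF-8 is also a cheap way to find out whether an encoding problem lives in the file itself.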

If your 1,130 manual annotations were created with ner.manual, I think you may benefit from ignoring patterns, building an initial model, and then using "model-in-the-loop" training to add new annotations while improving the model.

Step 1: create a dedicated evaluation dataset

I would recommend you partition your 1,130 manual annotations into dedicated training and evaluation datasets. See this recent post, where I describe why creating a dedicated evaluation set is good practice when you run experiments to improve your model. That post includes a snippet of code that takes an existing Prodigy dataset (let's say it's named dataset) and creates two new datasets: train_dataset and eval_dataset. As that post describes, this matters because it keeps your evaluation dataset fixed instead of letting Prodigy create a new holdout set every time you run prodigy train.
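The snippet from that post is not reproduced here, but the core idea can be sketched with the standard library: shuffle the examples once with a fixed seed and cut the list in two, so the evaluation set stays the same on every run. This assumes the annotations have been exported to a list of dictionaries (for example via `prodigy db-out`); the function name `split_examples`, the 20% evaluation fraction, and the seed are illustrative choices, not values taken from the post.

```python
import random


def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples once and split into (train, eval)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for exported annotations; real records would carry text and spans.
examples = [{"text": f"citation {i}"} for i in range(10)]
train_examples, eval_examples = split_examples(examples)
print(len(train_examples), len(eval_examples))
```

Note that the split is made over whole examples (texts), not over individual highlighted spans, so all spans from one text end up on the same side. Each half could then be written back to JSONL and loaded into its own dataset, for instance with `prodigy db-in`.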

Step 2: train an initial ner model

I would then recommend training a ner model and saving the model. I know that you have multiple hierarchies -- which makes it even more challenging -- but I would recommend starting with your top level first.

When you train this model, as that post recommends, you will need to specify both your training data (let's call it train_dataset) and your evaluation data (eval_dataset):

python -m prodigy train model_folder --ner train_dataset,eval:eval_dataset

This will save your model to the model_folder folder.

Step 3: use ner.correct with model predictions, not patterns

Then use the ner.correct recipe, without patterns, on additional unlabeled data. ner.correct uses your ML model's predictions, not the patterns, as the initial labels. You will need to provide the location of your model (model_folder).

Once you get your new corrected data, you'll likely want to combine it with your initial training data (train_dataset) by using the db-merge command to create one new "combined" training dataset (initial annotations + newly corrected ones).

Step 4: Retrain your model

With your new combined dataset, try to retrain your full model.

I hope this helps and let us know if you are able to make any progress!

Yes, we used ner.manual. We highlighted 1,130 strings (each labeled either Primary or Secondary), so they are individual examples, right?

For the pattern file, I was not sure how to test an individual pattern by running it through spaCy, as you suggested. I did have a pattern file issue when I was working on it (this is my original post): the pattern file that I generated from ner.manual returned a UnicodeDecodeError every time I tried to use it in ner.manual for new annotation tasks. After I used the custom recipe to convert it, I could use the updated pattern file with ner.manual, but no highlights appear.

I am attaching my original pattern file and the updated pattern file for your information.
cite_pattern_original.jsonl (1.3 MB)
patterns_updated.jsonl (113.0 KB)


Great! Then yes, I think you can avoid patterns. You can follow the steps I described. Ideally, before training, do step 1 to "partition" your dataset into a train_dataset and an eval_dataset. This is a best practice and will save you headaches in the future.

Thanks for providing your patterns! Yes, they're not set up correctly. That's why they aren't working. Some of your patterns seem to be raw text examples, not patterns.

Here's a good tutorial on patterns.

For example, this pattern:

{"label": "FRUIT", "pattern": [{"lower": "apple"}]}

will match any text like this: “apple”, “APPLE”, “Apple”, “ApPlLe”, etc. (that is, any token that is "apple" when lowercased).

Building a citation-based pattern would require a lot of combinations to match the complexity of something like:
"Pollman, Social and Asocial Enterprise, in THE CAMBRIDGE HANDBOOK OF SOCIAL ENTERPRISE LAW 1, 15 (Benjamin Means & Joseph W. Yockey eds., 2018"

For example, I found a Stack Overflow post that attempted to do something similar with spaCy matcher rules. It doesn't look like there was a great resolution.

But back to the image I posted from Vincent's video: patterns are helpful before you have any "manual" annotations. If you already have a good set of annotations to start with, which you do, then you'll likely find that training and iterating on a machine learning model will do better than rules.

Hi Ryan,
I was wondering how to proceed after step 4. After we have obtained a reasonably satisfactory model, should we move to ner.teach?

Also, when I was reading this tutorial, the example uses the spaCy model "en_core_web_sm", but I am a bit confused about whether we should use our own trained model or "en_core_web_sm" in our case.

For the "source", should we use new text data or the original text file that we used in the manual annotation stage? The example code uses "/news_headlines.jsonl" in all recipes. Sorry for my very basic questions, and thanks again!

prodigy textcat.teach news_topics en_core_web_sm ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

hi @jiebei!

Yes, you can move to ner.teach (i.e., active learning) if you want to improve the model.

However, if you're happy overall with your model's performance, you can add other extensions like spacy-streamlit. This is a Streamlit app that can be used to demo your model, which can be extremely helpful for showing your model to non-data scientists. You can run it locally or deploy it to a cloud environment.

You'll want to use the model that you previously trained from the manual (and/or corrected) annotations. In prodigy train, you need to specify the output_path for your model. Within that folder, it saves a model-last (the last version of your model) and a model-best (the best-performing version of your model). You can then use that path, e.g. output_path/model-last if you want your last run, as the model you pass to the recipe.

The example is a case where you want to use spaCy's built-in ner that has multiple different entity types like PERSON or ORG. Since you have custom entities, you need a model that has been trained for those entities.

Likely you'd want to use new text data. If your model has done a good job, it has likely already absorbed the information from the manual annotations and has thus already "learned" from those examples. What you want from ner.teach is to find your model's blind spots. The ner.teach recipe performs active learning: it changes the order in which examples are presented so that you label the ones the model is most uncertain about.
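The "most uncertain first" idea can be sketched in plain Python. This is a simplification: it sorts a finite list by how close each score is to 0.5, whereas ner.teach works on a stream of examples and its exact sorting logic is not described in this thread. The scores, labels, and function name are invented for illustration.

```python
def most_uncertain_first(scored_examples):
    """Order (score, example) pairs so scores nearest 0.5 come first."""
    return sorted(scored_examples, key=lambda pair: abs(pair[0] - 0.5))


# Hypothetical model confidences for four candidate annotations.
scored = [
    (0.98, "example A"),
    (0.52, "example B"),
    (0.10, "example C"),
    (0.40, "example D"),
]
for score, name in most_uncertain_first(scored):
    print(f"{score:.2f}  {name}")
```

Examples the model is already sure about, whether confidently accepted (0.98) or confidently rejected (0.10), sink to the bottom, so annotation time goes to the borderline cases first.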

It's important to note that while ner.teach (active learning) makes sense in theory, it doesn't always work in practice. As an alternative, you could keep using the ner.correct recipe, which is like ner.teach in that the model makes predictions on new text, but it doesn't reorder the examples.

Here's a great discussion from Matt about active learning.

Hope this helps!