I'm trying to repeat your steps. I started by creating a folder called issue-6037 and moving your files in there with the names news_headlines.jsonl and news_headlines_small.jsonl. From there I started annotating via this recipe:
python -m prodigy ner.manual ner_news_headlines blank:en news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION
This is what that interface looks like:
I annotated six examples and hit the save button. Next, I ran your terms recipe.
python -m prodigy terms.to-patterns ner_news_headlines --label PERSON,ORG,PRODUCT,LOCATION --spacy-model blank:en > news_pattern.jsonl
This is what my news_pattern.jsonl file looks like:
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"\u2019s"},{"lower":"lesson"},{"lower":":"},{"lower":"silicon"},{"lower":"valley"},{"lower":"\u2019s"},{"lower":"start"},{"lower":"-"},{"lower":"up"},{"lower":"machine"},{"lower":"needs"},{"lower":"fixing"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"pearl"},{"lower":"automation"},{"lower":","},{"lower":"founded"},{"lower":"by"},{"lower":"apple"},{"lower":"veterans"},{"lower":","},{"lower":"shuts"},{"lower":"down"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"how"},{"lower":"silicon"},{"lower":"valley"},{"lower":"pushed"},{"lower":"coding"},{"lower":"into"},{"lower":"american"},{"lower":"classrooms"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"women"},{"lower":"in"},{"lower":"tech"},{"lower":"speak"},{"lower":"frankly"},{"lower":"on"},{"lower":"culture"},{"lower":"of"},{"lower":"harassment"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"silicon"},{"lower":"valley"},{"lower":"investors"},{"lower":"flexed"},{"lower":"their"},{"lower":"muscles"},{"lower":"in"},{"lower":"uber"},{"lower":"fight"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"is"},{"lower":"a"},{"lower":"creature"},{"lower":"of"},{"lower":"an"},{"lower":"industry"},{"lower":"struggling"},{"lower":"to"},{"lower":"grow"},{"lower":"up"}]}
Looking at this file, I think the recipe isn't doing what you had hoped it would. Notice how each row has "PERSON,ORG,PRODUCT,LOCATION" as a label, and how each pattern covers an entire headline rather than a single entity? While this isn't the error message that you're experiencing, I'm assuming that it's related. The terms.to-patterns recipe is designed to be used together with the terms.teach recipe, not the ner.manual one.
This YouTube video helps explain how it's meant to be used.
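If you want to check a patterns file for this symptom yourself, here is a minimal sketch using only the standard library. The helper name suspicious_labels is my own and not part of Prodigy; it only flags labels that contain a comma, which is how the problem shows up in the file above.

```python
import json

def suspicious_labels(jsonl_lines):
    """Collect labels that look like several labels joined by commas.

    Hypothetical helper for illustration, not a Prodigy function.
    """
    labels = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        label = entry.get("label", "")
        if "," in label:
            labels.add(label)
    return labels

# Two sample rows: one shaped like the output above, one with a clean label.
sample = [
    '{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"pearl"},{"lower":"automation"}]}',
    '{"label":"ORG","pattern":"Uber"}',
]
print(suspicious_labels(sample))
```

To run it against a real file you could pass an open file handle, for example suspicious_labels(open("news_pattern.jsonl")), since the function only needs an iterable of lines.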
Custom Recipe
That said, nothing is stopping you from writing a custom script that turns your previous annotations into patterns. Here's a small script that does that.
import srsly
import prodigy
from prodigy.components.db import connect

@prodigy.recipe(
    "terms.from-ner",
    ner_dataset=("Dataset to load NER annotations from", "positional", None, str),
    file_out=("File to write patterns into", "positional", None, str)
)
def custom_recipe(ner_dataset: str, file_out: str):
    # Connect to Prodigy database
    db = connect()
    # Load in annotated examples
    annotated = db.get_dataset(ner_dataset)
    # Loop over examples and collect unique (text, label) pairs
    pattern_set = set()
    for example in annotated:
        for span in example.get("spans", []):
            # Slice the annotated text out using the span's character offsets
            pattern_str = example['text'][span['start']: span['end']]
            # Store into tuple, because sets need hashable items
            tup = (pattern_str, span['label'])
            pattern_set.add(tup)
    # Turn the unique pairs into pattern entries and write them as JSONL
    patterns = [{"pattern": p, "label": l} for p, l in pattern_set]
    srsly.write_jsonl(file_out, patterns)
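The core of that recipe can also be tried without a Prodigy database. The sketch below is a variation under a few assumptions of mine: the function name spans_to_patterns is hypothetical, the example dictionaries are made up to mimic the shape of annotated tasks, and I added a check on the "answer" key so that rejected or ignored examples are skipped, which the recipe above does not do.

```python
def spans_to_patterns(examples):
    """Turn annotated spans into unique string patterns.

    Hypothetical helper; assumes each example has "text", an optional
    "spans" list with "start", "end" and "label", and an optional "answer".
    """
    seen = set()
    for eg in examples:
        # Assumption: only keep spans from accepted examples
        if eg.get("answer", "accept") != "accept":
            continue
        for span in eg.get("spans", []):
            text = eg["text"][span["start"]:span["end"]]
            seen.add((text, span["label"]))
    # Sort so the output order is stable between runs
    return [{"pattern": p, "label": l} for p, l in sorted(seen)]

# Made-up examples for illustration only
examples = [
    {
        "text": "Pearl Automation, founded by Apple veterans, shuts down",
        "spans": [
            {"start": 0, "end": 16, "label": "ORG"},
            {"start": 29, "end": 34, "label": "ORG"},
        ],
        "answer": "accept",
    },
    {
        "text": "Uber is a creature of an industry struggling to grow up",
        "spans": [{"start": 0, "end": 4, "label": "ORG"}],
        "answer": "reject",
    },
]

for row in spans_to_patterns(examples):
    print(row)
```

Sorting the pairs is a small design choice: a plain set has no guaranteed order, so without it the lines in the output file can change order from run to run.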
If you're curious how to work with patterns and custom code, you may appreciate the guide in the docs here. When I run this locally via:
python -m prodigy terms.from-ner ner_news_headlines patterns.jsonl -F recipe.py
Then the file patterns.jsonl contains this:
{"pattern":"Apple","label":"ORG"}
{"pattern":"Silicon Valley","label":"LOCATION"}
{"pattern":"Uber","label":"ORG"}
{"pattern":"Pearl Automation","label":"ORG"}
I can now use these patterns with ner.manual.
python -m prodigy ner.manual news_data blank:en news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION --patterns patterns.jsonl
Here's what it looks like:
Note how some entities are pre-labelled, and that there's now PATTERN metadata in there as well. This tells you which patterns got activated. I hope this helps!