UnicodeDecodeError:

Hi, I am exploring the NER recipes and got a UnicodeDecodeError when I tried to use a patterns file for manual annotation. I am using an AWS Windows instance.
As you can see, I am using your news_headlines.jsonl file, so I assume there won't be an encoding issue?
The first and second commands run without problems, but the third one fails with "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte". (PS: news_headlinesCopy.jsonl is a duplicate of news_headlines.jsonl, to make sure I am not using my own file.)

python -m prodigy ner.manual ner_news_headlines blank:en Projects\prodigy\news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION
python -m prodigy terms.to-patterns ner_news_headlines --label PERSON,ORG,PRODUCT,LOCATION --spacy-model blank:en > .\Projects\prodigy\news_pattern.jsonl
python -m prodigy ner.manual news_data blank:en .\Projects\prodigy\news_headlinesCopy.jsonl --label PERSON,ORG,PRODUCT,LOCATION --patterns .\Projects\prodigy\news_pattern.jsonl

I have an update about my error. After I change into the folder Projects/prodigy, the error becomes "prodigy ner.manual: error: the following arguments are required: source". I also looked at the documentation, which says "Path to text source or - to read from standard input.", but I have already specified the path to the text, correct?

Can you help me understand what is going wrong? Thank you!

cd Projects\prodigy
python -m prodigy ner.manual blank:en .\news_headlinesCopy.jsonl --label PERSON,ORG,PRODUCT,LOCATION --patterns .\news_pattern.jsonl

Sorry, just to correct the previous question: I forgot one argument in this command, so my question is still the first one, the UnicodeDecodeError issue. Thank you!


Can you share one example from your news_headlines.jsonl file that is causing the error? That way I can try to reproduce it locally and figure out what is going wrong.

Thank you! I have attached two files, news_headlinesCopy.jsonl is just the duplicate file of news_headlines.jsonl with fewer lines.

news_headlines.jsonl (19.2 KB)
news_headlinesCopy.jsonl (381 Bytes)

I'm trying to repeat your steps. I started by creating a folder called issue-6037 and moving your files in there with the names news_headlines.jsonl and news_headlines_small.jsonl. From there I started annotating via this recipe:

python -m prodigy ner.manual ner_news_headlines blank:en news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION

This is what that interface looks like:

I annotated six examples and I hit the save button. Next, I ran your terms recipe.

python -m prodigy terms.to-patterns ner_news_headlines --label PERSON,ORG,PRODUCT,LOCATION --spacy-model blank:en > news_pattern.jsonl

This is what my news_pattern.jsonl file looks like:

{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"\u2019s"},{"lower":"lesson"},{"lower":":"},{"lower":"silicon"},{"lower":"valley"},{"lower":"\u2019s"},{"lower":"start"},{"lower":"-"},{"lower":"up"},{"lower":"machine"},{"lower":"needs"},{"lower":"fixing"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"pearl"},{"lower":"automation"},{"lower":","},{"lower":"founded"},{"lower":"by"},{"lower":"apple"},{"lower":"veterans"},{"lower":","},{"lower":"shuts"},{"lower":"down"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"how"},{"lower":"silicon"},{"lower":"valley"},{"lower":"pushed"},{"lower":"coding"},{"lower":"into"},{"lower":"american"},{"lower":"classrooms"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"women"},{"lower":"in"},{"lower":"tech"},{"lower":"speak"},{"lower":"frankly"},{"lower":"on"},{"lower":"culture"},{"lower":"of"},{"lower":"harassment"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"silicon"},{"lower":"valley"},{"lower":"investors"},{"lower":"flexed"},{"lower":"their"},{"lower":"muscles"},{"lower":"in"},{"lower":"uber"},{"lower":"fight"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"is"},{"lower":"a"},{"lower":"creature"},{"lower":"of"},{"lower":"an"},{"lower":"industry"},{"lower":"struggling"},{"lower":"to"},{"lower":"grow"},{"lower":"up"}]}

And I think, looking at this file, that the recipe isn't doing what you hoped it would. Notice how each row has "PERSON,ORG,PRODUCT,LOCATION" as its label, and how each pattern covers a whole headline rather than a single entity? While this isn't the error message that you're experiencing, I'm assuming that it's related. The terms.to-patterns recipe is designed to be used together with the terms.teach recipe, not the ner.manual one.

This YouTube video helps explain how it's meant to be used.
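Separately from the label issue, the reported message itself ("can't decode byte 0xff in position 0") is what a file beginning with a UTF-16 byte order mark produces when it is read as UTF-8. Windows PowerShell 5.1 writes UTF-16 LE by default when output is redirected with `>`, so a patterns file created that way is one possible cause worth ruling out. This has not been confirmed for the file in this thread. The sketch below is a minimal check using only the Python standard library; the function names `detect_bom` and `rewrite_as_utf8` are made up for illustration and are not part of Prodigy.

```python
import codecs
from pathlib import Path

# Byte order marks and a codec that decodes the file and drops the mark.
# UTF-32 is listed before UTF-16 because the UTF-32 LE mark starts with
# the same two bytes (FF FE) as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(raw: bytes):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return None


def rewrite_as_utf8(path) -> bool:
    """Rewrite the file in place as UTF-8 without a BOM.

    Returns True if the file was re-encoded and False if no BOM was found,
    in which case the file is left untouched.
    """
    path = Path(path)
    raw = path.read_bytes()
    encoding = detect_bom(raw)
    if encoding is None:
        return False
    path.write_bytes(raw.decode(encoding).encode("utf-8"))
    return True
```

Calling `rewrite_as_utf8` on the patterns file before passing it to `--patterns` would show whether the encoding was the problem: if it returns False, the file had no byte order mark and the cause lies elsewhere.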

Custom Recipe

That said, nothing is stopping you from writing a custom script that turns your previous annotations into patterns. Here's a small script that does that.

import srsly
import prodigy
from prodigy.components.db import connect

@prodigy.recipe(
    "terms.from-ner",
    ner_dataset=("Dataset to load NER annotations from", "positional", None, str),
    file_out=("File to write patterns into", "positional", None, str),
)
def custom_recipe(ner_dataset: str, file_out: str):
    # Connect to the Prodigy database
    db = connect()
    # Load the annotated examples
    annotated = db.get_dataset(ner_dataset)
    # Collect each unique (span text, label) pair across all examples
    pattern_set = set()
    for example in annotated:
        for span in example.get("spans", []):
            # Spans store character offsets into the example text
            pattern_str = example["text"][span["start"]:span["end"]]
            # Store as a tuple, because tuples are hashable and can go in a set
            tup = (pattern_str, span["label"])
            pattern_set.add(tup)
    # One string pattern per unique pair, written as one JSON object per line
    patterns = [{"pattern": p, "label": l} for p, l in pattern_set]
    srsly.write_jsonl(file_out, patterns)
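The span extraction in that loop can also be tried without a Prodigy database. The following is a minimal sketch that assumes examples shaped the way the recipe reads them: a "text" string plus "spans" with "start", "end" and "label" keys. The helper `make_span`, the sample headline's capitalisation and the sample annotations are made up for illustration, and skipping examples whose "answer" is not "accept" is an added assumption that the recipe above does not make.

```python
def spans_to_patterns(examples):
    """Collect unique (span text, label) pairs as string patterns."""
    seen = set()
    for example in examples:
        # Assumption: only accepted examples should contribute patterns.
        if example.get("answer", "accept") != "accept":
            continue
        text = example["text"]
        for span in example.get("spans", []):
            # Slice the entity text out using the character offsets.
            seen.add((text[span["start"]:span["end"]], span["label"]))
    # Sort so that the output order is stable between runs.
    return [{"pattern": p, "label": l} for p, l in sorted(seen)]


def make_span(text, phrase, label):
    """Hypothetical helper: build a span dict for the first match of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


headline = "Pearl Automation, Founded by Apple Veterans, Shuts Down"
examples = [
    {
        "text": headline,
        "answer": "accept",
        "spans": [
            make_span(headline, "Pearl Automation", "ORG"),
            make_span(headline, "Apple", "ORG"),
        ],
    },
    # A rejected example is ignored entirely.
    {
        "text": headline,
        "answer": "reject",
        "spans": [make_span(headline, "Veterans", "PERSON")],
    },
]

patterns = spans_to_patterns(examples)
for pattern in patterns:
    print(pattern)
```

Sorting the pairs is a small design choice: a set has no guaranteed order, so sorting keeps the written file identical between runs on the same annotations.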

If you're curious how to work with patterns and custom code, you may appreciate the guide in the docs here. When I run this locally via:

python -m prodigy terms.from-ner ner_news_headlines patterns.jsonl -F recipe.py

Then the file patterns.jsonl contains this:

{"pattern":"Apple","label":"ORG"}
{"pattern":"Silicon Valley","label":"LOCATION"}
{"pattern":"Uber","label":"ORG"}
{"pattern":"Pearl Automation","label":"ORG"}

I can now use these patterns to do ner.manual.

python -m prodigy ner.manual news_data blank:en news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION --patterns patterns.jsonl

Here's what it looks like:

Note how some entities are pre-labelled but also note that there's now PATTERN metadata in there. This tells you which patterns got activated. I hope this helps!

Thank you for providing such detailed instructions. I will try it shortly.

Just a quick question regarding the patterns file: I am following this tutorial. It seems to me that its patterns file looks the same as my news_pattern.jsonl, and it does use terms.to-patterns with ner.manual, right?

prodigy ner.manual ner_fashion_brands en_core_web_sm ./reddit_fashion.jsonl --label FASHION_BRAND --patterns ./fashion_brand_patterns.jsonl

I also have a question about the last command you posted. Are we supposed to use news_headlines_small.jsonl instead of news_headlines.jsonl? I assumed that the patterns file is used as an aid (highlighting) for annotating a new file (news_headlines_small.jsonl), rather than the old file (news_headlines.jsonl) that we annotated manually in the first place.

python -m prodigy ner.manual news_data blank:en news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION --patterns patterns.jsonl

That URL links to a general docs page. Which section are you referring to?

I'm not 100% sure if I follow. The --patterns patterns.jsonl part of the command is what causes the highlighting to happen. You can apply these highlights to either news_*.jsonl file. I think my section doesn't really use the small version anywhere. Or am I mistaken?

Manual annotation with patterns
Maybe my understanding is not correct, but fashion_brand_patterns.jsonl looks the same as my previous patterns file, and it is used with the ner.manual recipe?

Could you share your patterns file then? I tried following your steps here and noticed your patterns would've looked like this:

{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"\u2019s"},{"lower":"lesson"},{"lower":":"},{"lower":"silicon"},{"lower":"valley"},{"lower":"\u2019s"},{"lower":"start"},{"lower":"-"},{"lower":"up"},{"lower":"machine"},{"lower":"needs"},{"lower":"fixing"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"pearl"},{"lower":"automation"},{"lower":","},{"lower":"founded"},{"lower":"by"},{"lower":"apple"},{"lower":"veterans"},{"lower":","},{"lower":"shuts"},{"lower":"down"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"how"},{"lower":"silicon"},{"lower":"valley"},{"lower":"pushed"},{"lower":"coding"},{"lower":"into"},{"lower":"american"},{"lower":"classrooms"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"women"},{"lower":"in"},{"lower":"tech"},{"lower":"speak"},{"lower":"frankly"},{"lower":"on"},{"lower":"culture"},{"lower":"of"},{"lower":"harassment"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"silicon"},{"lower":"valley"},{"lower":"investors"},{"lower":"flexed"},{"lower":"their"},{"lower":"muscles"},{"lower":"in"},{"lower":"uber"},{"lower":"fight"}]}
{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"is"},{"lower":"a"},{"lower":"creature"},{"lower":"of"},{"lower":"an"},{"lower":"industry"},{"lower":"struggling"},{"lower":"to"},{"lower":"grow"},{"lower":"up"}]}

This did not seem correct to me because of the label name and token patterns attached.
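A quick way to check a patterns file for the label problem described here is to compare every "label" value against the labels passed to --label. This is only a sketch using the standard library: the function name `find_suspect_labels` is hypothetical, and the first sample line is a shortened version of the rows shown above.

```python
import json


def find_suspect_labels(lines, allowed_labels):
    """Return (line number, label) for every pattern with an unexpected label."""
    suspects = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label = json.loads(line)["label"]
        if label not in allowed_labels:
            suspects.append((line_no, label))
    return suspects


sample = [
    # Shortened row in the style of the terms.to-patterns output above
    '{"label":"PERSON,ORG,PRODUCT,LOCATION","pattern":[{"lower":"uber"},{"lower":"is"}]}',
    # Row in the style of the custom recipe output
    '{"label":"ORG","pattern":"Uber"}',
]
allowed = {"PERSON", "ORG", "PRODUCT", "LOCATION"}

suspects = find_suspect_labels(sample, allowed)
print(suspects)
```

A comma-joined value such as "PERSON,ORG,PRODUCT,LOCATION" is reported because it is a single string that matches none of the four individual labels.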

Yes, it does. I am attaching my file here. Thanks again!
news_pattern.jsonl (7.1 KB)