Corrections on an already annotated NER dataset

Hello everyone,

I have just finished a ner.manual labeling job (to train a new NER model) and have already saved the results to the DB. However, because of inconsistencies found at the last minute in the "tagging instructions", there are some errors that I would like to correct. Is there any way to "reload" the already manually labeled texts, so I can check and correct the wrongly labeled terms while keeping the good ones, without having to start over from scratch?

BTW, I have checked the whole Quickstart, but I did not find a match for my use case (or so I think).

Thank you!

hi @dave-espinosa!

Are you aware of the dataset: prefix you can use for the source? See "Loading from existing datasets".

For example, let's say your existing dataset is called ner_dataset. You found errors in that dataset, so you now want to run a recipe (e.g., ner.manual) that pre-highlights your old annotations.

You can then run it like:

python -m prodigy ner.manual new_ner_dataset blank:en dataset:ner_dataset ...

Does this solve your problem?

If so, thanks for the heads up. I think this would be a good tip to add to Quickstart!

Hello @ryanwesslen ,

I am re-opening this case, as I have noticed some inconsistencies with the proposed solution.

I will go step by step through my dummy example, so you can either follow the steps and spot an error on my side, or reproduce them and quickly troubleshoot, if that's the case.

First, the data to be used:

  • rawtext.jsonl: Data to be labeled with ner.manual "for the first time".
  • testpattern.jsonl: Patterns to be used when "reloading" the data.

# rawtext.jsonl
{"text":"I love Python and Java"}
{"text":"I don't know if studying Medicine or Computer Science"}
{"text":"That guy mastered Python in only 3 months! He's a Data Scientist now"}
{"text":"I am not clear... C++ is very complicated!"}
{"text":"I think Julia will eventually overcome Python"}

# testpattern.jsonl
{"label": "SKILL", "pattern": "Python"}
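
For reference, here is a minimal sketch in plain Python of the character-offset spans that the single pattern above should produce on the raw texts. It is not Prodigy's PatternMatcher: the helper name find_pattern_spans and the regex-based whole-word matching are assumptions for illustration only, and the real matcher works on spaCy tokens, so its behavior may differ.

```python
import re


def find_pattern_spans(text, patterns):
    """Return span dicts ("start", "end", "label") for exact whole-word string patterns."""
    spans = []
    for pat in patterns:
        # (?<!\w) and (?!\w) roughly approximate token boundaries
        regex = re.compile(r"(?<!\w)" + re.escape(pat["pattern"]) + r"(?!\w)")
        for match in regex.finditer(text):
            spans.append({"start": match.start(), "end": match.end(), "label": pat["label"]})
    # keep spans in document order
    return sorted(spans, key=lambda span: span["start"])


patterns = [{"label": "SKILL", "pattern": "Python"}]
print(find_pattern_spans("I love Python and Java", patterns))
# -> [{'start': 7, 'end': 13, 'label': 'SKILL'}]
```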

Next, the experiment design:

  1. Use rawtext.jsonl with ner.manual to generate the labeled_data1 SQLite table (NO patterns used here).
  2. "Reload" labeled_data1 with ner.manual as labeled_data2 AND include --patterns this time.

# Step 1 (Check "Results below")
python3 -m prodigy ner.manual labeled_data1 blank:en ./rawtext.jsonl --label SKILL
# Step 2 (Check "Results below")
python3 -m prodigy ner.manual labeled_data2 blank:en dataset:labeled_data1 --label SKILL --patterns ./testpattern.jsonl

Finally, the results:

  • For step 1, what I labeled looked as follows (notice how I purposely skipped "Python" in the 1st and 5th documents):

  • For step 2, I expected to see something like this:

But what I got, was the following:

In conclusion, I can see that --patterns works OK, but dataset: is NOT doing what I expected (i.e., I have to go through the tagging all over again, which is something I want to avoid).

Am I doing something wrong? Or am I understanding something wrong?

Thank you, @dave-espinosa!

First off, a huge thank you for your wonderful reproducible example. I was able to diagnose the problem very quickly. I can't thank you enough for the time you put into your questions; it helps us help you more quickly.

It looks like the default behavior is that pattern labels override pre-existing labels. I didn't realize this either until I went into the code base. I found a related post:

Ines mentions that this is intended:

If you run a recipe like ner.manual with patterns and examples with pre-defined "spans", those spans will be overwritten. That's expected – otherwise, the results would be pretty confusing, you'd constantly have to resolve overlaps between existing spans and matches etc.

The post also gave me an idea: programmatically create a new dataset that keeps the pre-existing annotations and appends the pattern matches found by PatternMatcher. You can then use that new dataset either for training or again in ner.manual to make corrections.

Here's the script:

from prodigy.components.db import connect
from prodigy.models.matcher import PatternMatcher
import spacy

# spacy model
nlp = spacy.blank("en")

# patterns file
patterns = "testpattern.jsonl"

db = connect()
# existing annotations
examples = db.get_dataset("labeled_data1")

# create pattern_matcher and load patterns from file
pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
pattern_matcher = pattern_matcher.from_disk(patterns)

# assign known patterns
examples_patterns = (eg for _, eg in pattern_matcher(examples))

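# loop to combine existing annotations with patterns
combined_examples = []
for eg, eg_p in zip(examples, examples_patterns):
    # examples saved without any annotation may lack a "spans" key
    spans = eg.setdefault("spans", [])

    # dedup overlapping spans (both pre-existing and pattern);
    # despite the name, seen_tokens holds character offsets
    seen_tokens = set()
    for entity_match in spans:
        # put all entity matches into seen
        seen_tokens.update(range(entity_match["start"], entity_match["end"]))

    for pattern_match in eg_p.get("spans", []):
        if pattern_match["start"] not in seen_tokens and pattern_match["end"] - 1 not in seen_tokens:
            # keep the pattern match only if it does not collide with an existing span
            spans.append(pattern_match)
            seen_tokens.update(range(pattern_match["start"], pattern_match["end"]))

    # keep spans in document order and collect the merged example
    spans.sort(key=lambda span: span["start"])
    combined_examples.append(eg)
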

db.add_dataset("labeled_data2")  # create a new dataset for combined examples
db.add_examples(combined_examples, ["labeled_data2"]) # load combined examples into this dataset

One tricky part was that I needed to account for overlapping (double-counted) spans that appear both in the pre-existing annotations and in the pattern matches. Hence I tried to write logic similar to spaCy's filter_spans that checks whether a span's characters have already been covered. I couldn't test it robustly, but it seems to do the trick.
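
One caveat with the check in the script: testing only the first and last character of a pattern match would still let through a match that fully contains a shorter existing span. Below is a minimal sketch of a stricter variant, in plain Python with no Prodigy dependency, that compares the half-open [start, end) ranges directly. The helper names overlaps and merge_spans are made up for illustration, and the offsets in the example are hand-written; treat it as a sketch under those assumptions, not as Prodigy's own behavior.

```python
def overlaps(a, b):
    """True if two half-open character ranges [start, end) share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_spans(existing, candidates):
    """Keep every existing span; add each candidate that overlaps nothing kept so far."""
    merged = list(existing)
    for cand in candidates:
        if not any(overlaps(cand, kept) for kept in merged):
            merged.append(cand)
    # return spans in document order
    return sorted(merged, key=lambda span: span["start"])


existing = [{"start": 5, "end": 8, "label": "SKILL"}]
candidates = [
    {"start": 3, "end": 10, "label": "SKILL"},   # fully contains the existing span: rejected
    {"start": 12, "end": 18, "label": "SKILL"},  # no overlap: kept
]
print(merge_spans(existing, candidates))
# -> [{'start': 5, 'end': 8, 'label': 'SKILL'}, {'start': 12, 'end': 18, 'label': 'SKILL'}]
```

Existing annotations always win here, which matches the intent of the script: pattern matches only fill in what was not labeled by hand.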

Now try running that code to create a new dataset, labeled_data2, which has the original labels plus the pattern matches.

Then rerun ner.manual with dataset:labeled_data2 and without the patterns (since the script already adds them), so you now see both the pattern matches and the original labels:

python -m prodigy ner.manual labeled_data3 blank:en dataset:labeled_data2 --label SKILL

You can also run print-dataset to get a quicker preview:

python -m prodigy print-dataset labeled_data2
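
If you'd rather check the merged spans from Python, here is a small sketch that maps each span's character offsets back to the text it covers. It assumes the task format used in this thread ("text" plus "spans" with "start", "end" and "label"); span_texts is a hypothetical helper name, and the example record is hand-written rather than real output.

```python
def span_texts(example):
    """Return (covered text, label) pairs for every span in one annotated example."""
    text = example["text"]
    return [
        (text[span["start"]:span["end"]], span["label"])
        for span in example.get("spans", [])
    ]


# hand-written example record, for illustration only
example = {
    "text": "I love Python and Java",
    "spans": [
        {"start": 7, "end": 13, "label": "SKILL"},
        {"start": 18, "end": 22, "label": "SKILL"},
    ],
}
print(span_texts(example))
# -> [('Python', 'SKILL'), ('Java', 'SKILL')]
```

You could apply it to each example returned by db.get_dataset("labeled_data2") to confirm that every offset lands on the text you expect.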

Hope this helps!