data-to-spacy losing annotations


I have annotated 1,000 documents one label at a time for 16 labels and have each label as a separate dataset in prodigy (on the same 1,000 documents). For some reason when I run data-to-spacy on all of those datasets I am losing all but 2 of the labels. I also tried to run db-merge first but that has the same result. I tried running prodigy train and the same thing happens.

Could you let me know why this might be happening? I'm assuming it's to do with the process by which prodigy merges the annotations on the same documents.



hi @ewencb!

Thanks for your question and welcome to the Prodigy community :wave:

I'm wondering if you may be running into an issue with hashing: if one document has multiple entities, data-to-spacy may only take the first one because each document has a single unique hash.

Could you try adding --rehash, which force-updates all hashes assigned to examples?

If this doesn't work, the team can dig in more.

One last tip: you can also look under the hood at the source code of any built-in recipe, whether it is db-merge, data-to-spacy, or a task recipe like ner.manual. Just run prodigy stats, then find the Location: where your Prodigy library is saved. Go to that folder and look for the recipes folder. There you'll find all of the built-in recipes as Python scripts. It can be messy, but it's there in case you want to dig in.

Thanks for the reply. I've just tried db-merge with --rehash and then data-to-spacy on the merged dataset, but it made no difference to the output: I still end up with only 2/16 labels. I then tried combining just 2 datasets and I only get 1 label in the output .spacy files. I also tried to look at review with all the datasets, and it looks like I can only choose one annotation session/label for each document, if that helps explain something?

Hi Christopher.

I may have found the issue on our end, but before explaining it in more detail I figured I'd ask for some extra information, since it may help me debug/understand your problem a bit better. Could you share your call to python -m prodigy data-to-spacy? I'm mainly interested in understanding the task that you're training for.

That said, I think the issue is that our training scripts use the _input_hash as the definition of a unique example. If there is only one label to consider, this is fine. But once there are multiple datasets that each have their own label, you'd want to use the _task_hash instead. I may be glossing over a detail here, so this is something I want to pick up with a colleague, but my gut says that this is the issue.
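To make the suspicion concrete, here is a minimal sketch in plain Python (no Prodigy calls) of how keeping one example per _input_hash would drop labels. The helper dedupe_by_input_hash is hypothetical and only mimics the assumed behaviour; it is not Prodigy's actual implementation.

```python
# Two annotations of the same text, one from each single-label dataset.
names = {
    "text": "hello my name is joseph",
    "_input_hash": 467632156,
    "spans": [{"start": 17, "end": 23, "label": "name"}],
}
greeting = {
    "text": "hello my name is joseph",
    "_input_hash": 467632156,
    "spans": [{"start": 0, "end": 5, "label": "greeting"}],
}


def dedupe_by_input_hash(examples):
    """Hypothetical helper: keep only the first example seen per _input_hash."""
    seen = {}
    for ex in examples:
        seen.setdefault(ex["_input_hash"], ex)
    return list(seen.values())


kept = dedupe_by_input_hash([names, greeting])
labels = {span["label"] for ex in kept for span in ex["spans"]}
print(len(kept), labels)
```

Only one example survives, so the "greeting" span is lost even though it was annotated.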

To unblock you, I think this script would work.

from prodigy.components.db import connect 

# Fetch the one dataset that has all your examples, the one you
# created with the db-merge command
db = connect()
old_examples = db.get_dataset_examples("<old-dataset-name>")

# Now, we'll manually replace the input_hash with the task_hash
updated_examples = [{**ex, '_input_hash': ex['_task_hash']} for ex in old_examples]
# add_examples expects a list of dataset names, not a single string
db.add_examples(updated_examples, datasets=["<new-dataset-name>"])

Could you try running the data-to-spacy command on "<new-dataset-name>"? My gut says that should unblock you, but I'll gladly hear it if that's not the case.

Unfortunately I get the same result :frowning: I had a look into the source code in Prodigy to see if I could work out what might be going wrong, but it was too much to get my head round.

Each dataset should contain the same 1000 examples but with a different label annotated in each one so the command is like:

prodigy data-to-spacy merged_corpus --ner label1,label2,label3,label4,label5 --config assets/config.cfg --base-model assets/base_model

Using the script you sent this is what I tried to do (I'm using spacy projects so wrapped it up in a command line script) but got the same result:

import typer
from prodigy.components.db import connect
from import data_to_spacy
from import db_merge

def merge_datasets(output_dir: str,
                   eval_split: float,
                   config: str,
                   base_model: str):
    db = connect()
    datasets = db.datasets
    db_merge(in_sets=datasets, out_set='combined', rehash=True)
    combined_examples = db.get_dataset_examples('combined')
    merged_examples = [
        {**ex, '_input_hash': ex['_task_hash']} for ex in combined_examples
    ]
    db.add_examples(merged_examples, ('merged',))

if __name__ == "__main__":
    typer.run(merge_datasets)

Let me try something else then. First, I'll try to recreate your situation by annotating this data.

{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}

I'm doing these two recipe calls to generate two datasets. One with names, the other with greetings.

python -m prodigy ner.manual issue-6864-names blank:en examples.jsonl --label name
python -m prodigy ner.manual issue-6864-greeting blank:en examples.jsonl --label greeting

The interfaces look like this.

For names

For greetings

So that means that right now I have a dataset titled issue-6864-names and another titled issue-6864-greeting that share input hashes but still have a different label attached. This is confirmed by db-out.

However, I noticed something interesting in the db-out calls. I'm listing the last item from both sets.

From python -m prodigy db-out issue-6864-greeting:

{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":0,"end":5,"token_start":0,"token_end":0,"label":"greeting"}],"answer":"accept","_timestamp":1698746894,"_annotator_id":"2023-10-31_11-08-03","_session_id":"2023-10-31_11-08-03"}

From python -m prodigy db-out issue-6864-names:

{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":17,"end":23,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1698746872,"_annotator_id":"2023-10-31_11-07-38","_session_id":"2023-10-31_11-07-38"}

Notice how the task hashes are the same too? That's why my "trick" didn't work. I would've expected them to differ, but I'll explore that later because I want to unblock you first.
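If you want to check the same thing on your own data, a small sketch like the following could help. It works on plain lists of example dicts (for instance, loaded from db-out files), and shared_task_hashes is a hypothetical helper name, not part of Prodigy. The sample values are taken from the two db-out lines above.

```python
from collections import defaultdict


def shared_task_hashes(datasets):
    """Return the task hashes that appear in more than one dataset.

    `datasets` maps a dataset name to its list of example dicts.
    """
    seen = defaultdict(set)
    for name, examples in datasets.items():
        for ex in examples:
            seen[ex["_task_hash"]].add(name)
    return {h: names for h, names in seen.items() if len(names) > 1}


datasets = {
    "issue-6864-greeting": [
        {"_input_hash": 467632156, "_task_hash": -1782316920,
         "spans": [{"start": 0, "end": 5, "label": "greeting"}]},
    ],
    "issue-6864-names": [
        {"_input_hash": 467632156, "_task_hash": -1782316920,
         "spans": [{"start": 17, "end": 23, "label": "name"}]},
    ],
}

shared = shared_task_hashes(datasets)
print(shared)
```

Any hash reported here belongs to examples that carry different annotations but would still be treated as the same task.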

Might a script like this work instead? I'm thinking that we ignore the hashes for now and just manually make sure that each example has the appropriate spans attached.

from prodigy.components.db import connect

# Fetch the examples from every single-label dataset
db = connect()
dataset_names = ["issue-6864-names", "issue-6864-greeting"]
old_examples = []
for name in dataset_names:
    old_examples.extend(db.get_dataset_examples(name))

def dedup_spans(example):
    """Keep one span per (start, end, label) combination."""
    key_values = {}
    for span in example['spans']:
        key = (span['start'], span['end'], span['label'])
        key_values[key] = span
    return list(key_values.values())

def merge(examples):
    """We need an extra function because dictionaries aren't hashable."""
    key_values = {}
    for ex in examples:
        key = ex['_input_hash']
        if key not in key_values:
            # First time we see this text: copy it so the input isn't mutated
            key_values[key] = {**ex, 'spans': list(ex['spans'])}
        else:
            # Same text again: collect the spans from the other dataset
            key_values[key]['spans'].extend(ex['spans'])
    for ex in key_values.values():
        ex['spans'] = dedup_spans(ex)
    return list(key_values.values())

# These new examples should now have the spans merged and deduplicated.
new_examples = merge(old_examples)

Let me know! I'll gladly help you get unblocked if this doesn't work. On my end it seems that if I run this I do get examples that look like they contain all the spans that were annotated.

This is what the final example in new_examples looks like on my end.

{'text': 'hello my name is james', '_input_hash': -1294982232, '_task_hash': 465224705, '_is_binary': False, 'tokens': [{'text': 'hello', 'start': 0, 'end': 5, 'id': 0, 'ws': True}, {'text': 'my', 'start': 6, 'end': 8, 'id': 1, 'ws': True}, {'text': 'name', 'start': 9, 'end': 13, 'id': 2, 'ws': True}, {'text': 'is', 'start': 14, 'end': 16, 'id': 3, 'ws': True}, {'text': 'james', 'start': 17, 'end': 22, 'id': 4, 'ws': False}], '_view_id': 'ner_manual', 'spans': [{'start': 17, 'end': 22, 'token_start': 4, 'token_end': 4, 'label': 'name'}, {'start': 0, 'end': 5, 'token_start': 0, 'token_end': 0, 'label': 'greeting'}], 'answer': 'accept', '_timestamp': 1698746861, '_annotator_id': '2023-10-31_11-07-38', '_session_id': '2023-10-31_11-07-38'}
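As a quick sanity check before running data-to-spacy again, you could count how often each label occurs in the merged examples. This is a plain-Python sketch; count_labels is a hypothetical helper, and the sample is an abridged copy of the example above.

```python
from collections import Counter


def count_labels(examples):
    """Count span labels across examples; examples without spans are skipped."""
    counts = Counter()
    for ex in examples:
        for span in ex.get("spans", []):
            counts[span["label"]] += 1
    return counts


merged_sample = [
    {
        "text": "hello my name is james",
        "_input_hash": -1294982232,
        "spans": [
            {"start": 17, "end": 22, "label": "name"},
            {"start": 0, "end": 5, "label": "greeting"},
        ],
    },
]

counts = count_labels(merged_sample)
print(counts)
```

If a label you annotated is missing from the counts, the merge has dropped it.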

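Thank you! I had to modify the merge function to deal with the fact that not every example has a span for a label:

def merge(examples):
    key_values = {}
    for ex in examples:
        key = ex['_input_hash']
        if key in key_values:
            if 'spans' in ex:
                if 'spans' in key_values[key]:
                    # Both have spans: combine them
                    key_values[key]['spans'].extend(ex['spans'])
                else:
                    # Only the new example has spans
                    key_values[key]['spans'] = list(ex['spans'])
        else:
            key_values[key] = ex
    for ex in key_values.values():
        if 'spans' in ex:
            ex['spans'] = dedup_spans(ex)
    return list(key_values.values())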

I am now unblocked :slight_smile:


Happy to hear it. But yeah, I've made an internal ticket to discuss this. Our training recipes make a bunch of assumptions and this issue serves as a nice reminder that they may not always hold.

If you get stuck again, do let me know!


This might be a good segue into an annotation best practices refresher. For NER/spancat, I have always annotated with all of my labels at once. This is the slowest method and I'm sure there is a faster way.


Although this may classify as another topic entirely, I'm also interested in @kylebigelow 's suggestion.

For my use case, assuming I have entities A, B, C, D and E that are connected by certain tokens, e.g. a comma (,) or conjunctions like "and" and "&", what would be the best way to consider all entities part of a certain span group?

At present it seems like hand-rolling a pattern for EntityRuler (for it to match A) and using repetitive sub-patterns of this to match a SpanRuler that would embrace something like A & B of the C involves a lot of redundant work.

To remove the abstraction and use a concrete example in the legal domain, assuming Sec. 1, Sec. 2, and Sec. 3 as part of The Law of Spacy, how would all entities be combined together to form a single long span? Put another way, is there an easier way to connect entities and treat them as one span, separate from their being short entities?

Hi @mv3,

Sorry for the delayed reply! Picking up from your concrete example, I assume that you'd like the whole span Sec. 1, Sec. 2, and Sec. 3 as part of The Law of Spacy to be a match?
One thing you could try is to introduce one extra step and label the connectors with rules like these (a multi-token connector such as "as part of" needs its own multi-token pattern):
{"label": "CONNECTOR", "pattern": [{"LOWER": {"IN": [",", "and"]}}]}
{"label": "CONNECTOR", "pattern": [{"LOWER": "as"}, {"LOWER": "part"}, {"LOWER": "of"}]}

Then, in the following step, you could have rules that match consecutive entities and connectors, which should match the entire span you're after:
{"ENT_TYPE": {"IN": ["ENTITY", "CONNECTOR"]}, "OP": "+"}
You probably also want to keep only the longest matches, which the SpanRuler should let you do through its spans_filter setting.

Hi @magdaaniol, thank you for the reply! Yes I've been using a variant of this approach by merging span ranges that consist of consecutive labels, it looks something like this:

def __call__(self, doc: Doc) -> Doc:
    # Token ranges of the base labels
    base_ranges: list[tuple[int, int]] = []
    for label in self.labels:
        base_ranges.extend([(s.start, s.end) for s in filter_spans(doc.spans["sc"]) if s.label_ == label])

    # Token ranges of the connectors between them
    link_ranges: list[tuple[int, int]] = []
    for label in ("link", "caption"):
        link_ranges.extend([(s.start, s.end) for s in filter_spans(doc.spans["ruler"]) if s.label_ == label])

    if ranges := base_ranges + link_ranges:
        doc.spans[self.spans_key] = list(
            self.get_spans(doc, merged=self.merge_ranges(ranges), in_range=base_ranges)
        )
    return doc
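The merge_ranges method itself isn't shown above, so the following is only a guess at what such a helper might do: a plain-Python sketch that fuses token ranges which touch or overlap, assuming end-exclusive (start, end) pairs like spaCy's Span.start and Span.end.

```python
def merge_ranges(ranges):
    """Fuse (start, end) token ranges that touch or overlap.

    Hypothetical stand-in for the method used above; ends are exclusive,
    so (0, 2) and (2, 3) are adjacent and get merged into (0, 3).
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Extend the previous range instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two entities (0, 2) and (3, 5) joined by a connector (2, 3),
# plus an unrelated entity further along.
print(merge_ranges([(0, 2), (3, 5), (2, 3), (8, 10)]))  # [(0, 5), (8, 10)]
```

Each merged range can then be turned back into a span with doc[start:end].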

It took me a while to move away from the NER camp and just shift to spans. I'm interested in best practices of this nature. I find that many of the GitHub discussions/issues and this forum have great gems worth bookmarking.
