data-to-spacy losing annotations

ewencb · October 26, 2023, 3:41pm

Hello

I have annotated 1,000 documents one label at a time for 16 labels and have each label as a separate dataset in prodigy (on the same 1,000 documents). For some reason when I run data-to-spacy on all of those datasets I am losing all but 2 of the labels. I also tried to run db-merge first but that has the same result. I tried running prodigy train and the same thing happens.

Could you let me know why this might be happening? I'm assuming it's to do with the process by which prodigy merges the annotations on the same documents.

Thanks!

Chris

ryanwesslen · October 26, 2023, 4:08pm

hi @ewencb!

Thanks for your question and welcome to the Prodigy community!

I'm wondering if you may be running into an issue with hashing. Say one document has multiple entities: data-to-spacy may only take the first one, because examples are identified by each document's hash.

Could you try adding --rehash, which force-updates all hashes assigned to the examples?

If this doesn't work, the team can dig in more.

One last tip: you can also look underneath and see the source code for any built in recipes, whether it is db-merge, data-to-spacy, or any task recipes like ner.manual. Just run prodigy stats then find the Location: of where your Prodigy library is saved. Go to that folder, then look for the recipes folder. You'll then find all of the built-in recipes as Python scripts. It can be messy, but this is just a tip in case you want to look underneath.

ewencb · October 26, 2023, 4:22pm

Thanks for the reply. I've just tried db-merge with --rehash and then data-to-spacy on the merged dataset, but it made no difference to the output: I still end up with only 2 of the 16 labels. I then tried combining just 2 datasets and only got 1 label in the output .spacy files. I also tried the review recipe with all the datasets, and it looks like I can only choose one annotation session/label for each document, if that helps explain something.

koaning · October 30, 2023, 9:43am

Hi Christopher.

I may have found the issue on our end, but before explaining it in more detail I figured I'd also ask for some extra information, since it may help me debug/understand your problem a bit better. Could you share the call to python -m prodigy data-to-spacy? I'm mainly interested in understanding the task that you're training for.

That said, I think the issue is that our training scripts use the _input_hash as the definition of a unique example. If there is only one label to consider, this is fine. But once there are multiple datasets that each have their own label, you'd want to use the _task_hash instead. I may be glossing over a detail here, so this is something I want to pick up with a colleague, but my gut says that this is the issue.
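
To picture that suspicion, here is a minimal, Prodigy-free sketch of what happens when examples are deduplicated on _input_hash alone. The keep_first_per_input helper and the toy data are hypothetical; this is not the actual logic in Prodigy's training recipes, only an illustration of the suspected failure mode.

```python
def keep_first_per_input(examples):
    """Keep only the first example seen for each _input_hash (hypothetical helper)."""
    seen = {}
    for ex in examples:
        seen.setdefault(ex["_input_hash"], ex)
    return list(seen.values())


# Two annotation passes over the same text, one label per pass.
examples = [
    {"_input_hash": 1, "text": "hello my name is james",
     "spans": [{"start": 17, "end": 22, "label": "name"}]},
    {"_input_hash": 1, "text": "hello my name is james",
     "spans": [{"start": 0, "end": 5, "label": "greeting"}]},
]

kept = keep_first_per_input(examples)
labels = {span["label"] for ex in kept for span in ex["spans"]}
print(labels)  # only the label from the first pass survives
```

If something like this happens during training, every label annotated in a later pass over the same text would be dropped silently.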

To unblock you, I think this script would work.

from prodigy.components.db import connect 

# Fetch the one dataset that has all your examples, the one you
# created with the db-merge command
db = connect()
old_examples = db.get_dataset_examples("<old-dataset-name>")

# Now, we'll manually replace the input_hash with the task_hash
updated_examples = [{**ex, '_input_hash': ex['_task_hash']} for ex in old_examples]
db.add_examples(updated_examples, "<new-dataset-name>")

Could you try running the data-to-spacy command on "<new-dataset-name>"? My gut says that should unblock you, but I'll gladly hear it if that's not the case.

ewencb · October 30, 2023, 8:19pm

Unfortunately I get the same result. I had a look at data_utils.py in Prodigy to see if I could work out what might be going wrong, but it was too much to get my head around.

Each dataset should contain the same 1000 examples but with a different label annotated in each one so the command is like:

prodigy data-to-spacy merged_corpus --ner label1,label2,label3,label4,label5 --config assets/config.cfg --base-model assets/base_model

Using the script you sent this is what I tried to do (I'm using spacy projects so wrapped it up in a command line script) but got the same result:

import typer
from prodigy.components.db import connect
from prodigy.recipes.train import data_to_spacy
from prodigy.recipes.commands import db_merge


def merge_datasets(output_dir: str,
                   eval_split: float,
                   config: str,
                   base_model: str):
    db = connect()
    datasets = db.datasets
    db_merge(in_sets=datasets, out_set='combined', rehash=True)
    combined_examples = db.get_dataset_examples('combined')
    merged_examples = [
        {**ex, '_input_hash': ex['_task_hash']} for ex in combined_examples
    ]
    db.add_dataset('merged')
    db.add_examples(merged_examples, ('merged',))
    data_to_spacy(output_dir=output_dir,
                  ner='merged',
                  eval_split=eval_split,
                  base_model=base_model,
                  verbose=True)


if __name__ == "__main__":
    typer.run(merge_datasets)

koaning · October 31, 2023, 2:12pm

Let me try something else then. First, I'll try to recreate your situation by annotating this data.

{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}

I'm doing these two recipe calls to generate two datasets. One with names, the other with greetings.

python -m prodigy ner.manual issue-6864-names blank:en examples.jsonl --label name
python -m prodigy ner.manual issue-6864-greeting blank:en examples.jsonl --label greeting

The interfaces look like this.

[Screenshot: ner.manual interface for names]

[Screenshot: ner.manual interface for greetings]

So that means that right now I have a dataset titled issue-6864-names and another titled issue-6864-greeting that share input hashes but still have a different label attached. This is confirmed by db-out.

However, I noticed something interesting in the db-out calls. I'm listing the last item from both sets.

From `python -m prodigy db-out issue-6864-greeting`:

{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":0,"end":5,"token_start":0,"token_end":0,"label":"greeting"}],"answer":"accept","_timestamp":1698746894,"_annotator_id":"2023-10-31_11-08-03","_session_id":"2023-10-31_11-08-03"}

From `python -m prodigy db-out issue-6864-names`:

{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":17,"end":23,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1698746872,"_annotator_id":"2023-10-31_11-07-38","_session_id":"2023-10-31_11-07-38"}

Notice how the task hashes are the same too? That's why my "trick" didn't work. I would've expected them to differ, but I'll explore that later because I want to unblock you first.
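
A quick way to check for this situation in other datasets is to group examples by both hashes and flag groups whose labels differ. This is only a sketch on plain dictionaries shaped like the db-out output above; find_hash_collisions is a hypothetical helper, not part of Prodigy.

```python
from collections import defaultdict


def find_hash_collisions(examples):
    """Report (_input_hash, _task_hash) pairs whose examples carry different labels."""
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["_input_hash"], ex["_task_hash"])].append(ex)

    collisions = {}
    for key, group in groups.items():
        # One frozenset of labels per example; more than one distinct set is a collision.
        label_sets = {
            frozenset(span["label"] for span in ex.get("spans", []))
            for ex in group
        }
        if len(label_sets) > 1:
            collisions[key] = sorted(sorted(labels) for labels in label_sets)
    return collisions


# Trimmed-down versions of the two db-out records shown above.
examples = [
    {"_input_hash": 467632156, "_task_hash": -1782316920,
     "spans": [{"start": 0, "end": 5, "label": "greeting"}]},
    {"_input_hash": 467632156, "_task_hash": -1782316920,
     "spans": [{"start": 17, "end": 23, "label": "name"}]},
]

collisions = find_hash_collisions(examples)
print(collisions)
```

Any key that shows up in the result is an example where neither hash can tell the two annotations apart.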

Might a script like this work instead? I'm thinking that we ignore the hashes for now and just manually make sure that each example has the appropriate spans attached.

from prodigy.components.db import connect

# Fetch the examples from every dataset that should be combined
db = connect()
dataset_names = ["issue-6864-names", "issue-6864-greeting"]
old_examples = []
for name in dataset_names:
    old_examples.extend(db.get_dataset_examples(name))

def dedup_spans(example):
    """Drop duplicate spans. We key on a tuple because dictionaries aren't hashable."""
    key_values = {}
    for span in example['spans']:
        key = (span['start'], span['end'], span['label'])
        key_values[key] = span
    return list(key_values.values())

def merge(examples):
    """Collect the spans of all examples that share an _input_hash on one example."""
    key_values = {}
    for ex in examples:
        if ex['_input_hash'] not in key_values:
            key_values[ex['_input_hash']] = ex
        else:
            key_values[ex['_input_hash']]['spans'].extend(ex['spans'])
    merged = list(key_values.values())
    for ex in merged:
        ex['spans'] = dedup_spans(ex)
    # Return one example per input hash, not the input list with its unmerged duplicates
    return merged

# These new examples should now have the spans merged and deduplicated.
new_examples = merge(old_examples)

Let me know! I'll gladly help you get unblocked if this doesn't work. On my end it seems that if I run this I do get examples that look like they contain all the spans that were annotated.

This is what the final example in new_examples looks like on my end.

{'text': 'hello my name is james', '_input_hash': -1294982232, '_task_hash': 465224705, '_is_binary': False, 'tokens': [{'text': 'hello', 'start': 0, 'end': 5, 'id': 0, 'ws': True}, {'text': 'my', 'start': 6, 'end': 8, 'id': 1, 'ws': True}, {'text': 'name', 'start': 9, 'end': 13, 'id': 2, 'ws': True}, {'text': 'is', 'start': 14, 'end': 16, 'id': 3, 'ws': True}, {'text': 'james', 'start': 17, 'end': 22, 'id': 4, 'ws': False}], '_view_id': 'ner_manual', 'spans': [{'start': 17, 'end': 22, 'token_start': 4, 'token_end': 4, 'label': 'name'}, {'start': 0, 'end': 5, 'token_start': 0, 'token_end': 0, 'label': 'greeting'}], 'answer': 'accept', '_timestamp': 1698746861, '_annotator_id': '2023-10-31_11-07-38', '_session_id': '2023-10-31_11-07-38'}
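
To get merged examples like this back into Prodigy, one option is to write them to a JSONL file and load that file into a fresh dataset. A minimal sketch, where write_jsonl and the file name merged_examples.jsonl are hypothetical and the toy list stands in for the real merged examples:

```python
import json
from pathlib import Path


def write_jsonl(examples, path):
    """Write one JSON object per line (JSONL)."""
    path = Path(path)
    with path.open("w", encoding="utf8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return path


# Stand-in for the merged examples; in practice this would be new_examples.
merged_examples = [
    {"text": "hello my name is james", "answer": "accept",
     "spans": [{"start": 17, "end": 22, "label": "name"},
               {"start": 0, "end": 5, "label": "greeting"}]},
]

out_path = write_jsonl(merged_examples, "merged_examples.jsonl")
```

The file could then be imported with python -m prodigy db-in <new-dataset-name> merged_examples.jsonl before running data-to-spacy on that dataset.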

ewencb · October 31, 2023, 6:03pm

Thank you! I had to modify the merge function to deal with the fact that not every example has a span for a label:

def merge(examples):
    key_values = {}
    for ex in examples:
        key = ex['_input_hash']
        if key in key_values:
            if 'spans' in ex:
                if 'spans' in key_values[key]:
                    key_values[key]['spans'].extend(ex['spans'])
                else:
                    # Copy so the merged example doesn't share a list with ex
                    key_values[key]['spans'] = list(ex['spans'])
        else:
            key_values[key] = ex
    # Return one merged example per input hash, not the unmerged input list
    return list(key_values.values())

I am now unblocked.
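
A quick way to confirm that no labels were dropped is to count span labels before and after merging. This is a minimal sketch; count_labels is a hypothetical helper and the toy data stands in for the real datasets.

```python
from collections import Counter


def count_labels(examples):
    """Count span labels, tolerating examples that have no 'spans' key."""
    return Counter(
        span["label"] for ex in examples for span in ex.get("spans", [])
    )


# Toy data: two single-label passes, one of them without any spans for a text.
before = [
    {"_input_hash": 1, "spans": [{"start": 17, "end": 22, "label": "name"}]},
    {"_input_hash": 1, "spans": [{"start": 0, "end": 5, "label": "greeting"}]},
    {"_input_hash": 2},
]
after = [
    {"_input_hash": 1, "spans": [{"start": 17, "end": 22, "label": "name"},
                                 {"start": 0, "end": 5, "label": "greeting"}]},
    {"_input_hash": 2},
]

counts_before = count_labels(before)
counts_after = count_labels(after)
print(counts_before == counts_after)
```

If the two counters differ, some label was lost or duplicated along the way.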

koaning · November 1, 2023, 9:10am

Happy to hear it. But yeah, I've made an internal ticket to discuss this. Our training recipes make a bunch of assumptions and this issue serves as a nice reminder that they may not always hold.

If you get stuck again, do let me know!

kylebigelow · December 7, 2023, 12:05am

This might be a good segue for an annotation best practices refresher. For NER/spancat, I have always annotated with all of my labels at once. This is the slowest method and I'm sure there is a faster way.

mv3 · December 7, 2023, 2:27am

Although this may classify as another topic entirely, I'm also interested in @kylebigelow's suggestion.

For my use case, assume I have entities A, B, C, D and E that are connected by certain tokens, e.g. a comma (,) or conjunctions like "and" and "&". What would be the best way to treat all of these entities as part of a single span group?

At present, hand-rolling a pattern for EntityRuler (to match A) and then repeating sub-patterns of it in a SpanRuler to cover something like "A & B of the C" seems to involve a lot of redundant work.

To remove the abstraction and use a concrete example from the legal domain: assuming Sec. 1, Sec. 2, and Sec. 3 as part of The Law of Spacy, how would all entities be combined to form a single long span? Put another way, is there an easier way to connect entities and treat the result as a span, separate from their being short entities?

magdaaniol · January 3, 2024, 11:32am

Hi @mv3,

Sorry for the delayed reply! Picking up from your concrete example, I assume that you'd like the whole span "Sec. 1, Sec. 2, and Sec. 3 as part of The Law of Spacy" to be a match?
One thing you could try is to introduce one extra step and label the connectors with rules like:
{"label": "CONNECTOR", "pattern": [{"LOWER": {"IN": [",", "and"]}}]}
{"label": "CONNECTOR", "pattern": [{"LOWER": "as"}, {"LOWER": "part"}, {"LOWER": "of"}]}

Then, in the following step, you could have a rule that matches consecutive entities and connectors, which should cover the entire span you're after:
{"ENT_TYPE": {"IN": ["ENTITY", "CONNECTOR"]}, "OP": "+"}
You probably also want to keep only the longest matches, which the span ruler should let you do with a spans filter.

mv3 · January 7, 2024, 6:28pm

Hi @magdaaniol, thank you for the reply! Yes, I've been using a variant of this approach by merging span ranges that consist of consecutive labels. It looks something like this:

def __call__(self, doc: Doc) -> Doc:
    # Token ranges of the spans that carry one of the component's own labels
    base_ranges: list[tuple[int, int]] = []
    for label in self.labels:
        base_ranges.extend([(s.start, s.end) for s in filter_spans(doc.spans["sc"]) if s.label_ == label])

    # Token ranges of the connecting spans found by the ruler
    link_ranges: list[tuple[int, int]] = []
    for label in ("link", "caption"):
        link_ranges.extend([(s.start, s.end) for s in filter_spans(doc.spans["ruler"]) if s.label_ == label])

    if ranges := base_ranges + link_ranges:
        doc.spans[self.spans_key] = list(
            self.get_spans(doc, merged=self.merge_ranges(ranges), in_range=base_ranges)
        )
    return doc
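
The merge_ranges helper is not shown in the snippet. A minimal sketch of what such a step might look like, assuming it receives (start, end) token ranges with an exclusive end and joins ranges that overlap or touch; the real method may well differ:

```python
def merge_ranges(ranges):
    """Merge (start, end) token ranges that overlap or touch; end is exclusive."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Assumed tokenization: Sec. | 1 | , | Sec. | 2 | , | and | Sec. | 3
entity_ranges = [(0, 2), (3, 5), (7, 9)]
connector_ranges = [(2, 3), (5, 7)]
long_spans = merge_ranges(entity_ranges + connector_ranges)
print(long_spans)
```

Ranges separated by a token that is neither an entity nor a connector stay apart, so unrelated entities elsewhere in the document are not pulled into the long span.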

It took me a while to move away from the NER camp and just shift to spans. I'm interested in best practices of this nature. I find that many of the GitHub discussions/issues and this forum have great gems worth bookmarking.
