No start and end of span using data-to-spacy after rel.manual

Hi,

I have Prodigy v1.10.2. I used ner.manual to annotate some named entities. The annotations were saved in a dataset in the Prodigy environment, let's call it "myDataset1". Then I used rel.manual with the source dataset:myDataset1 and --label myRel1,myRel2 --span-label myEntity1,myEntity2, with the entities being the same ones I had annotated before using ner.manual.
Then I annotated myRel1 and myRel2 between named entities, and I also changed some of the named entity annotations I had made previously. I saved the dataset as "myDataset2".

Now I want to use the data-to-spacy command. First I tried

python -m prodigy data-to-spacy my\path\train.json my\path\eval.json --lang de --ner myDataset2 --parser myDataset2 --eval-split 0.3

which returned this error:

✘ Invalid data for component 'ner'
spans -> 7 -> start field required
spans -> 7 -> end field required

This error was also previously mentioned here. I exported the dataset using db-out and that worked. I checked the data and indeed, there was one sample with a span annotated only as {'label': 'myEntity1'}, without 'text', 'start', 'end', ...
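For reference, here is a minimal sketch of how such a sample could be located in the exported file. It assumes the db-out export is a JSONL file (one JSON object per line) and that a valid span needs at least 'start' and 'end'. The helper name find_bad_spans and the sample lines are hypothetical and only illustrate the check:

```python
import json

# Keys assumed to be required for a valid span (taken from the error message)
REQUIRED_KEYS = ("start", "end")


def find_bad_spans(lines):
    """Yield (line_number, span) for every span missing a required key."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip empty lines in the export
        eg = json.loads(line)
        # "spans" may be absent or null, so fall back to an empty list
        for span in eg.get("spans") or []:
            if any(key not in span for key in REQUIRED_KEYS):
                yield line_no, span


# Hypothetical in-memory sample standing in for the exported file
sample_lines = [
    '{"text": "a b", "spans": [{"start": 0, "end": 1, "label": "myEntity1"}]}',
    '{"text": "c d", "spans": [{"label": "myEntity1"}]}',
]
for line_no, span in find_bad_spans(sample_lines):
    print("line", line_no, "has an invalid span:", span)
```

With a real export, the same function could be called on an open file handle instead of the sample list, since it only iterates over lines.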

I also tried to create the spaCy datasets separately, which worked for the parser:

python -m prodigy data-to-spacy my\path\train.json my\path\eval.json --lang de --parser myDataset2 --eval-split 0.3

but gave me the same error for NER using:

python -m prodigy data-to-spacy my\path\train.json my\path\eval.json --lang de --ner myDataset2 --eval-split 0.3

How can I find and exclude that one sample from myDataset2, so that I can try data-to-spacy again? Thanks for the help.

Hi! Do you know which version of Prodigy you created the annotations with? The same version you're currently using? It seems like for some reason, you ended up with an invalid span here.

The easiest way to find and exclude it would be to go over your data, check whether each span includes a start and end, and keep only the valid spans in a new dataset:

from prodigy.components.db import connect

# Connect to the database configured for Prodigy
db = connect()
examples = db.get_dataset("myDataset2")
filtered_examples = []
for eg in examples:
    if "spans" in eg:
        # Keep only the spans that have both a start and an end offset
        new_spans = []
        for span in eg["spans"]:
            if "start" not in span or "end" not in span:
                print("Found bad span:", span)
            else:
                new_spans.append(span)
        eg["spans"] = new_spans
    # Keep every example, including those without any spans
    filtered_examples.append(eg)

# Add filtered examples to new dataset
db.add_dataset("myDataset2_filtered")
db.add_examples(filtered_examples, ["myDataset2_filtered"])

Thanks for your response!

Yes it was the same version, v1.10.2.

This works perfectly, thank you very much!

Hi, I'm encountering a similar error. Using rel.manual, I occasionally get invalid spans in the annotations. A span is expected to look like the following:
{"start":42,"end":63,"token_start":9,"token_end":12,"label":"CAPAB"}
But there are occasional span entries hidden among the span annotations that only contain a label:
{"label":"CAPAB"}
Seeing if I can replicate the error.
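
To narrow down when these label-only spans appear, a small tally of invalid spans per label could help. The following is only a sketch: the function name summarize_invalid_spans and the sample examples are hypothetical, and it assumes each example is a dict with an optional "spans" list, as in the spans shown above:

```python
from collections import Counter


def summarize_invalid_spans(examples, required=("start", "end")):
    """Count invalid spans, grouped by (label, missing keys)."""
    counts = Counter()
    for eg in examples:
        # "spans" may be absent or None, so fall back to an empty list
        for span in eg.get("spans") or []:
            missing = tuple(key for key in required if key not in span)
            if missing:
                counts[(span.get("label"), missing)] += 1
    return counts


# Hypothetical examples mirroring the two span shapes shown above
examples = [
    {"spans": [{"start": 42, "end": 63, "token_start": 9,
                "token_end": 12, "label": "CAPAB"}]},
    {"spans": [{"label": "CAPAB"}]},
]
for (label, missing), n in summarize_invalid_spans(examples).items():
    print(label, "is missing", missing, "in", n, "span(s)")
```

The examples could also come from db.get_dataset, as in the quick fix above.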

For now, intending to use the above quick fix.

ver: 1.10.8


If you can find a reproducible example/workflow, that would be super helpful :pray: It seems like there's some issue with how spans are added/removed that only occurs under certain conditions, given that it hasn't come up a lot.