No start and end of span using data-to-spacy after rel.manual

LBoss · September 7, 2020, 1:57pm

Hi,

I have Prodigy v1.10.2. I used ner.manual to annotate some named entities, which were saved in a dataset in the Prodigy environment; let's call it "myDataset1". Then I used rel.manual with the source dataset:myDataset1 and --label myRel1,myRel2 --span-label myEntity1,myEntity2, where the entities are the same ones I had annotated before with ner.manual.
Then I annotated myRel1 and myRel2 between Named Entities, and I also changed some annotations of the Named Entities which I previously made. Saved the dataset as "myDataset2".

Now I want to use the data-to-spacy command. First I tried:

python -m prodigy data-to-spacy my\path\train.json my\path\eval.json --lang de --ner myDataset2 --parser myDataset2 --eval-split 0.3

This returned the following error:

✘ Invalid data for component 'ner'
spans -> 7 -> start field required
spans -> 7 -> end field required

This error was also previously mentioned here. I exported the dataset using db-out and that worked. I checked the data and indeed, there was one sample that had annotated only {'label': 'myEntity1'} without 'text', 'start', 'end',...
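A minimal sketch of how such a sample could be located in the exported data, assuming the db-out export is a JSONL file with one annotation task per line and spans stored under a "spans" key. The function name find_bad_spans and the file path in the usage note are hypothetical, not part of Prodigy:

```python
import json

# The two fields the error message reports as missing.
REQUIRED_KEYS = ("start", "end")


def find_bad_spans(jsonl_path):
    """Return (line_number, span_index, span) for every span lacking start/end."""
    problems = []
    with open(jsonl_path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines in the export
            eg = json.loads(line)
            # "spans" may be absent or null on tasks without entities
            for i, span in enumerate(eg.get("spans") or []):
                if any(key not in span for key in REQUIRED_KEYS):
                    problems.append((line_no, i, span))
    return problems
```

Calling find_bad_spans("myDataset2.jsonl") on the export (the path is a placeholder) would list the line and span index of each incomplete span, which should make the offending sample easy to spot.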

I also tried to create the spaCy datasets separately, which worked for the parser:

python -m prodigy data-to-spacy my\path\train.json my\path\eval.json --lang de --parser myDataset2 --eval-split 0.3

But it gave me the same error for the NER when using:

python -m prodigy data-to-spacy my\path\train.json my\path\eval.json --lang de --ner myDataset2 --eval-split 0.3

How can I find and exclude that one sample from myDataset2, so that I can try data-to-spacy again? Thanks for the help.

ines · September 7, 2020, 7:45pm

Hi! Do you know which version of Prodigy you created the annotations with? The same version you're currently using? It seems like for some reason, you ended up with an invalid span here.

The easiest way to find and exclude it would be to go over your data, check whether each span includes a start and end, and keep only the valid spans in a new dataset:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("myDataset2")
filtered_examples = []
for eg in examples:
    if "spans" in eg:
        new_spans = []
        for span in eg["spans"]:
            if "start" not in span or "end" not in span:
                print("Found bad span:", span)
            else:
                new_spans.append(span)
        eg["spans"] = new_spans
    filtered_examples.append(eg)

# Add filtered examples to new dataset
db.add_dataset("myDataset2_filtered")
db.add_examples(filtered_examples, ["myDataset2_filtered"])

LBoss · September 8, 2020, 9:37am

Thanks for your response!

Yes it was the same version, v1.10.2.

This works perfectly, thank you very much!

vinitrinh · May 4, 2021, 8:51am

Hi, I'm encountering a similar error. Using rel.manual, I occasionally get invalid spans in the annotations. A span is expected to look like this:
{"start":42,"end":63,"token_start":9,"token_end":12,"label":"CAPAB"}
But there are random span entries hidden in the span annotations that look like this:
{"label":"CAPAB"}
Seeing if I can replicate the error.

For now, intending to use the above quick fix.

ver: 1.10.8

ines · May 5, 2021, 9:57am

If you can find a reproducible example/workflow, that would be super helpful. It seems like there's some issue with how spans are added/removed that only occurs under certain conditions, given that it hasn't come up a lot.
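As a possible aid for narrowing down those conditions, here is a rough sketch that summarizes which examples contain incomplete spans. It assumes the examples are plain dicts as returned by db.get_dataset, with optional "spans", "text" and "answer" keys; the helper name summarize_bad_spans is hypothetical:

```python
from collections import Counter


def summarize_bad_spans(examples):
    """Count spans missing start/end per label and collect the affected examples."""
    by_label = Counter()
    affected = []
    for eg in examples:
        spans = eg.get("spans") or []
        bad = [s for s in spans if "start" not in s or "end" not in s]
        if bad:
            # Tally incomplete spans by their label (if they have one)
            by_label.update(s.get("label", "<no label>") for s in bad)
            affected.append(
                {
                    "text": eg.get("text"),
                    "answer": eg.get("answer"),
                    "n_spans": len(spans),
                    "bad_spans": bad,
                }
            )
    return by_label, affected
```

Looking at the texts in the second return value alongside what was done to them during annotation (for instance, whether a span was removed or relabelled) might hint at the steps that trigger the issue.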
