Manual Span annotations seemingly disappearing when converting to spacy

hi @Amandine_Lbs!

Thanks for your question and welcome to the Prodigy community :wave:

Sorry to hear about the issue.

I'm scratching my head because nothing obvious jumps out, so let's run some checks to account for the annotations at each step.

Just curious - can you run db-out first and inspect that output .jsonl?

prodigy db-out my_dataset > my_dataset.jsonl

Then you can run basic stats:

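import srsly

# read_jsonl returns a generator, so turn it into a list to be able to call len()
examples = list(srsly.read_jsonl("my_dataset.jsonl"))

span_cnt = 0
for eg in examples:
    # examples without annotations may have no "spans" key
    for span in eg.get("spans", []):
        span_cnt += 1

print(span_cnt)
print(len(examples))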

Alternatively, you can try to pull all of the examples directly from your database using:

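from prodigy.components.db import connect

# connect() uses the database settings from your prodigy.json
db = connect()
examples = db.get_dataset("my_dataset")

span_cnt = 0
for eg in examples:
    # examples without annotations may have no "spans" key
    for span in eg.get("spans", []):
        span_cnt += 1

print(span_cnt)
print(len(examples))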

I think it's important to do this first, to catch any missing spans/records before running data-to-spacy or custom scripts.

If the records are all there, at least you know they were saved in the database. Then you can be more confident that the loss happens somewhere in the conversion process.

If you don't know the exact count, perhaps you can run assert statements in the loop, looking specifically for 1 or 2 examples that you're confident should be in your data like "44 totems équipés".
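Here's a minimal sketch of that kind of check. It assumes each record has a "text" key and an optional "spans" list with character offsets in "start" and "end". The helper name find_span_text, the sample records and the label are made up for illustration, so swap in the examples loaded from your own dataset.

```python
def find_span_text(examples, needle):
    """Return the examples that have at least one span whose text equals `needle`."""
    hits = []
    for eg in examples:
        text = eg.get("text", "")
        # "spans" can be missing (or None) on unannotated examples
        for span in eg.get("spans") or []:
            if text[span["start"]:span["end"]] == needle:
                hits.append(eg)
                break
    return hits


# Hypothetical records for illustration only; the label name is invented.
sample = [
    {
        "text": "Il y a 44 totems équipés sur le site.",
        "spans": [{"start": 7, "end": 24, "label": "EQUIPMENT"}],
    },
    {"text": "Aucune annotation ici."},
]

# Fails loudly if the span you expect is not in the data
assert find_span_text(sample, "44 totems équipés"), "expected span is missing"
```

You can run the same assert on the db-out export and on the examples pulled from the database, to see at which step the span goes missing.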

For data-to-spacy, you can run spacy debug data on your spacy binary files / config to get some stats:

spacy debug data config.cfg --paths.train train.spacy --paths.dev dev.spacy

It reports example counts and the number of spans per label. It would be helpful if you could share those too, so we can compare them with your earlier numbers.
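To get comparable numbers on the Prodigy side, you could count spans per label in the exported examples. This is only a sketch: count_spans_by_label is a hypothetical helper, and it assumes each span dict has a "label" key.

```python
from collections import Counter


def count_spans_by_label(examples):
    """Count annotated spans per label across a list of Prodigy-style examples."""
    counts = Counter()
    for eg in examples:
        # "spans" can be missing (or None) on unannotated examples
        for span in eg.get("spans") or []:
            counts[span.get("label")] += 1
    return counts


# Usage with the db-out export from above:
# import srsly
# examples = list(srsly.read_jsonl("my_dataset.jsonl"))
# print(count_spans_by_label(examples))
```

The debug output covers the train and dev files separately, so compare the Prodigy counts against both of them together.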

For the conversion back to .jsonl from binary, I don't see any major problems in your code (I did notice you have spankey as an argument but it isn't used). Just to make sure, you can also try this snippet to convert the binary files to .jsonl:
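A rough sketch of such a conversion, under a few assumptions: the spans are stored either in doc.ents (NER) or in doc.spans under the key "sc" (spaCy's default key for the span categorizer), and the language is French. The function names and file paths are placeholders, not part of Prodigy or spaCy.

```python
import json


def doc_to_example(doc, spans_key="sc"):
    """Convert one spaCy Doc (or Doc-like object) to a Prodigy-style dict."""
    # NER annotations live in doc.ents, spancat annotations in doc.spans[spans_key]
    spans = list(doc.ents)
    if spans_key in doc.spans:
        spans.extend(doc.spans[spans_key])
    return {
        "text": doc.text,
        "spans": [
            {"start": s.start_char, "end": s.end_char, "label": s.label_}
            for s in spans
        ],
    }


def docbin_to_jsonl(spacy_path, jsonl_path, lang="fr", spans_key="sc"):
    """Read a .spacy file and write one JSON record per doc."""
    # Imported here so doc_to_example can be used without spaCy installed
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # only the vocab is needed to rebuild the docs
    doc_bin = DocBin().from_disk(spacy_path)
    with open(jsonl_path, "w", encoding="utf8") as f:
        for doc in doc_bin.get_docs(nlp.vocab):
            record = doc_to_example(doc, spans_key)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example call (paths are placeholders):
# docbin_to_jsonl("train.spacy", "train.jsonl")
```

If the output has fewer spans than your own script produces (or the other way around), the spans key is the first thing I'd check.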

So if you can check for these:

  • Prodigy database counts: db-out and/or DB components to check for spans
  • spacy debug data: check for spaCy binary files from data-to-spacy
  • Run alternative binary to .jsonl conversion along with your script

Hopefully, you'll find out if there's any loss between any step. Let me know what you find!