Can't load pre-annotated NER JSONL


I am trying to load a pre-annotated file into Prodigy. The file was converted from a doccano annotation JSONL export. I have generated the file as a list of lines in the format below:
{"text": "English is required.", "spans": [{"start": 0, "end": 6, "label": "language"}]}
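A quick sanity check on this format (a minimal sketch; the field names match the example above) is to verify that each span's offsets actually slice out the intended substring. Since `end` is used as an exclusive offset in Python slicing, this also makes off-by-one errors visible:

```python
import json

def check_spans(jsonl_lines):
    """Yield (label, surface) pairs so you can eyeball that the offsets are right."""
    for line in jsonl_lines:
        task = json.loads(line)
        for span in task.get("spans", []):
            yield span["label"], task["text"][span["start"]:span["end"]]

lines = ['{"text": "English is required.", "spans": [{"start": 0, "end": 6, "label": "language"}]}']
for label, surface in check_spans(lines):
    print(label, repr(surface))
```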

I am able to run prodigy db-in and even prodigy train ner with this file, but I can't use the annotation tools, such as ner.manual. It says there is no label in my file.

Also, I would like to load these annotated samples and, later on, load all the other samples I need to annotate.

I have searched this forum and googled a lot of variations of "prodigy load ner annotations" and the error messages, with no luck.

Any idea?

Thank you

Hi! Your format looks correct :+1: What's the exact error message that you're seeing when you're loading your example into ner.manual? Are you sure the error refers to the file, and not the labels defined on the command line via the --label argument?

After some testing with another dataset, I realized two things were happening:

  • ner.manual doesn't extract the labels from the spans, so passing the --label argument is effectively mandatory;
  • datasets imported with db-in from pre-annotated texts come with the accepted field set to true, and ner.manual doesn't show anything that is already accepted.

I have made a script to convert annotated doccano files to spaCy-style JSONL, and another one to fetch more data to be annotated with Prodigy while excluding the samples already annotated with doccano.
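For anyone else landing here, a minimal sketch of such a conversion (assuming the doccano export puts entities in a `labels` field of `[start, end, label]` triples; adjust the key if your export uses `label` or `entities` instead):

```python
import json

def doccano_to_prodigy(line):
    """Convert one doccano JSONL record to Prodigy's span format.

    Assumes doccano's export shape: {"text": ..., "labels": [[start, end, label], ...]}.
    """
    record = json.loads(line)
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in record.get("labels", [])
    ]
    return json.dumps({"text": record["text"], "spans": spans})

print(doccano_to_prodigy('{"text": "English is required.", "labels": [[0, 7, "language"]]}'))
```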
Now, everything works as expected. Awesome product!

Thanks for the update, glad it's working! :tada:

Ah yes, you should always pass in the --label argument with the labels you want to annotate – if not, Prodigy will try to read them from the model and if that doesn't have an entity recognizer, it will show you an error because it doesn't know what the label set is.

The data you load in is read as a stream, so Prodigy can't really know what labels are going to be in the data when you start the server – otherwise it'd have to parse the entire file upfront.

There's typically no need to pre-import anything you want to annotate – that's only needed for existing annotations you want to keep. Prodigy will assign hashes to the examples and will automatically skip examples that are already in the database and answered.
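The deduplication idea can be sketched in plain Python. Prodigy computes `_input_hash` and `_task_hash` internally; this is just an illustration of skipping already-answered inputs, not Prodigy's actual hashing code:

```python
import hashlib

def input_hash(task):
    # Hash only the input text, as a simplified stand-in for Prodigy's _input_hash
    return hashlib.md5(task["text"].encode("utf8")).hexdigest()

# Hashes of examples that are already in the database and answered
answered = {input_hash({"text": "English is required."})}

stream = [{"text": "English is required."}, {"text": "French is a plus."}]

# Only examples whose hash hasn't been seen get sent out for annotation
todo = [t for t in stream if input_hash(t) not in answered]
print([t["text"] for t in todo])
```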

It looks like the real problem lies in tokens that are misaligned with the spans. Using spaCy's biluo_tags_from_offsets, I found out that 54 of my annotated documents have misaligned tokens/spans. I have tried to use --highlight-chars with ner.manual, but it still throws an exception, while the Prodigy web page shows a temporary red alert ("ERROR: can't fetch tasks. Make sure the server is running....") and an "Oops, something went wrong :(" message on the main screen.
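The same kind of misalignment check can be done against any tokenization. A simplified sketch using naive whitespace tokens (spaCy's `biluo_tags_from_offsets` does this properly against spaCy's tokenizer; this is just to show the idea):

```python
def misaligned_spans(text, spans):
    """Return spans whose character offsets don't land on token boundaries.

    Uses naive whitespace tokenization as a stand-in for a real tokenizer.
    """
    boundaries = set()
    offset = 0
    for token in text.split():
        start = text.index(token, offset)
        boundaries.add(start)
        boundaries.add(start + len(token))
        offset = start + len(token)
    return [s for s in spans if s["start"] not in boundaries or s["end"] not in boundaries]

text = "English is required."
# end=6 stops inside the token "English" (0-7), so this span is flagged
print(misaligned_spans(text, [{"start": 0, "end": 6, "label": "language"}]))
```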

Is there a way to completely disable token misalignment checks in all recipes? I don't plan to use spaCy tokenizers.
Btw, I'm using pt_core_news_sm as the base model, but have also tested blank:pt, blank:en and even a huge fastText model for Portuguese that I created from domain text and converted to a spaCy model.

Just figured it out. If I point ner.manual at the dataset previously created by db-in from the pre-annotated JSONL, it won't work. I have to point ner.manual at a new dataset and pass --highlight-chars to it. There's no need to db-in first, as stated in the NER documentation, Quickstart, under "I already have annotations and just want to train a model".

Thanks for updating and yes, that's correct! :slightly_smiling_face:

I need to double-check this, but if pre-annotated spans with --highlight-chars raise token misalignment errors, that's definitely something we should fix. In that case, it should just align the spans to the characters, which will always succeed. (In the meantime, you could work around this by passing in data with "tokens" including an entry for each character. Not that pretty, but that property will be removed from the data saved to the DB.)
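The per-character `"tokens"` workaround could look roughly like this (a sketch only; the exact token fields Prodigy expects may include more, e.g. a `"ws"` flag, so treat the shape as an assumption):

```python
def char_tokens(text):
    """Build a Prodigy-style tokens list with one token per character,
    so that any character offset falls on a token boundary."""
    return [
        {"text": ch, "start": i, "end": i + 1, "id": i}
        for i, ch in enumerate(text)
    ]

task = {"text": "English is required.", "spans": [{"start": 0, "end": 6, "label": "language"}]}
task["tokens"] = char_tokens(task["text"])
print(task["tokens"][:2])
```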

Or, if you know which tokenization you're going to be using, you might as well use that to pre-tokenize the text so you know you're always creating consistent data and token-based tags (assuming the goal is training an NER model – otherwise it may be less relevant).
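Pre-tokenizing could mean shipping a `"tokens"` list built with your own tokenizer and adding token indices to each span. A whitespace-based sketch (the `token_start`/`token_end` span fields follow Prodigy's convention of inclusive token indices; the helper names here are made up for illustration):

```python
def pretokenize(text):
    """Produce a simple whitespace-based tokens list with character offsets."""
    tokens, offset = [], 0
    for i, tok in enumerate(text.split()):
        start = text.index(tok, offset)
        tokens.append({"text": tok, "start": start, "end": start + len(tok), "id": i})
        offset = start + len(tok)
    return tokens

def add_token_indices(span, tokens):
    """Attach inclusive token_start/token_end indices to a character-offset span.

    Raises StopIteration if the span doesn't align with the tokenization.
    """
    span = dict(span)
    span["token_start"] = next(t["id"] for t in tokens if t["start"] == span["start"])
    span["token_end"] = next(t["id"] for t in tokens if t["end"] == span["end"])
    return span

tokens = pretokenize("English is required.")
print(add_token_indices({"start": 0, "end": 7, "label": "language"}, tokens))
```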

I have just seen the server crash after processing (with ner.manual and --highlight-chars) the 88th document in a pre-annotated (spans only) JSONL file. I repeated it with a new dataset, typing (a)ccept quickly, and the same error (an exception and server crash) happened at the same point. If you want, I can send you this file so you can debug it.

Sure, if you're able to share that file or just the one example, that would be helpful :+1: And by crashing, do you mean a Python error? Is there any message?