Training a model on annotations imported via db-in yields 0.00 for all Scorer output

I wrote a small program that converts an Excel file to a JSON file that follows the convention of JSONL files exported via the db-out command, including generating an input hash for each "record" (though I have also tried a version of the file without hashes, input or otherwise).
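For illustration, each record my program emits follows roughly this shape (the text, label, and hash values here are made up, not my actual data; db-out writes each record as a single JSON object per line):

```json
{
  "text": "Berlin is nice",
  "tokens": [
    {"text": "Berlin", "start": 0, "end": 6, "id": 0, "ws": true},
    {"text": "is", "start": 7, "end": 9, "id": 1, "ws": true},
    {"text": "nice", "start": 10, "end": 14, "id": 2, "ws": false}
  ],
  "spans": [
    {"start": 0, "end": 6, "token_start": 0, "token_end": 0, "label": "GPE"}
  ],
  "answer": "accept",
  "_input_hash": -123456789
}
```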

When I feed this file into db-in, the import succeeds and it recognizes the correct number of annotations (500 in this case). I then run prodigy train ./ --ner {dataset name}, which starts training the model (and also indicates that the imported data was formatted correctly; I've noticed it will throw a validation error at this point if a field has an invalid value, for example). The Scorer output shows 0.00 in every column, which suggests to me that there is something wrong with the JSON file I am feeding into db-in.

I also tried providing a base model (prodigy train ./ --ner {dataset name} --base-model en_core_web_sm), but this also yields 0.00 for everything, including Score.

Finally, I tried exporting a dataset with annotations using db-out, then immediately importing it back in as a new dataset via db-in, and training on it with the en_core_web_sm base model. This yields non-zero values in the Scorer output, so the round trip itself appears to be a fully supported workflow.
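Concretely, the round trip looked like this (dataset names are placeholders):

```
prodigy db-out source_dataset > ./annotations.jsonl
prodigy db-in new_dataset ./annotations.jsonl
prodigy train ./ --ner new_dataset --base-model en_core_web_sm
```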

I would prefer not to upload the JSON file I am feeding into db-in, partially because I have tried several permutations with different properties included/omitted.

First, I'd like to verify that taking a file generated by db-out and feeding it into db-in (into a new dataset, say) should work. If that workflow is supported, then what I am trying to do with this JSON file generated from pre-annotated data should also work. Given that db-in reports 500 annotations imported but training outputs 0.00, it seems that my generated JSON file either contains a property it shouldn't, or is missing one it needs. If I could see an example of externally-annotated JSON that I can import into Prodigy for use with the NER recipes, that would be helpful. I see the example here, but that doesn't seem to be for NER specifically. Thank you!

Hi @jspinella !

Perhaps the issue is in the converter itself. Make sure you're setting the correct token and span indices for training. My hunch is that training can't "see" the tokens because the indices provided are wrong or out of range.
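For example, for the text "I like London", the character offsets and token indices have to line up like this (the label is illustrative; span start/end are character offsets with an exclusive end, while token_start/token_end refer to token ids and are inclusive):

```json
{
  "text": "I like London",
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0},
    {"text": "like", "start": 2, "end": 6, "id": 1},
    {"text": "London", "start": 7, "end": 13, "id": 2}
  ],
  "spans": [
    {"start": 7, "end": 13, "token_start": 2, "token_end": 2, "label": "GPE"}
  ]
}
```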

Another suggestion: since you're already working from an Excel file, might it be easier to convert it to CSV and use the supported CSV loader instead?
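For instance, something along these lines (dataset, model, and labels are placeholders):

```
prodigy ner.manual your_dataset blank:en ./your_file.csv --loader csv --label ORG,PRODUCT
```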

Thank you for the suggestions! One of the columns in the file needs to be cleaned, so I opted not to use Prodigy's CSV loader. To be clear, I am manually converting the XLSX to CSV, then loading that into a small program that converts it to a Prodigy-formatted JSON file. Just a little PoC for now.
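(For context, the tokenization step of such a converter might look roughly like this. This is a simplified sketch, not my exact code; the "text" column name is a placeholder and span handling is omitted.)

```python
import csv
import json
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components needed

def convert(csv_path: str, jsonl_path: str) -> None:
    with open(csv_path, newline="", encoding="utf-8") as f_in, \
         open(jsonl_path, "w", encoding="utf-8") as f_out:
        for row in csv.DictReader(f_in):
            text = row["text"]  # placeholder column name
            doc = nlp(text)
            tokens = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text),
                 "id": t.i, "ws": bool(t.whitespace_)}
                for t in doc
            ]
            # spans would be built from the annotation columns; omitted here
            task = {"text": text, "tokens": tokens, "answer": "accept"}
            f_out.write(json.dumps(task) + "\n")
```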

I have double-checked the token indices and the span indices. They look correct to me (comparing against a JSONL file generated by prodigy db-out from a simple test model that scores 1.00). It looks like the issue was related to the indices, more specifically the presence of "ws": true/false for each token. The way I have it, ws is true for every token except the last one in a sentence (where no whitespace character follows it). This is how the JSONL from prodigy db-out looks, but it seems odd to me: I wouldn't think the whitespace that comes after a token, sitting between two tokens, should be counted as part of the token.

I'm not finding anything on this "ws" property in the JSON/JSONL documentation, but it looks like it relates to whitespace, perhaps denoting whether the character at the token's end index is a whitespace character?

Presumably I could exclude the whitespace between two words (tokens) when calculating the index values. For example, given "Apple has released a new Mac", the index values for "Apple" would be 0 and 4 with ws: false, and "has" would pick up with 6 and 8... rather than how db-out would have it: 0 and 5 with ws: true. I'm not sure whether it's a tomayto-tomahto situation or whether this would hurt model accuracy...
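To make the two variants concrete for the first two tokens (extrapolating the db-out side from what I see in exported files):

```
# my proposed style (whitespace excluded from the end offset):
{"text": "Apple", "start": 0, "end": 4, "id": 0, "ws": false}
{"text": "has", "start": 6, "end": 8, "id": 1, "ws": false}

# db-out style:
{"text": "Apple", "start": 0, "end": 5, "id": 0, "ws": true}
{"text": "has", "start": 6, "end": 9, "id": 1, "ws": true}
```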

Hi @jspinella ,

If you're using NER, you don't need to provide the ws value in the JSONL file. You can refer to this task format. The token and span indices are enough.
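For example, taking your sentence from above, a task without any ws values could look like this (the ORG label is illustrative):

```json
{
  "text": "Apple has released a new Mac",
  "tokens": [
    {"text": "Apple", "start": 0, "end": 5, "id": 0},
    {"text": "has", "start": 6, "end": 9, "id": 1},
    {"text": "released", "start": 10, "end": 18, "id": 2},
    {"text": "a", "start": 19, "end": 20, "id": 3},
    {"text": "new", "start": 21, "end": 24, "id": 4},
    {"text": "Mac", "start": 25, "end": 28, "id": 5}
  ],
  "spans": [
    {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "ORG"}
  ],
  "answer": "accept"
}
```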