NER Evaluation set format


I want to make an evaluation set out of the gold annotations I created with ner.make-gold. I want to use this evaluation set as a general set for evaluating and comparing different model outputs, not only those created by Prodigy, but also those produced by other learning algorithms.
Which format should this evaluation set have? Is there a standard format for NER evaluation sets?

Can you give me any hints on what an evaluation set should look like, and where I can find more information about this?

Thanks in advance!


There’s not really a single standard data format people are using. Many tools use a format like this:

Apple|U-ORG is|O a|O company|O

There are many small variations on this format, e.g. some tools expect another column with the POS tag. Another common format has one token per line, and the attributes in columns.
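For illustration, the one-token-per-line column format looks something like this (the exact columns, tag scheme, and separators vary from tool to tool, so treat this as a sketch rather than a spec):

```
Apple   NNP   U-ORG
is      VBZ   O
a       DT    O
company NN    O
```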

I would recommend against storing the data in formats that interleave the tokens and attributes like this. The reason is that these formats remove information from the text by assuming a tokenization. If you later want to change the tokenization, you'll have trouble doing that.
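To make the lossiness concrete, here's a small sketch (not any tool's actual API) showing that once the tokens and tags are interleaved, you can no longer recover the original text reliably:

```python
# Token-interleaved formats bake in a tokenization. Joining the tokens
# back together with spaces is only a guess at the original text.
tagged = "Apple|U-ORG is|O a|O company|O"
tokens = [pair.split("|")[0] for pair in tagged.split()]
reconstructed = " ".join(tokens)
print(reconstructed)  # "Apple is a company"
# But the original could have been "Apple is a company." or had different
# whitespace; the interleaved form can't tell you. Character offsets into
# the raw text don't have this problem.
```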

I would recommend keeping the data in a format similar to Prodigy’s. The main characteristics of the format I’d recommend are:

  • jsonl: Newline delimited json is easy to read, and reasonably computationally efficient. Making one json object out of the whole data is much slower and more memory intensive.

  • Preserve the input: Have a field in the JSON object for the original input, so you know you've always got the original data available. Arguably, it's better to keep the input in a separate database or separate structure, but it's not always necessary to do that.

  • Spans as stand-off annotations: To reference an entity, you should store the start and end character offsets (or perhaps a start offset and a length), as well as the label.
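Putting those three points together, a record in such a format might look like this (a minimal sketch modelled on Prodigy's JSONL format; field names beyond "text" and "spans" are up to you):

```python
import json

# One record per line in the .jsonl file: the raw input text, plus
# stand-off spans whose character offsets index into that text.
record = {
    "text": "Apple is a company",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}
line = json.dumps(record)

# The offsets always let you recover the entity string from the text.
span = record["spans"][0]
print(record["text"][span["start"]:span["end"]])  # "Apple"
```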

Converting into the various formats other tools expect then becomes an arbitrary detail of the experiment script you use with that tool.
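Such a conversion script can be quite short. Here's a hedged sketch that turns stand-off spans into the token|TAG format shown above, using naive whitespace tokenization for illustration (a real script would use the target tool's own tokenizer):

```python
def to_bilou(text, spans):
    """Convert stand-off character spans to whitespace-token|BILOU-tag pairs."""
    out = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for span in spans:
            # Only tag tokens fully inside a span; partial overlaps would
            # need the target tool's tokenizer to resolve properly.
            if start >= span["start"] and end <= span["end"]:
                if start == span["start"] and end == span["end"]:
                    tag = "U-" + span["label"]   # unit-length entity
                elif start == span["start"]:
                    tag = "B-" + span["label"]   # beginning
                elif end == span["end"]:
                    tag = "L-" + span["label"]   # last
                else:
                    tag = "I-" + span["label"]   # inside
        out.append(token + "|" + tag)
    return " ".join(out)

print(to_bilou("Apple is a company",
               [{"start": 0, "end": 5, "label": "ORG"}]))
# Apple|U-ORG is|O a|O company|O
```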