I’m looking into adding a few entities. I’m in the process of creating a dataset that contains both the existing entities and my new ones.
To do this, I created a tool that runs NER with the en_core_web_lg model and then lets me edit the entities. One problem is that I can't find a way to get the BILUO “value” from a token. token.ent_iob_ gives me the IOB “value”, but I’d like to have tags such as B-ORG I-ORG L-ORG.
Is there an easy way to get those from the tokens? Is it important? Is it a problem if I mix IOB entities and BILUO entities?
There are some helper methods in the spacy.gold module that I think will help you, specifically the functions spacy.gold.iob_to_biluo, and perhaps also the function spacy.gold.biluo_tags_from_offsets.
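To show what the conversion involves, here is a minimal pure-Python sketch that does not call spaCy at all. It assumes you have already collected `(token.ent_iob_, token.ent_type_)` pairs from a `Doc`; the function name `biluo_from_iob` is made up for this example, and the library helpers above are probably the better choice in practice.

```python
def biluo_from_iob(pairs):
    """Turn per-token (iob, label) pairs into BILUO tags.

    `pairs` is assumed to look like
    [(t.ent_iob_, t.ent_type_) for t in doc].
    """
    tags = []
    for i, (iob, label) in enumerate(pairs):
        if iob not in ("B", "I"):
            # "O" or an empty value: the token is outside any entity
            tags.append("O")
            continue
        # The entity continues only if the next token is tagged "I"
        next_iob = pairs[i + 1][0] if i + 1 < len(pairs) else "O"
        continues = next_iob == "I"
        if iob == "B":
            prefix = "B" if continues else "U"
        else:
            prefix = "I" if continues else "L"
        tags.append(prefix + "-" + label)
    return tags


example = [("B", "ORG"), ("I", "ORG"), ("I", "ORG"), ("O", ""), ("B", "GPE")]
print(biluo_from_iob(example))
```

The only information BILUO adds over IOB is whether a token is the last one in its entity, which is why looking one token ahead is enough.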
I would usually recommend storing annotations in a stand-off format, like Prodigy does. Specifically, this means recording the start and end character offsets of each entity, along with the label. The problem with BILUO is that it ties the entity annotation to the tokens, when really the token boundaries are also an annotation. They might be incorrect, and they don't preserve all of the information in the document.
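As a rough illustration of the stand-off idea, the sketch below stores the raw text plus character-offset spans and recovers the entity strings by slicing. The `text`/`spans`/`start`/`end`/`label` key names follow the style Prodigy uses, but treat the exact record layout and the helper `span_texts` as assumptions made for this example.

```python
text = "Apple opened an office in New York City."

# One stand-off record: the untouched text plus character offsets and labels
record = {
    "text": text,
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 26, "end": 39, "label": "GPE"},
    ],
}


def span_texts(record):
    """Recover (entity text, label) pairs by slicing the original text."""
    return [
        (record["text"][span["start"]:span["end"]], span["label"])
        for span in record["spans"]
    ]


print(span_texts(record))
```

Because the offsets refer to characters rather than tokens, the same record stays valid if you later change the tokenizer. In spaCy, the matching values for a predicted entity are `ent.start_char`, `ent.end_char` and `ent.label_`.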
As for mixing the BILUO and IOB encodings, potentially this would be a problem, yes! In the IOB scheme, the tag for a single-word entity is B, whereas BILUO uses U for that case. A lone B would be an invalid sequence in BILUO, since in BILUO all B tags must be followed by I or L.
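To make the incompatibility concrete, here is a small hypothetical checker (the name `is_valid_biluo` is invented for this example and is not a spaCy function) that accepts a tag sequence only if it follows the BILUO rules described above.

```python
def is_valid_biluo(tags):
    """Return True if `tags` is a well-formed BILUO sequence."""
    open_label = None  # label of the entity currently being built, if any
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if prefix in ("I", "L"):
            # I and L must continue an open entity with the same label
            if open_label != label:
                return False
            if prefix == "L":
                open_label = None
        elif prefix in ("B", "U", "O"):
            # B, U and O may not appear while an entity is still open
            if open_label is not None:
                return False
            if prefix == "B":
                open_label = label
        else:
            return False
    # A sequence may not end in the middle of an entity
    return open_label is None


print(is_valid_biluo(["B-ORG", "I-ORG", "L-ORG", "O", "U-GPE"]))  # BILUO
print(is_valid_biluo(["B-ORG", "O"]))  # IOB-style single-token entity
```

The second call is the IOB way of tagging a one-word entity, and it fails the check because the B is never closed by an L.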