I’m looking into adding a few entities. I’m in the process of creating a dataset that contains both the existing entities and my new ones.

To do this, I created a tool that runs NER with the en_core_web_lg model and then lets me edit the entities. One problem is that I can’t find a way to get the BILUO tag from a token. token.ent_iob_ gives me the IOB value, but I’d like to have tags such as B-ORG, I-ORG, L-ORG.

Is there an easy way to get those from the tokens? Is it important? Is it a problem if I mix IOB entities and BILUO entities?

Hi @areversat,

There are some helper methods in the module that I think will help you, specifically the functions, and perhaps also the function

I would usually recommend storing annotations in a stand-off format, like Prodigy does. Specifically, this means recording each entity’s start and end character offsets, along with its label. The problem with BILUO is that it ties the entity annotation to the tokens, when really the token boundaries are also an annotation: they might be incorrect, and they don’t preserve all of the information in the document.

As for mixing the BILUO and IOB encodings, potentially this would be a problem, yes! In the IOB scheme, the tag for a single-word entity is B. This would be an invalid sequence in BILUO, since in BILUO all B tags must be followed by I or L.
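To make the difference concrete, here is a minimal sketch of re-encoding an IOB tag sequence as BILUO, so that all annotations end up in one scheme. The function name `iob_to_biluo_sketch` is made up for illustration; it is a hand-rolled version written for this example, not spaCy's own helper.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags (e.g. 'B-ORG', 'I-ORG', 'O') to BILUO tags.

    Hypothetical helper for illustration only.
    """
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        prev = tags[i - 1] if i > 0 else "O"
        prev_label = prev.split("-", 1)[1] if prev != "O" else None
        # The entity starts here on an explicit B, or when the label changes.
        starts = prefix == "B" or prev_label != label
        # The entity continues only if the next tag is I- with the same label.
        continues = i + 1 < len(tags) and tags[i + 1] == "I-" + label
        if starts and continues:
            biluo.append("B-" + label)
        elif starts:
            biluo.append("U-" + label)  # single-token entity
        elif continues:
            biluo.append("I-" + label)
        else:
            biluo.append("L-" + label)  # last token of a multi-token entity
    return biluo


iob = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "B-GPE"]
print(iob_to_biluo_sketch(iob))
```

Note how the lone B-GPE becomes U-GPE and the final I-ORG becomes L-ORG. These are exactly the places where a mixed IOB/BILUO dataset would contain invalid sequences.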

First of all, thanks for the advice and for the tools you build.

So if I understand correctly, something like the following:

{'text': 'According to an estimate by Bank of America, something or other', 'entities': [(28, 43, 'ORG')]}

would be enough, and I would let spaCy figure out what it needs to know about the tokens in order to make an accurate prediction.
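For reference, here is a rough sketch of how per-token BILUO tags can be derived from that kind of stand-off record once a tokenization is chosen. The regex tokenizer is only a stand-in assumption for a real tokenizer, `offsets_to_biluo_sketch` is a hypothetical name, and marking misaligned entities with "-" is a choice made for this sketch.

```python
import re


def tokenize(text):
    """Stand-in tokenizer (assumption): words and single punctuation marks,
    returned as (start, end) character spans."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def offsets_to_biluo_sketch(text, entities):
    """Derive BILUO tags from (start, end, label) character offsets."""
    spans = tokenize(text)
    tags = ["O"] * len(spans)
    for ent_start, ent_end, label in entities:
        # Indices of tokens that overlap the entity's character range.
        idx = [i for i, (s, e) in enumerate(spans) if s < ent_end and e > ent_start]
        if not idx:
            continue
        aligned = spans[idx[0]][0] == ent_start and spans[idx[-1]][1] == ent_end
        if not aligned:
            # Entity boundaries do not match token boundaries.
            for i in idx:
                tags[i] = "-"
        elif len(idx) == 1:
            tags[idx[0]] = "U-" + label
        else:
            tags[idx[0]] = "B-" + label
            for i in idx[1:-1]:
                tags[i] = "I-" + label
            tags[idx[-1]] = "L-" + label
    tokens = [text[s:e] for s, e in spans]
    return tokens, tags


example = {
    "text": "According to an estimate by Bank of America, something or other",
    "entities": [(28, 43, "ORG")],
}
tokens, tags = offsets_to_biluo_sketch(example["text"], example["entities"])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

With this tokenizer, "Bank of America" comes out as B-ORG, I-ORG, L-ORG. Splitting on whitespace alone would leave "America," as one token whose end no longer matches offset 43. That is the sense in which token boundaries are themselves an annotation, and why the character offsets are the safer thing to store.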

As it turns out, Prodigy’s ner.make-gold recipe should work for my use case.
