Trying to link words in two spans to form one entity in Prodigy

In the above example, due to some business-specific conventions, I have an entity that is split into two spans. When I try to grab them both in manual NER, Prodigy selects the entire stretch across the two spans, not just the word cluster in red… Is there some way to chain two disjoint words into one entity in the JSONL file (if we try to do it inline), for example with some BILOU pattern like B OOOOOOOOOOOOOOOOOOIE? This may not even work, given that we are shifting and popping words onto a buffer in spaCy and they are very far apart, in terms of teaching a custom NER. But I figured I would ask.
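For reference, this is roughly what I mean in the JSONL file – the label names and character offsets here are just placeholders, and right now the best I can do is two separate spans:

```python
# Placeholder JSONL task – label names and character offsets are made up.
import json

task = {
    "text": "Amount due ......... 1500 USD",
    "spans": [
        # Each span is one contiguous character range, so the only way I can
        # see to annotate this is as two separate spans, not one entity:
        {"start": 0, "end": 10, "label": "AMOUNT_HEADER"},
        {"start": 21, "end": 29, "label": "AMOUNT"},
    ],
}
print(json.dumps(task))
```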

I hope I understand your question correctly – but I think the underlying issue here is that the model (and pretty much any computational process) will read the text in from top to bottom, left to right, character by character. The two spans may be aligned visually – but to the machine, they’re far apart and simply not a sequence.

Entity spans are defined as a sequence of tokens, and that’s what the entity recognizer is trying to predict – so something like B-O-O-O-I-L wouldn’t be considered a valid entity sequence. Predicting O after B would be considered an illegal move. (For more details on transition-based NER, you might want to check out this video.)
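To make that concrete, here’s a small sketch using spaCy’s BILUO helper – the example text and labels are just made up, but it shows why a gap can’t live inside a single entity:

```python
# Small sketch of why a gap can't be part of one entity, using spaCy's
# BILUO helper – the example text and labels are made up.
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("Total amount due 1500 USD")

# A contiguous span ("1500 USD") maps to a well-formed B-...-L sequence:
print(offsets_to_biluo_tags(doc, [(17, 25, "AMOUNT")]))
# ['O', 'O', 'O', 'B-AMOUNT', 'L-AMOUNT']

# There's no tag sequence for "Total ... USD" as a single entity: a B- tag
# has to be followed by I-/L-, so O in the middle is illegal. The closest
# you can get is two separate entities:
print(offsets_to_biluo_tags(doc, [(0, 5, "HEADER"), (22, 25, "CURRENCY")]))
# ['U-HEADER', 'O', 'O', 'O', 'U-CURRENCY']
```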

Is there a specific reason you want the column heading (?) included in the entity span? If your example is representative of the type of data you’re working with, a purely NER-based approach seems a bit suboptimal. The texts are only short fragments, and what you consider a “sequence” isn’t even an actual sequence.

So you might want to experiment with only predicting more generic concepts and then using the surrounding token context to resolve those back to their headings – see the sketch below for a rough idea, and this thread for some more ideas and inspiration. You could also try adding a pre-processing step that reformats your raw text so that the raw word order matches the logical word order. Finally (this is more experimental), you could try framing the problem as a computer vision task (!): predict the information based on its position in the document or section, and then extract the text content from the predicted bounding boxes afterwards.
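As a very rough illustration of the “generic concepts + context” idea, you could do something like this after prediction – the labels and the nearest-preceding-heading heuristic here are just assumptions, not a ready-made solution:

```python
# Rough sketch: predict generic concepts (e.g. VALUE, HEADER), then resolve
# each value back to the closest preceding header. Labels and heuristic are
# made up for illustration.
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Amount due 1500 USD")

# Pretend these came out of a trained NER model:
doc.ents = [
    Span(doc, 0, 2, label="HEADER"),  # "Amount due"
    Span(doc, 2, 4, label="VALUE"),   # "1500 USD"
]

def resolve_values(doc):
    """Pair each VALUE entity with the closest HEADER entity before it."""
    headers = [ent for ent in doc.ents if ent.label_ == "HEADER"]
    pairs = []
    for ent in doc.ents:
        if ent.label_ != "VALUE":
            continue
        preceding = [h for h in headers if h.end <= ent.start]
        if preceding:
            pairs.append((preceding[-1].text, ent.text))
    return pairs

print(resolve_values(doc))  # [('Amount due', '1500 USD')]
```

In a real setup you’d probably want to bring in position or layout information as well, rather than relying on linear token order alone.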