Hi! I think this mostly comes down to the training data and what the model considers most likely based on the examples it was trained on. For this particular phrase, "John Smith Sports Center", it's pretty reasonable for the model to end up assigning a higher probability to interpreting "John Smith" as a
PERSON – or, phrased from the model's perspective in token-based tags,
["B-PERSON", "L-PERSON", "O", "O"] ended up winning over
["B-ORG", "I-ORG", "I-ORG", "L-ORG"].
So one solution is simply to provide more training examples of similar constructions where the whole span is tagged as
ORG, so the model gets more evidence for the distinction. Depending on how your data was annotated, you might also want to do a quick audit of the
ORG entities to make sure they're consistent – maybe an annotator got confused and some instances of "John Smith" ended up labelled as a PERSON instead. If your data is inconsistent, the model will have a much harder time. Finally, if you're dealing with very ambiguous entities, you may have to accept some false positives/negatives here – even if your model is 90% accurate, that still means every 10th prediction it makes is wrong. That's pretty normal, and you can often work around it by adding some special-case rules on top to catch incorrect predictions on ambiguous cases.
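If it helps, the extra examples can be written in the usual `(text, {"entities": [...]})` format that spaCy training data uses. The texts and character offsets below are made up for illustration – and since hand-written offsets are easy to get wrong, it's worth sanity-checking that each span actually covers the text you intended:

```python
# Hypothetical targeted examples where the whole facility name is ORG.
TRAIN_DATA = [
    ("John Smith Sports Center opened last week.",
     {"entities": [(0, 24, "ORG")]}),
    ("She trains at the Mary Jones Aquatic Centre.",
     {"entities": [(18, 43, "ORG")]}),
    ("The Bob Brown Arena hosted the finals.",
     {"entities": [(4, 19, "ORG")]}),
]

# Quick audit: print what each offset span actually covers.
for text, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        print(label, "->", text[start:end])
# → ORG -> John Smith Sports Center
# → ORG -> Mary Jones Aquatic Centre
# → ORG -> Bob Brown Arena
```

The same loop works as a cheap consistency audit over your existing annotations – e.g. flagging any PERSON span whose text also shows up inside an ORG span elsewhere in the data.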
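As for the special-case rules, one option is a small post-processing step over the predicted tags. The heuristic and trigger-word list below are entirely made up for illustration: if a predicted PERSON entity is immediately followed by capitalised tokens ending in a facility-ish word, relabel the whole span as ORG.

```python
# Illustrative trigger words – tune this list for your own data.
FACILITY_WORDS = {"Center", "Centre", "Arena", "Stadium", "Gym"}

def patch_person_facilities(tokens, tags):
    """Relabel PERSON + facility-word sequences as a single ORG entity.
    tokens: list of strings; tags: parallel list of BILUO tags."""
    tags = list(tags)
    i = 0
    while i < len(tags):
        # Find the end of a PERSON entity
        if tags[i] in ("U-PERSON", "L-PERSON"):
            end = i
            # Walk back to the entity's first token
            start = end
            while start > 0 and tags[start].startswith(("I-", "L-")):
                start -= 1
            # Collect trailing O-tagged, capitalised tokens
            j = end + 1
            while j < len(tokens) and tags[j] == "O" and tokens[j][:1].isupper():
                j += 1
            if j > end + 1 and tokens[j - 1] in FACILITY_WORDS:
                # Relabel start..j-1 as one ORG entity
                tags[start] = "B-ORG"
                for k in range(start + 1, j - 1):
                    tags[k] = "I-ORG"
                tags[j - 1] = "L-ORG"
            i = j
        else:
            i += 1
    return tags

print(patch_person_facilities(
    ["John", "Smith", "Sports", "Center"],
    ["B-PERSON", "L-PERSON", "O", "O"]))
# → ['B-ORG', 'I-ORG', 'I-ORG', 'L-ORG']
```

A plain PERSON with no facility words after it passes through untouched, so the rule only fires on the ambiguous cases. (In spaCy you could implement the same idea as a custom pipeline component after the NER, or with an EntityRuler for fully pattern-based cases.)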