Okay, this probably explains a lot! You should definitely reject incomplete entities and any other suggestions that are wrong (even if it's sometimes painful when the model almost got it right). My comment earlier in this thread explains the reasoning behind this in more detail.
So if you've been annotating differently, I'd definitely suggest converting your existing annotations to gold-standard, pre-training your model on that, and then trying the binary workflow again with a fresh dataset. You could also try adding some patterns when you run ner.teach, to make sure the model sees enough positive examples during annotation. For example, some known street names or more abstract patterns for possible street names could work well, e.g. any token + "-" + any token + "avenue" (see the sketch below).
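A minimal patterns file for that could look something like this: it's JSONL with one match pattern per line, using spaCy's token attribute syntax. (The STREET label and the example street name here are just placeholders, so adjust them to your own label scheme.)

```
{"label": "STREET", "pattern": [{"lower": "pennsylvania"}, {"lower": "avenue"}]}
{"label": "STREET", "pattern": [{}, {"orth": "-"}, {}, {"lower": "avenue"}]}
```

The second pattern is the abstract one: the empty dict `{}` matches any token, so it matches any token, a literal "-", any other token and then "avenue" (assuming the tokenizer splits the hyphen off as its own token, which is worth double-checking on your data). You'd then pass the file in via `--patterns` when you start the recipe, something like:

```
prodigy ner.teach street_ner en_core_web_sm your_data.jsonl --label STREET --patterns street_patterns.jsonl
```

(Here `street_ner`, `your_data.jsonl` and `street_patterns.jsonl` are just hypothetical names for your dataset, source and patterns file.)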
This is interesting and definitely something I'd keep an eye on! (Also a nice example of why it's always super important to reason about the data and be familiar with both the language and the domain!)