@wpm Thanks for this; it’s quite thought-provoking.
I’m often torn between two perspectives. On the one hand, I think: isn’t it amazing what we can do with ML and NLP? Things that used to be pilot studies for multi-year research projects can now be shipped by one or two people, often with fairly minimal experience. But other times I think: this stuff barely works, none of it makes sense, and my best advice often boils down to “I don’t know, have you tried turning it off and on again?”. It’s sort of a Lovecraftian thing: sometimes it feels like I’m in a world of stability and order, but then I hear whispers from the void that tell of the madness that lies beneath.
In Prodigy we’ve tried to make the “happy path” quite painless, while also making sure that all the underlying bits are exposed, so you can swap things out and play with the internals. We want to give people a command that’s like, “Train the text classifier on your batch of annotations”. So we build a model that’s usually good for that, and give it sensible defaults. But that’s not the last word on text classification! Sometimes you do need to do something completely different. That’s why we have these wrappers. We don’t even want to tie the library to spaCy — we want it to work with any other solution you provide, so long as that solution has the right capabilities.
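To make “the right capabilities” concrete, here’s a deliberately simple sketch of the shape such a swapped-in solution might take. None of these names are Prodigy’s real API; it’s just plain Python illustrating the contract of “something that can score texts and learn from annotations”:

```python
# Hypothetical sketch, NOT Prodigy's actual interface: a stand-in
# "model" that can score a text and update itself from an annotation.

class KeywordTextClassifier:
    """A toy replacement for a real text classifier."""

    def __init__(self, keywords):
        self.keywords = set(keywords)

    def predict(self, text):
        # Score a text as the fraction of known keywords it mentions.
        hits = sum(1 for kw in self.keywords if kw in text.lower())
        return hits / max(len(self.keywords), 1)

    def update(self, text, answer):
        # A real model would do a gradient update here; this toy one
        # just grows its keyword list from accepted examples.
        if answer == "accept":
            self.keywords.update(text.lower().split())


model = KeywordTextClassifier(["spacy", "prodigy"])
score = model.predict("Training a classifier with Prodigy")
```

Anything with roughly this predict/update shape could, in principle, sit behind the same annotation workflow, whether it’s spaCy, another library, or a function you wrote yourself.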
So yes, sometimes you definitely do need to build things with spaCy to solve a particular problem. And in fact, even Prodigy isn’t a 100% satisfying wrapper around spaCy: spaCy also does the thing that’s confusing you here — it gives you mostly-good defaults that sometimes you need to go and change. And past that, sometimes you’ll need an entirely different solution: spaCy provides the best general-purpose NLP tools we know how to build, but that doesn’t mean no other NLP or ML tools will be useful for your problems.
Anyway, to answer your specific questions. It might help to remember that Prodigy’s main mission in life is to ask you questions so you can annotate data. All the processing tools within Prodigy are in service of that mission. spaCy’s mission is to add annotations to text, and make those annotations easy to work with. Prodigy needs annotations to ask its questions, so it calls into spaCy. But you could call into other solutions, even simple functions you write yourself.
The models in prodigy.models take an iterator over tasks and return an updated iterator, with scores attached and usually different questions. The purpose is to make a feed of questions for data annotation. They might be accidentally useful for other purposes, but that’s incidental.
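The stream-in, stream-out shape described above can be sketched in a few lines of plain Python. This is not Prodigy’s actual code — the task dicts and the `(score, task)` pairing are just an illustration of the general pattern, assuming tasks carry a `"text"` key:

```python
# Hedged sketch of the pattern: take an iterator of annotation tasks,
# yield (score, task) pairs with the model's score attached, so a
# downstream sorter can decide which questions to ask first.

def attach_scores(stream, score_fn):
    for task in stream:
        score = score_fn(task["text"])
        yield score, dict(task, score=score)


tasks = [{"text": "buy cheap pills"}, {"text": "lunch tomorrow?"}]
scored = list(attach_scores(tasks, lambda text: float("pills" in text)))
```

Because both input and output are iterators, these wrappers compose lazily: you can chain scoring, sorting, and filtering without materialising the whole stream.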
Well, spaCy provides two different NER algorithms, which is unfortunate: we really try to offer just one of everything. We added the second algorithm, beam search, specifically for Prodigy. We considered putting it inside Prodigy instead, but that would have hidden too much behind the closed-source curtain, and we strongly prefer to expose these things to you.
Beam search is important for producing varied NER questions. It’s designed to handle the case where the model is confidently wrong in its predictions. In that situation, beam search can still surface questions that help you guide the model back to the correct analysis. Without it, the annotation can get “stuck” in bad states.
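To see why the beam helps when the model is confidently wrong, here’s a toy beam search over per-token label probabilities. This is an illustration of the idea, not spaCy’s implementation: greedy decoding would keep only the single best partial analysis, while the beam keeps the k best, so lower-scoring but plausible analyses survive and can be turned into questions:

```python
import heapq

# Toy beam search over per-token label scores (illustrative only,
# NOT spaCy's implementation).

def beam_search(token_scores, k=3):
    """token_scores: one {label: prob} dict per token.
    Returns the k highest-scoring label sequences with their scores."""
    beam = [(1.0, [])]  # (sequence score, labels so far)
    for scores in token_scores:
        candidates = [
            (seq_score * p, labels + [label])
            for seq_score, labels in beam
            for label, p in scores.items()
        ]
        # Keep only the k best partial analyses.
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return beam


# The model strongly prefers "O" (no entity) at every token, but the
# beam still carries the entity reading as a lower-ranked analysis,
# so it can be surfaced as an annotation question.
scores = [{"O": 0.9, "B-PER": 0.1}, {"O": 0.8, "I-PER": 0.2}]
analyses = beam_search(scores, k=3)
```

A greedy decoder here would only ever show you the all-“O” analysis; the beam’s second and third candidates are exactly the alternatives you’d want to accept or reject to correct a confidently-wrong model.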
The EntityRecognizer.__call__ method is designed to give you this variety of questions, drawn from across the beam. If you’re just trying to add annotations to text, it’s almost certainly not what you want.