Hi @ines and @honnibal. I’m actively developing on top of spaCy for advanced chatbots and virtual assistants. I’m considering buying Prodigy for our startup, but wanted to see if you are considering some of our core use cases for the roadmap.
We would definitely make use of the text classification and manual NER interfaces, but some of our key bottlenecks for training are POS tagging, dependency labelling, and nested entity labelling. We are finding that specific language domains (transport, retail, etc.) need a well-labelled training set for best performance, and these are time-consuming to produce. Our current solution is a modified version of Brat for annotation, whose output we then convert to spaCy format for training (repeating as necessary), but this is still cumbersome.
So… any plans to add interfaces (similar to manual NER) for POS tags, dependencies, and nested entities? Nested entity labelling (e.g. base entities that comprise address components, datetime components, etc.) could possibly be achieved through spans, but it would be great if there were clear examples of using Prodigy for this purpose.
FYI - for long nested compound entities like addresses and datetimes that have variable structures, we’ve found they can cause a lot of errors during POS tagging and dependency labelling, and so it is sometimes best to detect and remove them (e.g. merge them) before those steps. It is a bit of a chicken-and-egg problem though, as the parse obviously helps define the “slot” that these long entities appear in. This is more of a spaCy question, but I’m always looking for good suggestions on how to do this more effectively.
Yes, an interface for dependency and relationship annotation is definitely very high on our list. In the meantime, you can check out my comments on thread, in which I outline some strategies for making it work with the current interfaces, or how to create your own interface.
So even if it turns out that your requirements are very specific, you'll always be able to mix and match the interfaces and create your own custom workflows. Prodigy makes very few assumptions about your task: it simply presents the data in the UI and stores the annotations as JSON.
For example, you could create a custom HTML template to display the annotation task however you like. Or you could repurpose the manual NER annotation interface to annotate POS tags instead of entities. This should work pretty well out-of-the-box for creating gold-standard data – you can simply pass in your POS tag scheme as the labels, annotate your corpus and export the JSON-formatted data.
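As a rough sketch of what to do with that export: the snippet below turns one exported record into `(token, tag)` pairs. It assumes each record carries a `tokens` list and `spans` with inclusive `token_start` / `token_end` indices and a `label`, which is how the manual interface usually stores highlighted spans. The helper name `spans_to_pos_tags` and the sample record are made up for illustration, so check them against your own exported data.

```python
import json


def spans_to_pos_tags(record, default_tag=None):
    """Map the labelled spans of one annotated record onto its tokens.

    Assumes `token_start` and `token_end` are inclusive token indices.
    Tokens that were never highlighted keep `default_tag`.
    """
    tokens = record["tokens"]
    tags = [default_tag] * len(tokens)
    for span in record.get("spans", []):
        for i in range(span["token_start"], span["token_end"] + 1):
            tags[i] = span["label"]
    return [(token["text"], tag) for token, tag in zip(tokens, tags)]


# Hypothetical line from an exported JSONL file (one record per line).
exported_line = json.dumps({
    "text": "Trains leave daily",
    "tokens": [
        {"text": "Trains", "start": 0, "end": 6, "id": 0},
        {"text": "leave", "start": 7, "end": 12, "id": 1},
        {"text": "daily", "start": 13, "end": 18, "id": 2},
    ],
    "spans": [
        {"start": 0, "end": 6, "token_start": 0, "token_end": 0, "label": "NOUN"},
        {"start": 7, "end": 12, "token_start": 1, "token_end": 1, "label": "VERB"},
        {"start": 13, "end": 18, "token_start": 2, "token_end": 2, "label": "ADV"},
    ],
    "answer": "accept",
})

record = json.loads(exported_line)
pairs = spans_to_pos_tags(record)
print(pairs)
```

Since every token needs a tag, passing something like `default_tag="X"` makes it easy to spot tokens the annotator skipped.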
Annotating nested entities can easily get fiddly, so we'd love to come up with a nice and efficient solution for annotating them. In many cases, it's actually easier (and more efficient!) for the human annotator to make more passes over the data, rather than doing too much at once. So even with the existing interfaces, you could do one round of annotating only the parent entities, e.g. the full date spans. For the second round, you can simply export the dataset and do another pass over the already annotated spans, highlighting only the "children".
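One way to set up that second pass, sketched under the assumption that each exported record has a `text`, an `answer`, and `spans` with character offsets `start` / `end` and a `label`: turn every accepted parent span into its own small task, so the annotator only sees the parent text when highlighting the children. The function name `child_tasks` and the `meta` keys are hypothetical.

```python
def child_tasks(records):
    """Yield one new annotation task per accepted parent span."""
    for record in records:
        if record.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        for span in record.get("spans", []):
            yield {
                # Only the parent entity's text is shown in the second round.
                "text": record["text"][span["start"]:span["end"]],
                # Keep enough context to map child offsets back later.
                "meta": {
                    "parent_label": span["label"],
                    "parent_start": span["start"],
                },
            }


# Hypothetical first-round annotation with one parent DATE span.
text = "Meet me on 3 May 2018 at noon"
start = text.index("3 May 2018")
first_round = [{
    "text": text,
    "spans": [{"start": start, "end": start + len("3 May 2018"), "label": "DATE"}],
    "answer": "accept",
}]

tasks = list(child_tasks(first_round))
print(tasks)
```

Adding `parent_start` to a child span's offsets gives its position in the original text, which is what you would need to merge both rounds into one nested annotation.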
(The upcoming version of Prodigy will also include an option to flag annotation tasks, so all the annotator would have to do is highlight the "parent" entity and flag the task if it contains a nested entity, to make it even easier to extract and reannotate the relevant examples.)
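Extracting the flagged examples could then be a simple filter over the exported records. This is only a sketch: the `flagged` key is an assumption about how the flag would be stored on a task, and the sample records are invented.

```python
def flagged_only(records, key="flagged"):
    """Return the records an annotator flagged for a second pass.

    `key` is an assumed field name; adjust it to whatever the export uses.
    """
    return [record for record in records if record.get(key)]


# Hypothetical exported records: only the second one was flagged.
records = [
    {"text": "See you tomorrow", "answer": "accept"},
    {"text": "Ship to 12 High Street, Leeds", "answer": "accept", "flagged": True},
]

to_reannotate = flagged_only(records)
print([record["text"] for record in to_reannotate])
```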