I have a couple of questions about the possibilities of implementing Prodigy in our workflow.
I am very new to the NLP world, and also new to Python. So my first question is whether I can use Prodigy productively nonetheless, or whether I would need a stronger background in NLP programming (spaCy …)?
Secondly, we are working with historical cooking recipes (14th/15th century) in Early New High German, Middle French and Latin, and our task would be the (semi-)automatic recognition of ingredients and tools. Is this a possible use case for Prodigy? I am talking about 10,000 individual cooking recipes that need annotation. We are using TEI/XML as an end product.
I would suggest setting aside machine learning-based approaches for your task. Machine learning would introduce a lot of complexity, and I don't think it's well motivated for your problem. There's a finite supply of texts you'll be working with; it's not like anyone is making more of them, after all. This means that a model won't have much utility beyond the task of getting your labelling done. I also think it will be really hard to train an accurate model, because you won't be able to leverage existing resources.
I think the best semi-automatic approach is probably the pattern matcher. You can write patterns based on token attributes, and use these to semi-automatically label the data. I think that’s a pretty good fit for your problem, and when you communicate the results, you can show exactly which patterns were used, so the process is much more transparent.
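To make that concrete, Prodigy's match patterns are typically supplied as a JSONL file, where each line pairs a label with either an exact phrase or a list of token-attribute dicts. The labels and words below are invented examples for illustration, not patterns tested on historical texts:

```jsonl
{"label": "INGREDIENT", "pattern": "safran"}
{"label": "INGREDIENT", "pattern": [{"lower": "honig"}]}
{"label": "TOOL", "pattern": [{"lower": "mörser"}]}
```

A string pattern is matched as an exact phrase, while a list of dicts matches on per-token attributes, so you can mix both styles in one file.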
As for level of programming expertise: it’s hard to speak to this with confidence, as different people have reported quite different experiences of picking up these technologies to perform a particular task. I think you’ll probably be okay. In the worst case, you won’t be able to make very effective use of the pattern matching, and you’ll have to mostly label things by hand. I think even that won’t be too terrible.
Thanks a lot for your quick answer. Just to make sure I understand you right, you mean spaCy's https://spacy.io/usage/rule-based-matching ? In order to use this efficiently, we would need correct PoS tagging first, right? Which is not easy to achieve for Early New High German, for example.
So, if I understand you correctly, you don't see much chance of us training our own model because our dataset is too small?
Yes, the matching is performed by spaCy's rule-based matcher, but Prodigy integrates with it, so you can use the match patterns to suggest candidates for annotation. You don't have to use part-of-speech tags in your patterns. You can define rules based purely on word features like the suffix, regular expressions, etc.
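Here's a rough sketch of what tag-free patterns look like with spaCy's `Matcher`. The blank German pipeline and the example words are just assumptions for illustration; a blank pipeline only tokenizes, so no tagger is needed:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline has only a tokenizer: no tagger, parser or NER.
# "de" is a stand-in here; any language spaCy can tokenize works.
nlp = spacy.blank("de")
matcher = Matcher(nlp.vocab)

# Hypothetical patterns: surface forms and a case-insensitive regex
# on the raw token text, instead of part-of-speech tags.
matcher.add("INGREDIENT", [
    [{"LOWER": "safran"}],
    [{"TEXT": {"REGEX": "(?i)zucker$"}}],
])
matcher.add("TOOL", [[{"LOWER": "morser"}]])

# Invented Early-New-High-German-flavoured sentence for the demo.
doc = nlp("Nim safran und zucker und stoz ez in dem morser.")
for match_id, start, end in matcher(doc):
    # Print each matched span with its label.
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

Because the patterns operate on surface features only, spelling variation in the sources would need to be covered by extra patterns or broader regexes.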
I think making a tagger is likely to be more trouble than it’s worth. For modern English or German, the appeal is that if you make the tagger, you can tag lots of text for a long time going forward. This justifies a lot of upfront expense to make the model. For you, the data you have to run the tagger over is limited. This means you have to complete the tagger very quickly for it to be at all worthwhile. If it takes too long to make the tagger, you would’ve been better off just labelling the text directly.