Newbie working with historical languages

chsteiner · March 25, 2019, 10:46am

I have a couple of questions about whether Prodigy could fit into our workflow.

I am very new to the NLP world, and also new to Python. So my first question is whether I can nonetheless use Prodigy productively, or whether I would need a stronger background in NLP programming (spaCy …)?

Secondly, we are working with historical cooking recipes (14th/15th century) in Early New High German, Middle French and Latin. Our task would be the (semi-)automatic recognition of ingredients and tools. Is this a possible use case for Prodigy? I am talking about 10,000 individual cooking recipes that need annotation. Our end product is TEI/XML.

Thanks a lot for your experience and opinions!

honnibal · March 25, 2019, 11:17am

Interesting use-case!

I would suggest setting aside machine learning-based approaches for your task. They would introduce a lot of complexity, and I don’t think that’s well motivated for your problem. There’s a finite supply of texts that you’ll be working with (it’s not like they’re making more of them, after all). This means that a model won’t have much utility outside of the task of getting your labelling done. I also think it’ll be really hard to get an accurate model, because you won’t be able to leverage existing resources.

I think the best semi-automatic approach is probably the pattern matcher. You can write patterns based on token attributes, and use these to semi-automatically label the data. I think that’s a pretty good fit for your problem, and when you communicate the results, you can show exactly which patterns were used, so the process is much more transparent.
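To make the pattern idea concrete, here is a minimal sketch of how a small seed list could be turned into a patterns file. It assumes the JSONL layout that Prodigy pattern files use (one JSON object per line, with a `label` and a `pattern` made of token descriptions). The labels `INGREDIENT` and `TOOL`, the spellings, and the helper names are all made up for illustration and are not taken from the actual corpus.

```python
import json

# Hypothetical seed terms; the spellings are illustrative only.
SEED_TERMS = {
    "INGREDIENT": ["saffran", "ingwer", "honig"],
    "TOOL": ["mörser", "pfanne"],
}


def build_patterns(seed_terms):
    """Turn {label: [terms]} into pattern entries, one token dict per word."""
    patterns = []
    for label, terms in seed_terms.items():
        for term in terms:
            # Match on the lowercased form so capitalisation does not matter.
            token_pattern = [{"lower": word} for word in term.lower().split()]
            patterns.append({"label": label, "pattern": token_pattern})
    return patterns


def to_jsonl(entries):
    """Serialise entries as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(entry, ensure_ascii=False) for entry in entries)


patterns = build_patterns(SEED_TERMS)
jsonl_text = to_jsonl(patterns)
print(jsonl_text)
```

Saved to a file, output like this is the kind of input that recipes such as `ner.manual` take through their `--patterns` argument. Because the file is plain text, it can also be published alongside the annotations to document exactly which rules produced the suggestions.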

As for level of programming expertise: it’s hard to speak to this with confidence, as different people have reported quite different experiences of picking up these technologies to perform a particular task. I think you’ll probably be okay. In the worst case, you won’t be able to make very effective use of the pattern matching, and you’ll have to mostly label things by hand. I think even that won’t be too terrible.

chsteiner · March 25, 2019, 11:59am

Thanks a lot for your quick answer. Just to make sure I understand you correctly: you mean spaCy’s https://spacy.io/usage/rule-based-matching ? To use this efficiently, we would need correct PoS tagging first, right? That is not easy to achieve with Early New High German, for example.
So you don’t see any chance for us in training our own model because our dataset is too small, if I understand you correctly?

honnibal · March 25, 2019, 12:18pm

Yes, the matching is performed by spaCy’s rule-based matcher. But Prodigy supports integration with it, so you can use the match patterns to suggest candidates for annotation. You don’t have to use tags and things in your patterns. You could define rules just based on word features like the suffix, regular expressions, etc.
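As a rough illustration of rules that rely only on the surface form of a word, with no PoS tags, here is a plain-Python sketch using the standard `re` module. It is a stand-in for the idea, not spaCy’s matcher itself. The rules, labels, and example sentence are invented, and the `start`/`end`/`label` span layout is assumed here as a convenient way to record character offsets.

```python
import re

# Hypothetical rules: each label maps to a regex tested against one lowercased word.
RULES = {
    "INGREDIENT": re.compile(r"^saf+r[ao]n$"),  # spelling variants: safran, saffran, saffron
    "TOOL": re.compile(r"pfannen?$"),           # suffix rule: any word ending in -pfanne(n)
}

WORD = re.compile(r"\w+")


def suggest_spans(text, rules=RULES):
    """Return character-offset spans for words that match one of the rules."""
    spans = []
    for match in WORD.finditer(text):
        word = match.group().lower()
        for label, rule in rules.items():
            if rule.search(word):
                spans.append(
                    {"start": match.start(), "end": match.end(), "label": label}
                )
                break  # first matching rule wins
    return spans


# Made-up example sentence, not a quotation from a historical recipe.
example = "Nim saffran und tu es in die bratpfanne"
spans = suggest_spans(example)
for span in spans:
    print(span["label"], example[span["start"]:span["end"]])
```

In spaCy’s matcher, the same kind of rule is written as a token pattern, for example `{"LOWER": {"REGEX": "^saf+r[ao]n$"}}`, so no tagger is needed for rules of this sort.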

I think making a tagger is likely to be more trouble than it’s worth. For modern English or German, the appeal is that if you make the tagger, you can tag lots of text for a long time going forward. This justifies a lot of upfront expense to make the model. For you, the data you have to run the tagger over is limited. This means you have to complete the tagger very quickly for it to be at all worthwhile. If it takes too long to make the tagger, you would’ve been better off just labelling the text directly.

chsteiner · March 25, 2019, 12:37pm

ok, thanks a lot for your input!
