text classification - is prodigy a good fit for the project?

Hey everyone,

I have been thinking about purchasing Prodigy for months now. Since it costs about 3/4 of my PhD "salary", I would like to check with you whether it is a good fit for my project. At this point, I am mostly concerned with the annotation step.

In a nutshell, I have a corpus of news articles. Each article must get two labels based on two criteria.

1] In the interface (where the annotation is done), I need to display the full text of the article and highlight certain keywords. I have found this thread,
In "textcat" recipes, is it possible to format the to-be-annotated texts?, which has been resolved, but I didn't understand it completely. Does the teach recipe simply show 'html' instead of 'text'? Is there any neat way to parse these HTML tags into the text that is currently being annotated on the fly, without having to preprocess the whole corpus first?

2] When reading the text classification workflow on the website, I am not completely sure I understand the move from first annotating a subset of the data to active learning. Or does the workflow actually suggest using active learning despite having no annotated examples at all? I don't have any model I could use for generating the suggestions, so my question is: are these two steps still separate, or did I miss something?

3] I would like to assign each example two codes at the same time. Each article chosen for annotation should therefore allow annotating two sets of single choices. Can this be done?

Thank you in advance!!

So first of all, I wish you'd seen the academic page! For PhD students we often grant a trial academic license, which we're pretty generous about extending: https://prodi.gy/academic . We approve these manually so there's often some delay, and we're not always able to provide the trial. But I think you'll probably be able to get started with Prodigy this way if you apply.

To answer your questions though:

  • If you need to have certain keywords highlighted, you can add a spans object to your stream examples, so you won't have to use HTML (there's a minimal sketch of this after the list). You can use an HTML view, but it makes things a lot harder, because you'll have to map to and from the text in your recipe in order to update the model (since the models all expect text, not HTML).

  • For textcat, you can use active learning from a cold start -- it will start off not really making helpful suggestions, and then gradually learn more. The same thing doesn't really work for NER, though, since ner.teach relies on the model predicting specific entities.

  • You can have the two labels as check-boxes, which would allow you to annotate both labels at the same time, while also keeping them non-mutually-exclusive (see the options sketch after the list).
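To make the spans suggestion concrete, here's a minimal sketch of a stream wrapper that attaches character-offset spans for matched keywords, so they show up highlighted in the annotation UI. The keyword list, label name and function name are my own placeholders, not part of Prodigy:

```python
import re

# Hypothetical keywords to highlight -- replace with your own list.
KEYWORDS = ["election", "parliament", "referendum"]
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE)

def add_keyword_spans(stream):
    """Wrap a stream of tasks and attach a 'spans' list with character
    offsets into the 'text' field, so the matches are highlighted."""
    for eg in stream:
        eg["spans"] = [
            {"start": match.start(), "end": match.end(), "label": "KEYWORD"}
            for match in PATTERN.finditer(eg["text"])
        ]
        yield eg
```

Because the spans only carry character offsets into the original text, there's nothing to map back from HTML when the model is updated.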
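And for the check-box idea, here's roughly what a task for the choice interface could look like, assuming a custom recipe that uses the "choice" view and sets "choice_style": "multiple" in its config; the option IDs and label texts below are placeholders:

```python
# Hypothetical task for the "choice" interface with multiple selection,
# so both coding schemes can be answered on the same article.
task = {
    "text": "Full text of the news article ...",
    "options": [
        {"id": "A_LABEL_1", "text": "Criterion A: label 1"},
        {"id": "A_LABEL_2", "text": "Criterion A: label 2"},
        {"id": "B_LABEL_1", "text": "Criterion B: label 1"},
        {"id": "B_LABEL_2", "text": "Criterion B: label 2"},
    ],
}
# With "choice_style": "multiple" in the recipe config, more than one
# option can be ticked per task (check-boxes instead of radio buttons);
# the selected option IDs come back in the annotation's "accept" list.
```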


@honnibal Thanks a lot! This has been really helpful!

This sounds like a more elegant solution! Could you please point me to an example where this has been done? Parsing in the HTML tags seems easy to grasp, as I would only use a regex to locate the words, wrap them in the tags, and then save the result into the "html" field of the same dict as the text.

I am trying to wrap my head around what this would mean for the classification task. I am classifying at two levels, each having a few labels plus 0 (irrelevant). After annotating using the check-boxes, would you suggest training a single classifier to get all the checkboxes right, or somehow splitting the annotated data into two subsets (if none of the relevant labels is checked, then label as 0)? Also, if I started using active learning from the get-go, wouldn't it actually be faster to train these two sets of labels separately, so as not to confuse the model?