Training on a regression task

Hi there, I am looking to train a BERT model to estimate the level of friendliness in text. The data comes from online discussion fora, and typically only a sentence or two will be evaluated. The friendliness level should be assigned on a continuous scale (hence "problem_type": "regression"), for example:

  • -3 for very unfriendly
  • -2 for unfriendly
  • -1 for somewhat unfriendly
  • 0 for neutral
  • 1 for somewhat friendly
  • 2 for friendly
  • 3 for very friendly

Every text will be evaluated by a number of annotators, and the target value for the text will be the average of all responses. Thus the target value can be, for example, 2.3 for a quite friendly text.

I think I can get away with using 7 categories for annotating the above-mentioned friendliness levels. However, it gets more complicated if I want to benefit from all the features offered by Prodigy, like training the model in a loop and picking the examples the model is most "confused about" for human evaluation (prefer uncertain). It seems the output should not be regarded as a set of discrete (unordered) categories, but rather as a continuous variable.

For example, with discrete categories the error/difference between 1.9 and 2.1 is considered "larger" than between 1.1 and, say, 1.8: in the first case the category label changes, while in the second both values stay within the same category, even though obviously 0.2 < 0.7.
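
To make the mismatch concrete, here's a quick illustration (treating each integer interval as one discrete category, i.e. floor-based binning):

for a, b in [(1.9, 2.1), (1.1, 1.8)]:
    print(round(abs(a - b), 1), int(a) == int(b))
# prints: 0.2 False  (small continuous error, but the category label changes)
#         0.7 True   (larger continuous error, yet the category stays the same)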

I have trained a proof of concept for this task using PyTorch, and now I would love to add more data with the help of Prodigy :slight_smile:

Can textcat.manual be easily modified to work with continuous output?

Would Prodigy help with selecting the most difficult examples (prefer uncertain) for human evaluation?

Thank you for any help or suggestions.

hi @LucySkywalker!

Thanks for your questions and welcome to the Prodigy community :wave:

I would recommend creating a custom Prodigy recipe with your existing PyTorch workflow. The good news is there's a text classification template with docs that describe how to do this. I've tried to answer both of your questions directly below.

So there are two parts to the question: the UI/interface (i.e., creating a way to capture the continuous output) and the aligned spaCy component to train your model.

UI/Interface

It is possible to create a custom Prodigy interface that allows continuous annotation. This is where you can use the concept of blocks to combine different interfaces.

For example, you can create a slider block and combine it with the text interface.
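Below is a minimal sketch of such a recipe, assuming a JSONL input file with a "text" field. The recipe name friendliness.manual, the friendliness field, and the HTML are illustrative placeholders, not a polished widget:

import prodigy
from prodigy.components.loaders import JSONL

# Illustrative range slider; window.prodigy.update() merges the chosen
# value into the current task, so it's saved with the annotation.
SLIDER_HTML = """
<label>Friendliness: <span id="score">0</span></label>
<input type="range" min="-3" max="3" step="0.1" value="0"
  oninput="document.getElementById('score').textContent = this.value;
           window.prodigy.update({friendliness: parseFloat(this.value)})" />
"""

@prodigy.recipe(
    "friendliness.manual",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file of texts", "positional", None, str),
)
def friendliness_manual(dataset, source):
    blocks = [
        {"view_id": "text"},                                # the task text
        {"view_id": "html", "html_template": SLIDER_HTML},  # the slider
    ]
    return {
        "dataset": dataset,
        "stream": JSONL(source),
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

With that saved as recipe.py, you'd start the server with something like prodigy friendliness.manual my_dataset ./texts.jsonl -F recipe.py.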

@tannonk also has a helpful GitHub repo with example recipes:

https://github.com/tannonk/prodigy_human_evaluation/tree/master/examples

There may be some HTML/JavaScript customization needed, but this will at least let you capture users' input in a continuous format.

Training / Model

This is more of a challenge. Out-of-the-box, textcat.manual is for training models using spaCy's TextCategorizer (the textcat or textcat_multilabel components). The problem is that, to my knowledge, neither of those components offers regression training.

Therefore, if you wanted to train your model with spaCy, you'd need to create a custom spaCy component that handles training on a continuous value (regression).
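
To give a sense of the shape (though not the training logic), here's a tiny inference-only sketch: a component that stores a continuous score in a Doc extension, with predict_friendliness() as a placeholder for your own model. A fully trainable component would additionally need a Thinc model, a loss, an update step, and so on:

from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute to hold the continuous prediction.
Doc.set_extension("friendliness", default=None)

def predict_friendliness(text):
    # Placeholder for your regression model's forward pass.
    return 0.0

@Language.component("friendliness_scorer")
def friendliness_scorer(doc):
    doc._.friendliness = predict_friendliness(doc.text)
    return doc

# Usage: nlp.add_pipe("friendliness_scorer")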

An alternative: use your existing PyTorch model/setup

Given you have a PyTorch PoC, I would recommend skipping spaCy and using Prodigy to create your own PyTorch workflow.

There's a section in the Text Classification documentation on how to create a custom Prodigy recipe for a different model workflow.

As linked in those docs, I'd recommend starting with the textcat_custom_model.py example script.

As for your second question: yes! See the sub-section in those docs. You can use one of Prodigy's sorters to specify how Prodigy should use active learning to reorder your records for annotation:

from prodigy.components.sorters import prefer_uncertain

model = Model()                    # wraps your model; see the sketch below
stream = model(stream)             # yields (score, example) tuples
stream = prefer_uncertain(stream)  # favors examples scored near 0.5
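
Since Model() in that snippet is pseudocode, here's a purely illustrative sketch of what the wrapper could look like for regression. The uncertainty heuristic (distance of the prediction from the nearest integer level) and the predict/update placeholders are my assumptions, not something Prodigy prescribes; prefer_uncertain expects scores in [0, 1] and prioritizes those closest to 0.5:

class Model:
    """Illustrative wrapper around your own PyTorch regression model."""

    def __init__(self, pytorch_model):
        self.pytorch_model = pytorch_model

    def __call__(self, stream):
        for eg in stream:
            pred = self.predict(eg["text"])
            # Predictions far from any integer level count as uncertain:
            # distance is in [0, 0.5], so the score lands in [0.5, 1.0],
            # with 0.5 meaning "most uncertain".
            distance = abs(pred - round(pred))
            eg["meta"] = {"predicted": pred}
            yield 1.0 - distance, eg

    def predict(self, text):
        # Placeholder: run your PyTorch forward pass and return a float.
        ...

    def update(self, answers):
        # Optionally return this from your recipe as the "update"
        # callback, so Prodigy trains the model in the loop on each
        # batch of answered examples.
        ...

If you return "update": model.update from your custom recipe, the model keeps learning as annotations come in.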

After implementing this workflow, you'll likely still need to use the slider example above so that annotators can provide their continuous annotation value.

I hope this helps and let us know if you have questions (or please post back if you make progress!).


@ryanwesslen many thanks for your tips and suggestions! :smile: I'll be trying to implement them in the next few weeks and will post back on my progress...