Training on a regression task

Hi there, I am looking to train a BERT model to estimate the level of friendliness in text. The data comes from online discussion fora, and typically only a sentence or two will be evaluated. The friendliness level should be assigned on a continuous scale (hence "problem_type": "regression"), for example:

  • -3 for very unfriendly
  • -2 for unfriendly
  • -1 for somewhat unfriendly
  • 0 for neutral
  • 1 for somewhat friendly
  • 2 for friendly
  • 3 for very friendly

Every text will be evaluated by a number of annotators, and the target value for the text will be the average of all responses. Thus the target value can be, for example, 2.3 for a quite friendly text.

I think I can get away with using 7 categories for annotating the above-mentioned friendliness levels. However, it gets more complicated if I want to benefit from all the features offered by Prodigy, like training the model in a loop and picking the examples the model is most "confused about" for human evaluation (prefer uncertain). It seems the output should not be regarded as a set of discrete (unordered) categories, but rather as a continuous variable.

For example, with discrete categories the error/difference between 1.9 and 2.1 is considered "larger" than between 1.1 and, say, 1.8: in the first case the category label changes, while in the second both values stay within the same category, even though obviously 0.2 < 0.7.
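
To make the mismatch concrete, here's a quick illustration (treating each integer interval as one discrete category, i.e. floor-based binning):

for a, b in [(1.9, 2.1), (1.1, 1.8)]:
    print(round(abs(a - b), 1), int(a) == int(b))
# prints: 0.2 False  (small continuous error, but the category label changes)
#         0.7 True   (larger continuous error, yet the category stays the same)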

I have trained a proof of concept for this task using PyTorch, and now I would love to add more data with the help of Prodigy :slight_smile:

Can textcat.manual be easily modified to work with continuous output?

Would Prodigy help with selecting the most difficult examples (prefer uncertain) for human evaluation?

Thank you for any help or suggestions.

hi @LucySkywalker!

Thanks for your questions and welcome to the Prodigy community :wave:

I would recommend creating a custom Prodigy recipe with your existing PyTorch workflow. The good news is there's a text classification template with docs that describe how to do this. I've tried to answer both of your questions directly below.

So there are two parts to the question: the UI/interface (i.e., creating a way to capture the continuous output) and the aligned spaCy component to train your model.

UI/Interface

It is possible to create a custom Prodigy interface that allows continuous annotation. This is where you can use the concept of blocks to combine different interfaces.

For example, you can create a slider block and combine it with the text interface.
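Below is a minimal sketch of such a recipe, assuming a JSONL input file with a "text" field. The recipe name friendliness.manual, the friendliness field, and the HTML are illustrative placeholders, not a polished widget:

import prodigy
from prodigy.components.loaders import JSONL

# Illustrative range slider; window.prodigy.update() merges the chosen
# value into the current task, so it's saved with the annotation.
SLIDER_HTML = """
<label>Friendliness: <span id="score">0</span></label>
<input type="range" min="-3" max="3" step="0.1" value="0"
  oninput="document.getElementById('score').textContent = this.value;
           window.prodigy.update({friendliness: parseFloat(this.value)})" />
"""

@prodigy.recipe(
    "friendliness.manual",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file of texts", "positional", None, str),
)
def friendliness_manual(dataset, source):
    blocks = [
        {"view_id": "text"},                                # the task text
        {"view_id": "html", "html_template": SLIDER_HTML},  # the slider
    ]
    return {
        "dataset": dataset,
        "stream": JSONL(source),
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

With that saved as recipe.py, you'd start the server with something like prodigy friendliness.manual my_dataset ./texts.jsonl -F recipe.py.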

@tannonk also has a helpful GitHub repo with example recipes:

https://github.com/tannonk/prodigy_human_evaluation/tree/master/examples

There may be some HTML/JavaScript customization needed, but this will at least let you capture users' input in a continuous format.

Training / Model

This is more of a challenge. Out-of-the-box, textcat.manual is for training models using spaCy's TextCategorizer (the textcat or textcat_multilabel components). The problem is that, to my knowledge, neither of those components offers regression training.

Therefore, if you wanted to train your model with spaCy, you'd need to create a custom spaCy component that handles training on a continuous value (regression).
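
To give a sense of the shape (though not the training logic), here's a tiny inference-only sketch: a component that stores a continuous score in a Doc extension, with predict_friendliness() as a placeholder for your own model. A fully trainable component would additionally need a Thinc model, a loss, an update step, and so on:

from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute to hold the continuous prediction.
Doc.set_extension("friendliness", default=None)

def predict_friendliness(text):
    # Placeholder for your regression model's forward pass.
    return 0.0

@Language.component("friendliness_scorer")
def friendliness_scorer(doc):
    doc._.friendliness = predict_friendliness(doc.text)
    return doc

# Usage: nlp.add_pipe("friendliness_scorer")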

An alternative: use your existing PyTorch model/setup

Given you have a PyTorch PoC, I would recommend skipping spaCy and using Prodigy to create your own PyTorch workflow.

There's a section in the Text Classification documentation on how to create a custom Prodigy recipe for a different model workflow.

As linked in those docs, I'd recommend starting with the textcat_custom_model.py example script.

As for your second question: yes! See the sub-section in those docs. You can use one of Prodigy's sorters to specify how Prodigy should use active learning to reorder your records for annotation:

from prodigy.components.sorters import prefer_uncertain

model = Model()                    # wraps your model; see the sketch below
stream = model(stream)             # yields (score, example) tuples
stream = prefer_uncertain(stream)  # favors examples scored near 0.5
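
Since Model() in that snippet is pseudocode, here's a purely illustrative sketch of what the wrapper could look like for regression. The uncertainty heuristic (distance of the prediction from the nearest integer level) and the predict/update placeholders are my assumptions, not something Prodigy prescribes; prefer_uncertain expects scores in [0, 1] and prioritizes those closest to 0.5:

class Model:
    """Illustrative wrapper around your own PyTorch regression model."""

    def __init__(self, pytorch_model):
        self.pytorch_model = pytorch_model

    def __call__(self, stream):
        for eg in stream:
            pred = self.predict(eg["text"])
            # Predictions far from any integer level count as uncertain:
            # distance is in [0, 0.5], so the score lands in [0.5, 1.0],
            # with 0.5 meaning "most uncertain".
            distance = abs(pred - round(pred))
            eg["meta"] = {"predicted": pred}
            yield 1.0 - distance, eg

    def predict(self, text):
        # Placeholder: run your PyTorch forward pass and return a float.
        ...

    def update(self, answers):
        # Optionally return this from your recipe as the "update"
        # callback, so Prodigy trains the model in the loop on each
        # batch of answered examples.
        ...

If you return "update": model.update from your custom recipe, the model keeps learning as annotations come in.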

After implementing this workflow, you'll likely still need to use the slider example above so that annotators can provide their continuous annotation value.

I hope this helps and let us know if you have questions (or please post back if you make progress!).


@ryanwesslen many thanks for your tips and suggestions! :smile: I'll be trying to implement them in the next few weeks and will post back on my progress...