Some quick feedback

Hey!
I’ve just downloaded and tried Prodigy, and here is some quick feedback.

  • I installed Prodigy in a virtualenv (Python 3.6, Fedora), and the .prodigy folder was not created in ~, so I had to create it manually and add a prodigy.json file.
  • Before creating them manually, I tried to create a new dataset. But as there was no .prodigy folder, a peewee.OperationalError: unable to open database file error was raised. Maybe this error could be caught and replaced with a more explicit message (missing config file, for instance).
  • After starting the web server (with prodigy textcat.teach, for example), you could display the HTTP link to open (such as http://127.0.0.1:8000), so that we could open the web view in one click instead of typing the address manually in the browser. Not really a must-have, but it would be nice.
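
To illustrate the second point, here is a minimal sketch of catching the low-level error and re-raising it with a more explicit message. It uses the standard-library sqlite3 module rather than peewee, and the helper name open_database, the prodigy.db file name and the message wording are all hypothetical, not Prodigy's actual code.

```python
import os
import sqlite3


def open_database(home_dir, filename="prodigy.db"):
    """Open an SQLite database in home_dir, with a clearer error if it fails.

    Hypothetical helper for illustration only; not part of Prodigy.
    """
    path = os.path.join(home_dir, filename)
    try:
        return sqlite3.connect(path)
    except sqlite3.OperationalError as err:
        # The raw error only says "unable to open database file";
        # add the path and a hint about the missing directory.
        raise RuntimeError(
            f"Can't open database at {path}. Does the directory "
            f"{home_dir} exist? (original error: {err})"
        ) from err
```

With a directory that does not exist, the caller then sees the path that was tried instead of only the generic database error.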

Most (not to say all) of the texts I annotate are in French, so I currently lack a real-world case to test Prodigy's capabilities. Fortunately, a French model is on its way.

Which brings me to another suggestion :wink:
Would it be possible to use external models (written in Keras, TensorFlow, …) in Prodigy?

Hi Raphael!

Excited to have you trying Prodigy :slight_smile:. We’ll take care of those install/setup warts, thanks a lot for the feedback.

About the text classifier: Prodigy doesn’t really use any of the pre-trained components for the text classification at the moment, except perhaps sentence segmentation. You should therefore be fine to give it a blank model. We should have a mode where this is created for you, but in the meantime you could do:


import spacy

nlp = spacy.blank('fr')
nlp.to_disk('/tmp/fr')

This will write out a spaCy model with the correct directory structure (the spacy.blank() function is new — it’s the twin of spacy.load(); it just gives you an nlp object with the language code.)

We’ve just been putting the finishing touches on support for longer texts in the text classifier. The strategy is to segment the text into sentences, and then ask you about the sentence it finds most relevant. For now, the short text classifier should already be working well.
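
To make the strategy concrete, here is a rough sketch of "split into sentences, then pick the most relevant one". It is not how Prodigy implements it: the regex splitter is deliberately naive, and split_sentences, most_relevant_sentence and keyword_score are hypothetical names, with keyword_score standing in for a real model's score.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def most_relevant_sentence(text, score_sentence):
    """Return (score, sentence) for the highest-scoring sentence, or None."""
    sentences = split_sentences(text)
    if not sentences:
        return None
    scored = [(score_sentence(sent), sent) for sent in sentences]
    return max(scored, key=lambda pair: pair[0])


def keyword_score(sentence, keywords=("prix", "livraison")):
    """Toy scorer (stand-in for a model): count keyword occurrences."""
    words = re.findall(r"\w+", sentence.lower())
    return sum(word in keywords for word in words)


text = "Bonjour à tous. Le prix de la livraison est trop élevé ! Merci."
print(most_relevant_sentence(text, keyword_score))
```

In a real setup the scorer would be the text classifier itself, and a proper sentence segmenter would replace the regex.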

You can also use Prodigy with models from other libraries. We’re still working on the examples for this, but if you want to start tinkering, there are three ways you could plug in the model:

  1. Inside the (brand new) spaCy TextCategorizer class. Implementation is here: https://github.com/explosion/spaCy/blob/develop/spacy/pipeline.pyx#L538 . You could pass in a model as the model keyword arg on initialization, or subclass and override the Model method.

  2. Write your own wrapper, but still use spaCy. Instead of adapting the TextCategorizer class, you could write your own, and append it to nlp.pipeline. Your class will just need to add the label scores to the doc.cats dictionary. The key is expected to be a string, and the value should be a float between 0 and 1.

  3. You could avoid use of spaCy altogether, and instead write a recipe function that uses your own prediction and update functions. You can see the built-in recipe function here: https://prodi.gy/docs/workflow-text-classification#recipe . There should be more details about this in the readme as well. Briefly, you should apply the prediction part of your model in a generator, that yields (score, example) tuples. You’ll then set an update callback, in the dict returned from your recipe. Here’s an example with a trivial unigram perceptron model (warning – untested code; I just typed this up…)

import sys
from collections import Counter

import prodigy
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe('my_textcat_teach',
    dataset=prodigy.recipe_args['dataset'],
)
def custom_textcat_teach(dataset, label=''):
    """Annotate texts to train a new text classification label, using a toy classifier"""
    # Read one text per line from standard input
    stream = ({'text': line.strip(), 'label': label} for line in sys.stdin)

    # Unigram weights; missing words count as 0
    weights = Counter()

    def get_score(words):
        return sum(weights[word] for word in words)

    def model(stream):
        # Prediction: yield (score, example) tuples for the sorter
        for eg in stream:
            yield get_score(eg['text'].split()), eg

    def update(answers):
        # Perceptron-style update from the annotator's decisions
        for eg in answers:
            words = eg['text'].split()
            score = get_score(words)
            if eg['answer'] == 'accept' and score < 1:
                gradient = -1
            elif eg['answer'] == 'reject' and score > 0:
                gradient = 1
            else:
                gradient = 0
            if gradient != 0:
                for word in words:
                    weights[word] -= gradient

    return {
        'dataset': dataset,
        'view_id': 'classification',
        'stream': prefer_uncertain(model(stream)),
        'update': update,
        'config': {'label': label}
    }
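
For the second option above (your own wrapper that writes to doc.cats), a minimal sketch could look like the following. To keep it runnable without spaCy, FakeDoc is a stand-in for a real Doc, and ExternalModelCategorizer and toy_predict are hypothetical names; with spaCy you would pass real Doc objects and add the component to the pipeline in whatever way your spaCy version expects.

```python
class FakeDoc:
    """Stand-in for a spaCy Doc: just the text plus a cats dictionary."""

    def __init__(self, text):
        self.text = text
        self.cats = {}


class ExternalModelCategorizer:
    """Hypothetical pipeline component wrapping an external model."""

    def __init__(self, label, predict):
        self.label = label      # string key to use in doc.cats
        self.predict = predict  # callable: text -> raw score

    def __call__(self, doc):
        score = float(self.predict(doc.text))
        # Clamp to [0, 1], since the value is expected to be in that range
        doc.cats[self.label] = min(1.0, max(0.0, score))
        return doc


def toy_predict(text):
    """Placeholder for a Keras/TensorFlow model's predict call."""
    return 0.9 if "football" in text.lower() else 0.1


component = ExternalModelCategorizer("SPORT", toy_predict)
doc = component(FakeDoc("Le match de football commence à 20h."))
print(doc.cats)
```

The only contract the component has to honour is the one described above: string keys and float values between 0 and 1 in doc.cats.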

Thanks for your feedback! Will look into the .prodigy stuff – there have been some other weirdnesses with that, so we might have to find a better way of doing this in the setup.py.

"after starting the webserver (with prodigy textcat.teach for example), you could display the http link to open (such as http://127.0.0.1:8000)"

Sure, that's no problem (and so obvious – no idea why we didn't think of this!)

Btw, if you already have text classification annotations in French from some other project, you could also import them to a new dataset and then run textcat.batch-train with no input model and --lang fr. This will create a blank French model as the base model and train it with your existing annotations. See the first steps workflow for an example of how to import existing annotations. Any format supported by Prodigy should work – for example, a JSON or CSV file with keys/columns text and label.
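
If your existing annotations live in a CSV file, a small conversion like the one below gives you records with the text and label keys. This is only a sketch: the function name csv_to_records and the sample rows are made up, and you should check the docs for any other keys the import step expects.

```python
import csv
import io
import json


def csv_to_records(csv_file):
    """Read CSV rows with 'text' and 'label' columns into a list of dicts."""
    return [
        {"text": row["text"], "label": row["label"]}
        for row in csv.DictReader(csv_file)
    ]


# Made-up sample data; in practice, open your own CSV file instead
source = io.StringIO(
    "text,label\n"
    "Le match était superbe,SPORT\n"
    "La bourse recule,ECONOMIE\n"
)
records = csv_to_records(source)

# ensure_ascii=False keeps accented characters readable in the output
print(json.dumps(records, ensure_ascii=False, indent=2))
```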

Thank you for the responses! I will try to use a blank French model.
Great piece of software by the way :ok_hand:

Thanks! Just pushed another update that adds the full URL to the web server startup message, and addresses the .prodigy issue. The path to the Prodigy home directory is now also displayed when you run prodigy stats, so you can easily check which path is used.

There’s also a new, experimental --long-text mode in textcat.teach that extracts the most relevant sentences from very long examples, and displays them in context in the web app. This way, the Prodigy interface can handle long-text classification more easily – and the annotator has to read less and can move faster.
