Introducing recipes to bootstrap annotation via OpenAI GPT3

We're looking forward to publishing more along these lines, and I'm sure there will be lots to tweak and refine in the prompts. We haven't carried out very thorough experiments with this, so please let us know what you find.

Below I've reproduced the text of my Twitter/Mastodon thread, explaining what this is and why :slightly_smiling_face:

We've been working on new Prodigy workflows that let you use the @OpenAI API to kickstart your annotations, via zero- or few-shot learning. We've just published the first recipe, for NER annotation :tada:. Here's what, why and how. :thread:

Let's say you want to do some 'traditional' NLP thing, like extracting information from text. The information you want to extract isn't on the public web — it's in this pile of documents you have sitting in front of you.

So how can models like GPT3 help? One answer is zero- or few-shot learning: you prompt the model with something like "Annotate this text for these entities", and you append your text to the prompt. This works surprisingly well! It was demonstrated in the original GPT3 paper.
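To make the prompting idea concrete, here's a rough sketch of how such a zero-shot prompt could be assembled. The function name `build_ner_prompt` and the exact wording of the instructions are assumptions for illustration, not the prompt the published recipe uses.

```python
def build_ner_prompt(text, labels):
    """Build a zero-shot NER prompt: instructions first, then the text.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    label_list = ", ".join(labels)
    lines = [
        "Annotate this text for these entities: " + label_list + ".",
        "Answer with one line per label, in the form 'label: span1, span2'.",
        "",
        "Text:",
        text,  # the text to annotate is appended to the end of the prompt
    ]
    return "\n".join(lines)


prompt = build_ner_prompt(
    "Whisk the eggs in a large bowl.",
    ["ingredient", "dish", "equipment"],
)
print(prompt)
```

Note there are no examples in the prompt, only the label names. A few-shot variant would add a handful of annotated texts between the instructions and the final text.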

However, zero-shot classifiers really aren't good enough for most applications. The prompt just doesn't give you enough control over the model's behaviour.

Machine learning is basically programming by example: instead of specifying a system's behaviour with code, you (imperfectly) specify the desired behaviour with training data.

Well, zero-shot learning is like that, but without the training data. That does have some advantages — you don't have to tell it much about what you want it to do. But it's also pretty limiting. You can't tell it much about what you want it to do.

So, let's compromise. We'll pipe our data through the OpenAI API, prompting it to suggest entities for us. But instead of just shipping whatever it suggested, we're going to go through and correct its annotations. Then we'll save those out and train a much smaller supervised model.
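As a rough sketch of the "suggest, then correct" step: the model's reply has to be turned into spans a human can accept or fix. The code below assumes a made-up reply format (`label: phrase, phrase`, one line per label) and a hypothetical helper `parse_suggestions`; the real recipe's parsing may differ. The output uses `start`/`end`/`label` dictionaries, the general shape of span annotations for a text.

```python
def parse_suggestions(text, response, labels):
    """Turn 'label: a, b' lines into character-offset spans found in text.

    Hypothetical helper; the response format is an assumption.
    """
    spans = []
    allowed = {label.lower() for label in labels}
    lowered = text.lower()
    for line in response.splitlines():
        if ":" not in line:
            continue  # skip lines that don't follow the assumed format
        label, _, phrases = line.partition(":")
        label = label.strip().lower()
        if label not in allowed:
            continue  # ignore labels we never asked for
        for phrase in phrases.split(","):
            phrase = phrase.strip()
            if not phrase:
                continue
            # Only the first occurrence is located; a fuller version
            # would handle repeated mentions.
            start = lowered.find(phrase.lower())
            if start == -1:
                continue  # the model mentioned something not in the text
            spans.append({"start": start, "end": start + len(phrase), "label": label})
    return sorted(spans, key=lambda span: span["start"])


text = "Whisk the eggs in a large bowl."
response = "ingredient: eggs\nequipment: bowl\ndish:"
task = {"text": text, "spans": parse_suggestions(text, response, ["ingredient", "dish", "equipment"])}
print(task)
```

The suggestions are only a starting point: anything the model gets wrong is corrected by the annotator before the examples are saved and used to train the smaller supervised model.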

This workflow looks pretty promising from initial testing. The model provides useful suggestions for categories like "ingredient", "dish" and "equipment" just from the labels, with no examples. And the precision isn't bad — I was impressed that it avoided marking "Goose" here.

I especially like this zero-shot learning workflow because it's a great example of what we've always set out to achieve with Prodigy. Two distinct features of Prodigy are its scriptability and the ease with which you can scale down to a single-person workflow.

Modern neural networks are very sample efficient, because they use transfer learning to acquire most of their knowledge. You just need enough examples to define your problem. If annotation is mostly about problem definition, iteration is much more important than scaling.

The key to iteration speed is letting a small group of people — ideally just you! — annotate faster. That's where the scriptability comes in. Every problem is different, and we can't guess exactly what tool assistance or interface will be best. So we let you control that.

We didn't have to make any changes to Prodigy itself for this workflow — everything happens in the "recipe" script. You can build other things at least this complex for yourself, or you can start from one of our scripts and modify it according to your requirements.

If you don't have Prodigy, you can get a copy here: prodi.gy/buy. We sell Prodigy in a very old-school way, with a once-off fee for software you run yourself. There's no free download, but we're happy to issue refunds, and we can host trials for companies.


Hi, I'm trying to follow this command but I'm getting the following error:

Traceback (most recent call last):
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 384, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 73, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 170, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 104, in prodigy.components.feeds.Feed.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 150, in prodigy.components.feeds.Feed._init_stream
  File "cython_src/prodigy/components/stream.pyx", line 107, in prodigy.components.stream.Stream.__init__
  File "cython_src/prodigy/components/stream.pyx", line 58, in prodigy.components.stream.validate_stream
  File "/home/poa3/openai_ner.py", line 191, in format_suggestions
    for example in stream:
  File "cython_src/prodigy/components/preprocess.pyx", line 165, in add_tokens
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/site-packages/spacy/language.py", line 1545, in pipe
    for doc in docs:
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/site-packages/spacy/language.py", line 1589, in pipe
    for doc in docs:
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/site-packages/spacy/language.py", line 1586, in <genexpr>
    docs = (self._ensure_doc(text) for text in texts)
  File "/home/poa3/anaconda3/envs/prodigy/lib/python3.10/site-packages/spacy/language.py", line 1535, in <genexpr>
    docs_with_contexts = (
  File "cython_src/prodigy/components/preprocess.pyx", line 158, in genexpr
  File "/home/poa3/openai_ner.py", line 169, in stream_suggestions
    prompts = [
  File "/home/poa3/openai_ner.py", line 171, in <listcomp>
    eg["text"], labels=self.labels, examples=self.examples
TypeError: list indices must be integers or slices, not str

My examples file is formatted the same way as the one on GitHub:

  text: "Current symptoms of dyspnea are consistent with NYHA Class II-III and also has occasional exertional chest pressure"
  entities:
    measure:
      - NYHA Class II-III

Interesting. Could you share the command that you ran, ideally with an example of the data that you gave it? I just ran openai.ner.fetch locally and didn't get this error, but it could be that OpenAI returns a response that's unexpected. I can try to reproduce it locally if I have your example though!
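In the meantime, the last frame of the traceback shows `eg["text"]` failing because `eg` is a list rather than a dict. One thing worth checking is whether every record in the input parses to an object with a `"text"` key. Here's a rough sketch of such a check, assuming the input is a JSONL file; `check_jsonl_lines` is a hypothetical helper, not part of Prodigy.

```python
import json


def check_jsonl_lines(lines):
    """Return (line_number, problem) pairs for records that are not
    JSON objects with a "text" key. Hypothetical helper for debugging."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        record = json.loads(line)
        if not isinstance(record, dict):
            problems.append((number, "expected an object, got " + type(record).__name__))
        elif "text" not in record:
            problems.append((number, 'missing "text" key'))
    return problems


# Indexing a list with a string key raises the same TypeError as above.
eg = ["some text"]
try:
    eg["text"]
except TypeError as err:
    print(err)

print(check_jsonl_lines(['{"text": "fine"}', '["a", "list"]']))
```

To run it on a file, you could pass the open file handle directly, since iterating over it yields lines.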