✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more

When I try to train with a custom model for tokenization:
prodigy train-curve --ner gen-onco-train -m en_core_web_sm_pathology --eval-split 0.2

I'm getting:


ℹ Using config from base model
✔ Generated training config

=========================== Train curve diagnostic ===========================
Training 4 times with 25%, 50%, 75%, 100% of the data

%      Score    ner
----   ------   ------

...

  File "/Applications/anaconda3/envs/nlp/lib/python3.7/site-packages/prodigy/recipes/train.py", line 331, in train_curve
    config, gpu_id=gpu_id, overrides=overrides, silent=True
  File "/Applications/anaconda3/envs/nlp/lib/python3.7/site-packages/prodigy/recipes/train.py", line 172, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "/Applications/anaconda3/envs/nlp/lib/python3.7/site-packages/spacy/training/loop.py", line 115, in train
    raise e
  File "/Applications/anaconda3/envs/nlp/lib/python3.7/site-packages/spacy/training/loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/Applications/anaconda3/envs/nlp/lib/python3.7/site-packages/spacy/training/loop.py", line 192, in train_while_improving
    for step, (epoch, batch) in enumerate(train_data):
  File "/Applications/anaconda3/envs/nlp/lib/python3.7/site-packages/spacy/training/loop.py", line 303, in create_train_batches
    raise ValueError(Errors.E986)
ValueError: [E986] Could not create any training batches: check your input. Are the train and dev paths defined? Is `discard_oversize` set appropriately?

I can't find discard_oversize in the docs, and prodigy train-curve --help | grep discard_oversize returns nothing. How can I proceed here?

Could you share the error that you see?

The training happens in spaCy, so discard_oversize is a spaCy config parameter in this case. I don't think that's really the problem here, though.

How many examples are in gen-onco-train? Is it possible that you end up with no examples to train from? You can also try setting the environment variable PRODIGY_LOGGING=basic, which should show you more details.
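For example (a sketch, assuming a Unix-like shell; prodigy stats with a dataset name should print the example counts):

prodigy stats gen-onco-train

PRODIGY_LOGGING=basic prodigy train-curve --ner gen-onco-train -m en_core_web_sm_pathology --eval-split 0.2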

Thanks for your attention and help with my issue. I think the new spans.manual UI could be extremely powerful for information extraction purposes in general. To support this, it would be nice if annotated spans could be edited afterwards (to fine-tune the boundaries of certain spans in complex, overlapping situations). For information extraction it could also be very helpful if pattern matching could catch token sub-spans as well (perhaps in a recipe without any model tokenization requirement). I hope this doesn't open Pandora's box and that the feedback is still helpful.

Hello, I am trying to replicate the example of multi-class classification annotations: https://prodi.gy/docs/computer-vision#classification-multi

When I run the recipe.py script, I get the following error:

(prodigy_env) andreykormilitzin@Andreys-MBP annotations_for_images % python -m prodigy classify-images sunglasses_brands ./data_imgs -F recipe.py
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/andreykormilitzin/Documents/virtual_envs/prodigy_env/lib/python3.9/site-packages/prodigy/__main__.py", line 54, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 339, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 361, in prodigy.core._components_to_ctrl
  File "cython_src/prodigy/core.pyx", line 141, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 155, in prodigy.components.feeds.SharedFeed.validate_stream
  File "/Users/andreykormilitzin/Documents/virtual_envs/prodigy_env/lib/python3.9/site-packages/toolz/itertoolz.py", line 376, in first
    return next(iter(seq))
  File "/Users/andreykormilitzin/Documents/annotations_for_images/recipe.py", line 17, in get_stream
    for eg in stream:
  File "cython_src/prodigy/components/loaders.pyx", line 216, in Images
  File "cython_src/prodigy/components/loaders.pyx", line 249, in Base64
  File "cython_src/prodigy/util.pyx", line 621, in prodigy.util.file_to_b64
  File "cython_src/prodigy/util.pyx", line 623, in prodigy.util.file_to_b64
  File "cython_src/prodigy/util.pyx", line 633, in prodigy.util.bytes_to_b64
AttributeError: module 'base64' has no attribute 'encodestring'

I'm using Prodigy nightly v1.11.0a8, Python 3.9.5

Thanks for the report, and ugh, it looks like the builtin base64.encodestring was removed in Python 3.9 in favour of base64.encodebytes. So we'll need to add a special condition for that.

The easiest workaround for you would probably be to just use a Python 3.8 environment for now. Alternatively, you could also use your own function to encode the images as base64 strings (instead of using the Images loader). The function itself is pretty simple (shown here with the Python 3.9-compatible encodebytes):

import base64

def bytes_to_b64(data, mimetype):
    # base64.encodestring was removed in Python 3.9; encodebytes is the
    # drop-in replacement (available since Python 3.1)
    encoded = base64.encodebytes(data).splitlines()
    data64 = "".join([b.decode("utf8") for b in encoded])
    return f"data:{mimetype};base64,{data64}"

Hi @ines, many thanks for your solution! I eventually created another venv with Python 3.8.10 and everything worked like a charm.

Another question: is there a way to display the images larger for annotation? Currently they're rendered really small, and I need to see finer details.

Thanks.

The image width (assuming the image itself is large enough) adjusts to the width of the annotation card. You can customise that via the cardMaxWidth setting in the custom theme (see the "Web Application" section of the Prodigy docs). The value can either be a number in pixels, or something like "90%", which will be relative to the available space.
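
For example, in your prodigy.json (the pixel value here is just an illustration):

{
  "custom_theme": {"cardMaxWidth": 1000}
}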

Hi @ines, is Prodigy nightly the only version that is compatible with the en_core_web_trf model? I'm currently using Prodigy v1.10.8.

I just applied for the nightly program. Unfortunately, I lost my order ID when I switched computers, but I included the license key. Would that be okay?

Hi!

Yes, en_core_web_trf (https://spacy.io/models/en#en_core_web_trf) is only compatible with spaCy v3 and Prodigy 1.11 (currently the "nightly"). In previous versions of spaCy (2.x) and Prodigy (1.10.x), there were similar Transformer models though, like en_trf_robertabase_lg (cf https://v2.spacy.io/models/en-starters).

Thanks. What's the difference between the two models, en_core_web_trf and en_trf_robertabase_lg?
Also, how long does the approval process for the Prodigy nightly application usually take?

The en_core_web_trf pipeline is a trained pipeline for spaCy v3 that includes all core components and is initialised with transformer weights. You can read more about it in the English models documentation: https://spacy.io/models/en

en_trf_robertabase_lg was a package we created for spaCy v2 and spacy-transformers that pre-packaged the RoBERTa weights. It's not relevant in spaCy v3 anymore, where you can initialise your pipeline with pretty much any pretrained transformer.
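
For example, a spaCy v3 config can swap in a different pretrained transformer just by changing the name (a sketch; roberta-base is one of many options):

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"
tokenizer_config = {"use_fast": true}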

It shouldn't take longer than a day or two (unless you submitted the form on the weekend). If it's been longer, feel free to re-submit the form, and make sure you've included the correct order ID and that your order includes the latest version of Prodigy (i.e. that upgrades haven't expired).

I have applied for the Prodigy nightly program. When may I expect a confirmation email?

Hi! See my comment above.

Email received, thanks!


Hello, I am trying to train an NER model and I get the following error:

python -m prodigy train --ner ner_dataset --base-model en_core_web_trf model_output_trf

 ⚠ Aborting and saving the final best model. Encountered exception:
TypeError("'FullTransformerBatch' object is not iterable")
Traceback (most recent call last):
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/prodigy/__main__.py", line 54, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/prodigy/recipes/train.py", line 244, in train
    return _train(
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/prodigy/recipes/train.py", line 172, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/training/loop.py", line 224, in train_while_improving
    score, other_scores = evaluate()
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/training/loop.py", line 281, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/language.py", line 1377, in evaluate
    for doc, eg in zip(
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy/util.py", line 1507, in raise_error
    raise e
  File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
  File "spacy/pipeline/tagger.pyx", line 111, in spacy.pipeline.tagger.Tagger.predict
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/borlogh/anaconda3/envs/fb_spacy31/lib/python3.8/site-packages/spacy_transformers/layers/trfs2arrays.py", line 23, in forward
    for trf_data in trf_datas:
TypeError: 'FullTransformerBatch' object is not iterable

I was able to fix the problem by modifying the file spacy_transformers/layers/trfs2arrays.py and adding the following code:

def forward(model: Model, trf_datas: List[TransformerData], is_train: bool):
    pooling: Model[Ragged, Floats2d] = model.layers[0]
    grad_factor = model.attrs["grad_factor"]
    outputs = []
    backprops = []

    # NEW CODE - BEGIN
    if not isinstance(trf_datas, list):
        trf_datas = trf_datas.doc_data # FullTransformerBatch -> List[TransformerData]
    # NEW CODE - END

    for trf_data in trf_datas:
        if len(trf_data.tensors) > 0:
            t_i = find_last_hidden(trf_data.tensors)
            tensor_t_i = trf_data.tensors[t_i]
            if tensor_t_i.size == 0:
                # ... (rest of the original function unchanged)

What do you think the real problem might be?

I am using Prodigy 1.11.0a8 and Python 3.8.10.

Hi!

Which spaCy version do you have, and could you try upgrading? This error message reminds me of a bug that was (hopefully) fixed earlier.

I tried it with spaCy 3.0.6 and 3.1, and I get the same error.

I installed the latest nightly for Linux (prodigy-1.11.0a8-cp36.cp37.cp38.cp39-cp36m.cp37m.cp38.cp39-linux_x86_64.whl) in a clean Python 3.8.6 virtualenv.

First, the installer reported the following version conflicts:

fastapi 0.66.0 requires starlette==0.14.2, but you'll have starlette 0.13.8 which is incompatible.
typer 0.3.2 requires click<7.2.0,>=7.1.1, but you'll have click 8.0.1 which is incompatible.
spacy 3.0.6 requires pydantic<1.8.0,>=1.7.1, but you'll have pydantic 1.8.2 which is incompatible.

In particular, click==8.0.1 results in an exception

ModuleNotFoundError: No module named 'click._bashcomplete'

when running the prodigy command.

I resolved the issue by manually installing:

pip install click==7.1.*

and, for completeness:

pip install pydantic==1.7.*

Finally, fastapi 0.65.0 is the version that introduced the requirement for starlette==0.14.2 (see the FastAPI release notes). This conflicts with Prodigy's own pin:

prodigy 1.11.0a8 requires starlette<0.14.0,>=0.12.9, but you'll have starlette 0.14.2 which is incompatible.
fastapi 0.66.0 requires starlette==0.14.2, but you'll have starlette 0.13.8 which is incompatible.

I backed this out to:

pip install fastapi==0.64.*
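
For reference, the combined set of pins as a single command (this reflects my environment, not official requirements):

pip install "click==7.1.*" "pydantic==1.7.*" "fastapi==0.64.*"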

It looks really good! Other than click, I'm not sure I needed to downgrade the other packages.

Did you have anything installed in your environment previously? I just updated the pydantic pin of Prodigy to match spaCy's, but aside from that, all dependencies install and resolve fine in isolation in our CI builds. (But with the new pip resolver, it's definitely possible to end up with conflicts if there's something else installed in the environment that depends on other versions of those packages.)

Just released a new nightly v1.11.0a10 that includes the following updates:

  • improved support for updating from binary annotations, especially those created with ner.teach
  • ner.teach will now also ask about texts with no entities – so if a text doesn't include any suggestions, you can accept it if it indeed has no entities and reject it if it does contain entities of the given label(s)
  • support for providing --spancat datasets for training spaCy v3.1's new SpanCategorizer in prodigy train (with an auto-generated suggester function)
  • support for validating spans created in spans.manual against a suggester function
  • support for custom config or base model in prodigy train and data-to-spacy
  • support for providing --textcat and --textcat-multilabel (non-exclusive categories, including binary annotations) separately to prodigy train and data-to-spacy
  • sent.teach and sent.correct recipes for improving a sentence recognizer, and support for --senter annotations in prodigy train and data-to-spacy
  • textcat.correct for correcting an existing text classifier
  • "_timestamp" property added to all created annotations reflecting the time the annotation was submitted in the UI
  • progress command for viewing annotation progress over time
  • ARM wheels
  • use the -F flag to pass in one or more comma-separated Python files to import from across all recipes. This lets you provide the recipe function, but also custom registered functions for spaCy configs (e.g. in prodigy train; see the sketch after this list)
  • fixes for various bugs introduced in the previous nightlies
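
For illustration, here's a hypothetical functions.py that could be passed via -F, registering a custom suggester that proposes whole sentences as candidate spans. All names here are illustrative (a config using it would reference @misc = "sentence_suggester.v1"), and the docs need sentence boundaries set, e.g. by a senter or parser:

from typing import Iterable, Optional

from spacy import registry
from spacy.tokens import Doc
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@registry.misc("sentence_suggester.v1")
def build_sentence_suggester():
    def suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # one [start, end] token offset pair per candidate span
        lengths = []  # number of candidate spans per doc
        for doc in docs:
            count = 0
            for sent in doc.sents:  # requires sentence boundaries
                spans.append([sent.start, sent.end])
                count += 1
            lengths.append(count)
        if spans:
            data = ops.xp.asarray(spans, dtype="int32")
        else:
            data = ops.xp.zeros((0, 2), dtype="int32")
        return Ragged(data, ops.xp.asarray(lengths, dtype="int32"))

    return suggester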

Btw, after downloading and extracting the zip containing the wheel files, you can also run the following to automatically select the best-matching wheel for your platform:

pip install prodigy -f /path/to/wheels