✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more

It indeed seems to be working with a CNN model :thinking:

Thanks for trying this out @tristan19954, that's really good to know. I still think that mixing in the original annotations would be the best solution overall, to prevent overfitting.

The other thing I want to point out is that you could also use prodigy data-to-spacy to create the data files and the config file on disk, and then run spacy train directly. The added advantage is that you get direct access to training parameters like training.optimizer.learn_rate and training.batcher in the config.cfg. It might make sense to experiment more with these settings to make them work better for the transformer-based pipelines. We'll also have a look at this internally to see whether we can change the defaults slightly so they work better for these models in general. But the best settings may always depend a bit on the data...
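
For reference, a rough sketch of that workflow (the dataset name and paths here are placeholders, and the exact keys available depend on the config that gets generated):

# Export the annotations plus an auto-generated config.cfg to a directory
python -m prodigy data-to-spacy ./corpus --ner your_ner_dataset

# Adjust [training.optimizer] and [training.batcher] in ./corpus/config.cfg as needed, then:
python -m spacy train ./corpus/config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy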

[EDIT]: Just to add to that, you can also run spacy init config --gpu, which will create a config for a transformer-based pipeline, and copy the training settings from there. Those defaults will be slightly different from what Prodigy uses internally. Alternatively, you can even provide a base config to prodigy train with the -c parameter, and it'll take the settings from that.
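
For example (the output paths and dataset name are placeholders):

# Generate a transformer-based config and copy or compare its [training] settings
python -m spacy init config ./trf_config.cfg --lang en --pipeline ner --gpu

# Or pass it to Prodigy directly as the base config
python -m prodigy train ./output --ner your_ner_dataset -c ./trf_config.cfg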


Is there a way of annotating NER spans using "layers" or "namespaces"? Whilst the UI for overlapping spans is excellent, I generally find it easier from a UI perspective to separate different NER span annotations into different layers: annotations on different layers can overlap, but annotations within a layer shouldn't. Each layer has its own set of labels - perhaps one layer is high-level features and another low-level features, for example.

Otherwise it gets quite hard to actually annotate. Plus, when you're trying to decode the spans produced by a model, you might want to do so by layer as well - i.e. pick the most likely set of non-overlapping high-level features using a beam search, then do so again for the next layer down.

For now all the span annotations appear in the same json array from what I can see.

So in that case, what you describe would be more along the lines of having multiple ner_manual interfaces within the same UI? This is definitely something we want to support, by letting blocks customize the spans they read from and write to. So you can annotate the text multiple times with non-overlapping spans, and have the result saved to different keys.

The only part that's then tricky to support out-of-the-box is the conversion of the data for training, which you would probably want to do in a separate post-processing step. But we could potentially also support an integration with spaCy's SpanCategorizer – so you'd be able to specify the span keys in the data (e.g. "spans_level1", "spans_level2"), and those would be interpreted as separate SpanGroups the span categorizer can predict and add to the doc.spans.
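
To sketch how that could look on the spaCy side (this just illustrates doc.spans and span groups, not the Prodigy feature itself; the group names and labels are made up):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog")
# Each annotation "layer" becomes its own group under doc.spans:
# spans can overlap across groups, but stay non-overlapping within one group
doc.spans["level1"] = [Span(doc, 1, 4, label="DESCRIPTION")]
doc.spans["level2"] = [Span(doc, 3, 4, label="ANIMAL"), Span(doc, 8, 9, label="ANIMAL")]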

Hi @ines, I wonder if your team is still on track to release 1.11 this week. How will we know once it's available? Thanks :'D

Hi, a few days ago I installed the latest nightly (currently on 1.11.0a11) but had a few questions about whether I was setting up the new training correctly, since I get different results/output than with Prodigy 1.10.8 (using spaCy v2). Would be great to get your view!

  1. When training on the same dataset and using the same base model in the nightly, the resulting model uses a different tokenizer. Specifically, it seems to default to the infix regex that splits on hyphens (the default spaCy tokenizer), while the version trained with 1.10.8 does not. I fixed this by commenting out a line in spacy/lang/punctuation.py. Is this a consequence of my mis-specification (e.g. should I have added a config override), or is it a changed default or otherwise a bug?

  2. This is probably caused by me not understanding how the new spaCy pipelines work, but if I want to only train NER (similar to the simple "prodigy train ner" in the old version) and use "prodigy train -n nas_ner_Gold_1 -m en_core_sci_lg -V -LS", I get output with columns for the frozen pipeline components (e.g. tagger, lemma, tok2vec) where all scores are 0. The score at the end is also very low, since it seems to average over pipeline components containing many 0s. Possibly linked to this, training also seems quite a lot slower than with the old version. I'm using a simple gold-annotated dataset (no combining with binary data yet, that will come later and is why I downloaded the nightly in the first place). See output below:

  3. I installed CUDA and all dependencies, but when I run with -g 0 on a normal NER training task (using en_core_sci_lg from SciSpacy, no transformer model), training does not seem that much faster, and GPU utilization is only around 10%. Am I specifying the command correctly, or is it implicit that using the GPU is only more efficient for certain training tasks, e.g. transformer models? I also noticed in the resulting .cfg file that gpu_allocator = null under [system]. If so, how would I go about this - just run the same training but with en_core_web_trf or some other transformer base model?

Thanks a lot for your help!

Did you use a custom tokenizer when you trained with v1.10.8? There haven't been any changes to those particular tokenization rules between spaCy v2 and spaCy v3 as far as I can tell, so I'm not sure why you would be seeing a difference here. That said, if you do want to change the tokenization rules, you can do this by modifying the rules in your base model, saving it out and training with that. The changes should then be reflected in the final model. You just need to double-check that this doesn't cause any of your created annotations to be misaligned.
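
If you do go that route, here's a rough sketch of what it could look like (the base model name, the filter heuristic and the output path are just examples, so double-check the resulting tokenization on your own data):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_sci_lg")  # your base model
# Drop the infix rule that splits on hyphens between letters (quick heuristic:
# filter out the pattern containing the hyphen character class), then rebuild
# the tokenizer's infix matcher from the remaining rules
infixes = [pattern for pattern in nlp.Defaults.infixes if "-|–|—" not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
nlp.to_disk("./base_model_no_hyphen_split")  # then train with this directory as the base model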

By default, frozen components are included in the evaluation, because you often want to evaluate them even if they're not updated, so that their scores are reflected in the final score. However, if your data doesn't include any annotations for those components, the scores will of course be 0.

The final score is based on the score_weights defined in the config (see "Training Pipelines & Models" in the spaCy usage documentation). This lets you control which model you consider the "best", and makes it easy to optimise for different combinations (e.g. prefer a model with higher recall over higher precision). If you don't want to include certain scores, you can set their weights to null in the config.
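
For example, in the config this could look something like the following (the exact score names depend on the components in your pipeline):

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
tag_acc = null
lemma_acc = null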

This definitely brings up a good point: when Prodigy auto-generates the config, it should probably set the scores for frozen components to null by default, because in this case, we know that there's no data for them and there's no point in including them in the evaluation. So we'll update this!

Are you auto-generating the config with Prodigy? And if so, could you try specifying a transformer-based config explicitly using the --config argument, or training with spaCy directly? We're considering making an update to the auto-generation here that copies over the [training] block from the base model as well, since this typically includes relevant settings that Prodigy should respect.
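
For instance, assuming you've exported your data with data-to-spacy into ./corpus (the paths are placeholders), training with spaCy directly on the GPU could look like this:

python -m spacy init config ./trf_config.cfg --lang en --pipeline ner --gpu
python -m spacy train ./trf_config.cfg --output ./output --gpu-id 0 \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy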

I just wanted to report that I'm annotating with spaCy 3.1.1 and its new grc support, and the process is flawless. It is really a pleasure to use Prodigy. I made about 500 annotations in an afternoon, and there was not a single problem.


Update: Prodigy v1.11 is out now! :tada::raised_hands: Thanks to the 300+ (!) nightly testers – your feedback was incredibly valuable. The new release includes tons of new features and improvements – check out the release notes here: https://prodi.gy/docs/changelog

Also see the installation docs for how to install Prodigy directly from PyPI using your license key, for the smoothest and most convenient installation :fire: https://prodi.gy/docs/install


A post was split to a new topic: Using Prodigy PyPi server in requirements.txt or as index URL

Hi, I would like to use the --show-plot option when running the train-curve diagnostic. I have installed plotext (v4.1.5) and am running Prodigy v1.11.7. However, when I run

python -m prodigy train-curve --ner sents_annotations --base-model ./embeddings/Spacy3_FastText --eval-split 0.2 --show-plot

I get the following message:

✘ Train curve plots require the plotext library
pip install plotext

I've deactivated and then reactivated my environment after installing plotext, too.

Thanks,

Darren

Are you sure you've installed it in the same environment? Maybe your pip and python point to different envs? You can check this by running which pip and which python. The error message is shown if import plotext fails, so you can also try the import in your Python interpreter manually and make sure it's found.
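
For example:

# On macOS/Linux (on Windows, use "where" instead)
which python
which pip
# Check that the module resolves from the same interpreter
python -c "import plotext; print(plotext.__file__)"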


Thanks Ines. Plotext was actually installed in the right env. However, when I tried import plotext in my python interpreter, I got the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\__init__.py", line 3, in <module>
    from plotext._core import *
  File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_core.py", line 8, in <module>
    from plotext._utility.plot import terminal_size as _terminal_size
  File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_utility\plot.py", line 2, in <module>
    from plotext._utility.marker import sum_markers, refine_marker, space_marker, side_symbols
  File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_utility\marker.py", line 3, in <module>
    from plotext._utility.platform import _platform
  File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_utility\platform.py", line 24, in <module>
    import win_unicode_console
ModuleNotFoundError: No module named 'win_unicode_console'

After installing the win_unicode_console package, plotext started working.

Thanks,

Darren :slight_smile: