✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more

@tristan19954: thanks for sharing these results! It would be good to get to the bottom of this.

It looks like there are 254 instances in the training dataset for 11 NER labels. This might be a bit too few, depending on how close your new annotations are to what the model was originally trained on. You don't have a separate evaluation dataset, so 63 cases were selected at random for evaluation, but those 63 might not be well represented by the 254 training instances. Again, this depends a bit on the variability of your training dataset. A way to test this is to run an artificial experiment with -n ner_teach_july,eval:ner_teach_july, which will effectively train AND evaluate on the same dataset. You typically want to avoid this, but it's a useful check to make sure the training mechanism works as expected.
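
For example (the output directory name here is just a placeholder):

python -m prodigy train -n ner_teach_july,eval:ner_teach_july ./sanity_check_model

If the scores on this run come out high, the training mechanism itself is working, and the problem is more likely in the data split or the amount of data.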

An important point is that when training on the ner_teach_july dataset, the model might start "forgetting" about previously learned instances and start overfitting on this dataset. With the Prodigy nightly you should be able to prevent this by feeding in an additional NER dataset, so you can train on the "teach" dataset and another dataset simultaneously. Ideally you'd have a separate evaluation dataset that you've used both to analyse the original performance and the performance after training on the "teach" dataset (rather than the random split used here).
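
For instance, assuming your original annotations live in a dataset called my_original_ner (a made-up name here), you could pass both datasets to the same component:

python -m prodigy train -n ner_teach_july,my_original_ner ./model_output_mixed

This way the model keeps seeing examples of what it learned before, which counteracts the forgetting effect.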

@SofieVL Thank you for the answer!

I stopped around 300 because after that I was getting similar problems to what I reported in a previous post back in April. It seems to be working much longer/better now, but after 320 it started to just select the whole sentence and highlight that as an entity, which is why I tried to update the model at that point.

I will try out what you suggested and keep you updated! I still have access to my original training and evaluation data, so I will try mixing that in too.

Strange behavior has been happening since I updated Prodigy from 1.11.0a8 to 1.11.0a10: it is using more GPU memory. I ran the exact same command in both environments and got these results:

python -m prodigy train --ner fbsio-140721 --base-model en_core_web_trf --gpu-id 0 model_output_140721

1.11.0a8

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0    41W /  70W |  10146MiB / 15109MiB |     87%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2910      C   python                          10143MiB |
+-----------------------------------------------------------------------------+

1.11.0a10

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    31W /  70W |  15072MiB / 15109MiB |     80%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2451      C   python                          15069MiB |
+-----------------------------------------------------------------------------+

I detected this because I ran out of memory:

⚠ Aborting and saving the final best model. Encountered exception:
RuntimeError('CUDA out of memory. Tried to allocate 264.00 MiB (GPU 0; 14.76 GiB
total capacity; 11.06 GiB already allocated; 37.75 MiB free; 11.91 GiB reserved
in total by PyTorch)')

Is this difference in memory consumption normal?

Hey team, I already applied twice for the nightly program, but haven't received an email yet. Any suggestions?

@SofieVL The same issue persists when using -n ner_teach_july,eval:ner_teach_july

For comparison, have you tried the same training run with a non-transformer model? I wonder if this could be related to the transformer being less sensitive to these types of small and very sparse updates :thinking:

Could you send us an email to contact@explosion.ai and include your order ID? Then we can look into this internally :slightly_smiling_face:

Hi, I'm new to Prodigy. Can I use Prodigy in production with spaCy v2 to generate output that is then used by spaCy v3?

You can use Prodigy v1.10 (latest stable version) with spaCy v2 and export your annotations with data-to-spacy. In spaCy v3, you can convert this data to spaCy v3's new format with spacy convert and then use it to train a spaCy v3 model. You can also apply for the nightly (see first post above), which uses spaCy v3 by default.
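
A rough sketch of that workflow, with placeholder dataset and file names (check prodigy data-to-spacy --help for the exact arguments in your version):

# in the Prodigy v1.10 / spaCy v2 environment
python -m prodigy data-to-spacy ./annotations.json --lang en --ner my_ner_dataset

# in the spaCy v3 environment
python -m spacy convert ./annotations.json ./corpus

The resulting .spacy files in ./corpus can then be plugged into a spaCy v3 training config via the paths.train and paths.dev settings.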

Thanks, I applied for nightly one or two hours ago but I haven't received anything yet.

It indeed seems to be working with a CNN model :thinking:

Thanks for trying this out @tristan19954, that's really good to know. I still think that mixing in the original annotations would be the best solution overall, to prevent overfitting.

The other thing I want to point out is that you could also use prodigy data-to-spacy to create the data files and configuration file on disk, and then run spacy train directly. The added advantage is that you get direct access to training parameters like training.optimizer.learn_rate and training.batcher in the config.cfg. It might make sense to experiment more with these settings to make them work better for the transformer-based pipelines. We'll also have a look internally at whether we can change the defaults slightly to work better for these models in general. But the best settings may always depend a bit on the data...
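
As a sketch, assuming the nightly's data-to-spacy recipe (output paths are placeholders):

python -m prodigy data-to-spacy ./corpus --ner ner_teach_july
python -m spacy train ./corpus/config.cfg --output ./model_output --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Between the two commands, you can edit the [training.optimizer] and [training.batcher] blocks in ./corpus/config.cfg to experiment with different settings.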

[EDIT]: just to add to that, you can also run spacy init config --gpu, which will create a config for a transformer-based pipeline, and copy the training settings from there. Those defaults will be slightly different from what Prodigy uses internally. Alternatively, you can even provide a base config to prodigy train with the -c parameter, and it'll take the settings from that.
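
For example (paths here are illustrative):

python -m spacy init config ./config.cfg --lang en --pipeline ner --gpu
python -m prodigy train -n ner_teach_july -c ./config.cfg ./model_output

The first command writes out a transformer-based config you can inspect or edit; the second tells Prodigy to take its settings from that file.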

Is there a way of annotating NER spans using "layers" or "namespaces"? Whilst the UI for overlapping spans is excellent, I generally find it's easier from a UI perspective to separate different NER span annotations into different layers: annotations on different layers can overlap, but annotations within a layer shouldn't. Each layer has its own set of labels; perhaps one layer is high-level features and another low-level features, for example.

Otherwise it gets quite hard to actually annotate. Plus, when you're trying to decode the spans produced by a model, you might want to do so by layer as well, i.e. pick the most likely set of non-overlapping high-level features using a beam search, then do so again for the next layer down.

For now, all the span annotations appear in the same JSON array, from what I can see.

So in that case, what you describe would be more along the lines of having multiple ner_manual interfaces within the same UI? This is definitely something we want to support, by letting blocks customize the spans they read from and write to. So you can annotate the text multiple times with non-overlapping spans, and have the result saved to different keys.

The only part that's then tricky to support out of the box is the conversion of the data for training, which you would probably want to do in a separate post-processing step. But we could potentially also support an integration with spaCy's SpanCategorizer, so you'd be able to specify the span keys in the data (e.g. "spans_level1", "spans_level2"), and those would be interpreted as separate SpanGroups the span categorizer can predict and add to the doc.spans.
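
To sketch what that could look like on the spaCy side (the span key names and labels here are made up, and this is not a finished Prodigy feature):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog")

# Hypothetical per-layer span groups, as they might be produced by
# separate annotation blocks writing to different keys
doc.spans["spans_level1"] = [Span(doc, 0, 4, label="PHRASE")]
doc.spans["spans_level2"] = [Span(doc, 1, 3, label="MODIFIER")]

# Spans in different groups may overlap freely; a SpanCategorizer could
# be configured with spans_key="spans_level1" to predict one layer
for key in ("spans_level1", "spans_level2"):
    print(key, [(span.text, span.label_) for span in doc.spans[key]])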

Hi @ines, I wonder if your team is still on track to release v1.11 this week. How will we know when it's available? Thanks :'D

Hi, a few days ago I installed the latest nightly (currently on 1.11.0a11), but I had a few questions about whether I was setting up the new training correctly, since I get different results/output than with Prodigy 1.10.8 (using spaCy v2). Would be great to get your view!

  1. When training on the same dataset and using the same base model in the nightly, the resulting model uses a different tokenizer. Specifically, it seems to default to the infix regex that splits on hyphens (the default spaCy tokenizer), while the version trained with 1.10.8 does not. I fixed this by commenting out a line in spacy/lang/punctuation.py. Is this a consequence of my mis-specification (e.g. should I have added a config override), or is it a changed default or otherwise a bug?

  2. This is probably caused by me not understanding how the new spaCy pipelines work, but if I only want to train NER (similar to the simple "prodigy train ner" in the old version) and use "prodigy train -n nas_ner_Gold_1 -m en_core_sci_lg -V -LS", I get an output with columns for the frozen pipeline components (e.g. tagger, lemma, tok2vec) where all scores are 0. The score at the end is also very low, since it seems to average over pipeline components containing many 0s. Also, possibly linked to this, training seems quite a lot slower than with the old version. I'm using a simple gold-annotated dataset (no combination with binary annotations yet; that will come later and is why I downloaded the nightly in the first place). See output below:

  3. I installed CUDA and all dependencies, but when I run with -g 0 on a normal NER training task (using en_core_sci_lg from SciSpacy, no transformer model), training does not seem much faster, and GPU utilization is only around 10%. Am I specifying the command correctly, or is using the GPU only really beneficial for certain training tasks, e.g. transformer models? I also noticed in the resulting .cfg file that gpu_allocator = null under [system]. If the GPU mainly helps with transformers, how would I make use of it: just run the same training but with en_core_web_trf or some other transformer base model?

Thanks a lot for your help!

Did you use a custom tokenizer when you trained with v1.10.8? There haven't been any changes to those particular tokenization rules between spaCy v2 and spaCy v3 as far as I can tell, so I'm not sure why you would be seeing a difference here. That said, if you do want to change the tokenization rules, you can do this by modifying the rules in your base model, saving it out and training with that. The changes should then be reflected in the final model. You just need to double-check that this doesn't cause any of your created annotations to be misaligned.
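
A sketch of that approach, assuming it's the hyphen infix rule you want to remove (model and path names are placeholders):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_sci_lg")  # your base model

# Keep all default infix patterns except the one that splits on hyphens.
# The exact pattern string can differ between spaCy versions, so inspect
# nlp.Defaults.infixes to find the rule you want to drop.
infixes = [pattern for pattern in nlp.Defaults.infixes if "-|–|—" not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

nlp.to_disk("./base_model_custom_tok")
# then train with: prodigy train -n your_dataset -m ./base_model_custom_tok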

By default, frozen components are included in the evaluation, because you often want to evaluate them even if they're not updated, so that their scores are reflected in the final score. However, if your data doesn't include any annotations for those components, the scores will of course be 0.

The final score is based on the score_weights defined in the config: https://spacy.io/usage/training#metrics This lets you control which model you consider the "best", and makes it easy to optimise for different combinations (e.g. prefer a model with higher recall over higher precision). If you don't want to include certain scores, you can set their weights to null in the config.
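
For example, to exclude the tagger and lemmatizer scores from the final score (the exact weight names depend on your pipeline; see the metrics docs above):

[training.score_weights]
tag_acc = null
lemma_acc = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0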

This definitely brings up a good point: when Prodigy auto-generates the config, it should probably set the scores for frozen components to null by default, because in this case, we know that there's no data for them and there's no point in including them in the evaluation. So we'll update this!

Are you auto-generating the config with Prodigy? And if so, could you try specifying a transformer-based config explicitly using the --config argument, or training with spaCy directly? We're considering making an update to the auto-generation here that copies over the [training] block from the base model as well, since this typically includes relevant settings that Prodigy should respect.

I just wanted to report that I'm annotating with spaCy 3.1.1 and its new grc (Ancient Greek) support, and the process is flawless. It is really a pleasure to use Prodigy. I made about 500 annotations in an afternoon, and there was not a single problem.

Update: Prodigy v1.11 is out now! :tada::raised_hands: Thanks to the 300+ (!) nightly testers – your feedback was incredibly valuable. The new release includes tons of new features and improvements – check out the release notes here: https://prodi.gy/docs/changelog

Also see the installation docs for how to install Prodigy directly from PyPI using your license key for the smoothest and most convenient installation :fire: https://prodi.gy/docs/install
