It indeed seems to be working with a CNN model
Thanks for trying this out @tristan19954, that's really good to know. I still think that mixing in the original annotations would be the best solution overall, to prevent overfitting.
The other thing I want to point out is that you could also use `prodigy data-to-spacy` to create the data files and configuration file on disk, and with that you could run `spacy train` directly. The added advantage of that is that you'll get direct access to training parameters like `training.optimizer.learn_rate` and `training.batcher` in the `config.cfg`. It might make sense to experiment more with these settings to make them work better for the transformer-based pipelines. We'll have a look at this internally as well, whether perhaps we can change the defaults slightly to work better in general for these models. But the best settings may always depend a bit on the data...
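For example, the workflow could look roughly like this (a sketch: the dataset name `ner_dataset` and the paths are placeholders, and the config values to tweak are just examples):

```
# Export the annotations plus an auto-generated training config
python -m prodigy data-to-spacy ./corpus --ner ner_dataset --eval-split 0.2

# Adjust [training.optimizer] and [training.batcher] in ./corpus/config.cfg,
# then train with spaCy directly
python -m spacy train ./corpus/config.cfg \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy \
  --output ./output
```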
[EDIT]: just to add to that, you can also run `spacy init config --gpu`, which will create a config for a transformer-based pipeline, and copy the `training` settings from there. Those defaults will be slightly different than what Prodigy runs internally. Alternatively, you can even provide a base config to `prodigy train` with the `-c` parameter and then it'll take the settings from that.
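For instance, generating a transformer-based config and passing it to Prodigy might look like this (a sketch: the dataset name, output paths and language/pipeline are placeholders):

```
# Create a config that uses transformer weights (requires a GPU setup)
python -m spacy init config base_config.cfg --lang en --pipeline ner --gpu

# Train in Prodigy with that config as the base
python -m prodigy train ./output --ner ner_dataset --config base_config.cfg --gpu-id 0
```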
Is there a way of annotating NER spans using "layers" or "namespaces"? Whilst the UI for overlapping spans is excellent, I generally find it's easier from a UI perspective to separate different NER span annotations into different layers: annotations on different layers can overlap, but annotations within a layer shouldn't. Each layer has its own set of labels - perhaps one layer is high-level features and another low-level features, for example.
Otherwise it gets quite hard to actually annotate. Plus, when you're trying to decode the spans produced by a model, you might want to do so by layer as well - i.e. pick the most likely set of non-overlapping high-level features using a beam search, then do so again for the next layer down.
For now all the span annotations appear in the same JSON array from what I can see.
So in that case, what you describe would be more along the lines of having multiple `ner_manual` interfaces within the same UI? This is definitely something we want to support, by letting `blocks` customize the `spans` they read from and write to. So you can annotate the text multiple times with non-overlapping spans, and have the result saved to different keys.
The only part that's then tricky to support out-of-the-box is the conversion of the data for training, which you would probably want to do in a separate post-processing step. But we could potentially also support an integration with spaCy's `SpanCategorizer` – so you'd be able to specify the span keys in the data (e.g. `"spans_level1"`, `"spans_level2"`), and those would be interpreted as separate `SpanGroup`s the span categorizer can predict and add to `doc.spans`.
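That integration doesn't exist yet, but a rough sketch of what the post-processing could look like, assuming hypothetical annotation keys `"spans_level1"` and `"spans_level2"`:

```python
import spacy
from spacy.tokens import SpanGroup

nlp = spacy.blank("en")

# Hypothetical annotation record with two span "layers"
eg = {
    "text": "Acme Corp opened a new office in Berlin.",
    "spans_level1": [{"start": 0, "end": 9, "label": "ORG"}],
    "spans_level2": [{"start": 33, "end": 39, "label": "CITY"}],
}

doc = nlp(eg["text"])
for key in ("spans_level1", "spans_level2"):
    spans = [
        doc.char_span(s["start"], s["end"], label=s["label"])
        for s in eg.get(key, [])
    ]
    # Each layer becomes its own SpanGroup, so a span categorizer
    # configured with spans_key=key could predict that layer separately
    doc.spans[key] = SpanGroup(doc, name=key, spans=[s for s in spans if s is not None])

print({key: [(s.text, s.label_) for s in group] for key, group in doc.spans.items()})
```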
Hi @ines, I wonder if your team is still on track to release v1.11 this week. How will we know when it's available? Thanks :'D
Hi, a few days ago I installed the latest nightly (currently on 1.11.0a11), but I had a few questions about whether I was setting up the new training correctly, since I get different results/output than with Prodigy 1.10.8 (using spaCy v2). Would be great to get your view!
- When training on the same dataset and using the same base model in the nightly, the resulting model uses a different tokenizer. Specifically, it seems to default to the infix regex that splits on hyphens (the default spaCy tokenizer), while the version trained with 1.10.8 does not. I fixed this by commenting out a line in spacy/lang/punctuation.py. Is this the consequence of my mis-specification (e.g. should I have added a config override), or is it a changed default or otherwise a bug?
- This is probably caused by me not understanding how the new spaCy pipelines work, but if I want to only train NER (similar to the simple "prodigy train ner" in the old version) and use "prodigy train -n nas_ner_Gold_1 -m en_core_sci_lg -V -LS", I get an output with columns for the frozen pipeline elements (e.g. tagger, lemma, tok2vec) where all scores are 0. The score at the end is also very low, since it seems to average over the pipeline elements containing many 0's. Also, possibly linked to this, the training seems quite a lot slower than with the old version. I use a simple gold-annotated dataset (no new combination with binary, that will come later and is why I downloaded the nightly in the first place). See output below:
- I installed CUDA and all dependencies, but when I run -g 0 with a normal NER training task (using en_core_sci_lg from SciSpacy, no transformer model), training does not seem that much faster. Also, GPU utilization is just 10% or so. Do I specify the command correctly, or is it implicit that utilizing the GPU is only more efficient for certain training tasks, e.g. using transformer models? I also noticed in the resulting .cfg file that gpu_allocator = null under [system]. If so, how would I do this, just run the same training but with en_core_web_trf or some other transformer base model?
Thanks a lot for your help!
Did you use a custom tokenizer when you trained with v1.10.8? There haven't been any changes to those particular tokenization rules between spaCy v2 and spaCy v3 as far as I can tell, so I'm not sure why you would be seeing a difference here. That said, if you do want to change the tokenization rules, you can do this by modifying the rules in your base model, saving it out and training with that. The changes should then be reflected in the final model. You just need to double-check that this doesn't cause any of your created annotations to be misaligned.
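If it helps, here's a minimal sketch of that approach, adapted from the spaCy tokenizer docs (the model name and output path are placeholders, and the exact hyphen pattern may vary slightly between spaCy versions):

```python
import spacy
from spacy.lang.char_classes import ALPHA, HYPHENS
from spacy.util import compile_infix_regex

# Load the base model used for annotation (placeholder name)
nlp = spacy.load("en_core_sci_lg")

# Rebuild the infix rules without the pattern that splits on hyphens between letters
hyphen_infix = r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
infixes = [pattern for pattern in nlp.Defaults.infixes if pattern != hyphen_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("a well-known example")])  # "well-known" stays one token

# Save the modified pipeline and pass this directory as the base model for training
nlp.to_disk("./en_core_sci_lg_custom_tokenizer")
```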
By default, frozen components are included in the evaluation, because you often want to evaluate them even if they're not updated, so that their scores are reflected in the final score. However, if your data doesn't include any annotations for those components, the scores will of course be 0.
The final score is based on the `score_weights` defined in the config (see Training Pipelines & Models · spaCy Usage Documentation). This lets you control which model you consider the "best", and makes it easy to optimise for different combinations (e.g. prefer a model with higher recall over higher precision). If you don't want to include certain scores, you can set their weights to `null` in the config.
This definitely brings up a good point: when Prodigy auto-generates the config, it should probably set the score weights for frozen components to `null` by default, because in this case we know that there's no data for them and there's no point in including them in the evaluation. So we'll update this!
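For reference, the relevant section of the config could look something like this (a sketch: the exact score names depend on the components in your pipeline):

```
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
# Frozen components: exclude their metrics from the final score
tag_acc = null
lemma_acc = null
```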
Are you auto-generating the config with Prodigy? And if so, could you try specifying a transformer-based config explicitly using the `--config` argument, or training with spaCy directly? We're considering making an update to the auto-generation here that copies over the `[training]` block from the base model as well, since this typically includes relevant settings that Prodigy should respect.
I just wanted to report that I'm annotating with spaCy 3.1.1 and its new grc support, and the process is flawless. It is really a pleasure to use Prodigy. I made about 500 annotations in an afternoon, and there was not a single problem.
Update: Prodigy v1.11 is out now! Thanks to the 300+ (!) nightly testers – your feedback was incredibly valuable. The new release includes tons of new features and improvements – check out the release notes here: https://prodi.gy/docs/changelog
Also see the installation docs for how to install Prodigy directly from PyPI using your license key, for the smoothest and most convenient installation: https://prodi.gy/docs/install
Hi, I would like to use the --show-plot function while performing a train-curve diagnostic test. I have installed plotext (v 4.1.5) and am running Prodigy v 1.11.7. However, when I run
python -m prodigy train-curve --ner sents_annotations --base-model ./embeddings/Spacy3_FastText --eval-split 0.2 --show-plot
I get the following message:
✘ Train curve plots require the plotext library
pip install plotext
I've deactivated and then reactivated my environment after installing plotext, too.
Thanks,
Darren
Are you sure you've installed it in the same environment? Maybe your `pip` and `python` point to different envs? You can check this by running `which pip` and `which python`. The error message is shown if `import plotext` fails, so you can also check it in your Python interpreter manually and make sure it's found.
Thanks Ines. Plotext was actually installed in the right env. However, when I tried `import plotext` in my Python interpreter, I got the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\__init__.py", line 3, in <module>
from plotext._core import *
File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_core.py", line 8, in <module>
from plotext._utility.plot import terminal_size as _terminal_size
File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_utility\plot.py", line 2, in <module>
from plotext._utility.marker import sum_markers, refine_marker, space_marker, side_symbols
File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_utility\marker.py", line 3, in <module>
from plotext._utility.platform import _platform
File "C:\Projects\project-name\prodigy_env\lib\site-packages\plotext\_utility\platform.py", line 24, in <module>
import win_unicode_console
ModuleNotFoundError: No module named 'win_unicode_console'
After installing the `win_unicode_console` package, plotext started working.
Thanks,
Darren