✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more

Hello, thank you for giving me access to Prodigy nightly – it really corresponds to what I was searching for! Quick question: is it no longer necessary to add the --binary flag at the end of the training command when doing ner.teach? I don't see it when checking train --help.

Hi Ines,

I am using pip, and I think you're right: in my case, pip's dependency resolver is pulling in spaCy v3. This is a strange situation for me.

But after installing the nightly, the system is working fine. I did find a problem, though: I put "html_template": false into my "prodigy.json" file, and then noticed that my annotation UI was coming up empty.


After that I changed "prodigy.json" to use "html_template": "{{text}}", and the annotator started showing the keys again.

Could you have a look and let me know what is wrong here?

Thanks,
Debo

Yes, that's the idea – although, it's one of the few outstanding updates we still need to make to spaCy / Prodigy for this stable release. So at the moment, you may be seeing worse results when updating from binary annotations only.

Ah, that's interesting. One possible explanation is that you ended up in a weird environment state with some leftover dependencies installed that are only compatible with spaCy v3 (but not immediately obvious). So the dependency resolver will try to find the next possible version that's compatible with whatever is in your environment, and that ends up being spaCy v3.

You shouldn't have to put that in your prodigy.json – this will override the html_template setting for all recipes, which is typically not what you want (and Prodigy should also show you a warning in the terminal if that happens). That's also why the sense2vec.teach UI is empty, because it's overriding the existing HTML template it relies on to render the suggestions. If you take it out, it should use the template defined by the recipe.
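To make the scope concrete (a sketch, not from this thread): anything set at the top level of prodigy.json applies globally, so an entry like the following would override the HTML template for every recipe, including sense2vec.teach:

```json
{
  "html_template": "{{text}}"
}
```

Recipe-specific display settings belong in the config the recipe itself returns, which leaves other recipes free to use their own templates.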

Hi there, prodigy folks!
I tried to prepare data from prodigy (nightly) with the data-to-spacy command (since I wanted to train with spacy 3).

I get the following error (last line of traceback):
ValueError: Can't read file: assets/orth_variants.json

It seems to happen during the generation of the cached label data:

:information_source: Using base model 'de_core_news_md'

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components

  • [ner] Training: 789 | Evaluation: 197 (20% split)
    Training: 789 | Evaluation: 197
    Labels: ner (1)
  • [ner] MYLABEL
    :heavy_check_mark: Saved 789 training examples
    TESTOUT/train.spacy
    :heavy_check_mark: Saved 197 evaluation examples
    TESTOUT/dev.spacy

============================= Generating config =============================
:information_source: Using config from base model
:heavy_check_mark: Generated training config

======================== Generating cached label data ========================
....

ValueError: Can't read file: assets/orth_variants.json

I know that these variants play a role in the augment option of spaCy v3 config files for training, and it's maybe invoked since I'm using the pretrained de_core_news_md model as base, but I have no clue how to get around this error.

The full command I used is:
pgy data-to-spacy --lang de --ner test_de --eval-split 0.2 -m de_core_news_md \
--verbose TESTOUT

Although I get the train.spacy and dev.spacy (for which I can now create a config to train spacy3 via other routes), I wonder why the orth_variants.json error turns up and how to avoid that.

Ah, it seems like this is another and probably more common edge case in Prodigy's logic for auto-generating a config from a base model (it also just came up in this thread): it currently copies the entire config, including the initialization settings that only run before training. Every model's config records those settings so you know exactly how the artifact was created – but the settings may refer to external files or code that's not required at runtime and not necessarily included. So what you're getting here is the exact config of the base model, including references to files that only existed in the original training environment.

This is a tricky problem and I need to think about how we can best solve it :thinking: On the one hand, using a base model should give you the exact same config settings so you can train with the same configuration as the original pipeline. On the other hand, we need to guard against missing resources and references, because otherwise, using a base model will likely fail 80% of the time. But we also can't make any assumptions about the model initialization settings, because those could be anything (especially since we also want to support third-party pipelines like scispaCy etc.) :thinking::thinking::thinking:

In the meantime, you can find the orth variants here: https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json You can probably also remove this part because it's mostly an extra for data augmentation but not necessarily required if you're updating the model with more data.
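For anyone hitting the same error: in the generated config.cfg, the orth-variants reference typically sits in the training corpus augmenter block. A hedged sketch of disabling it (section and key names are taken from spaCy v3's default config; your generated config may differ):

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
# Replace the [corpora.train.augmenter] block that points to
# assets/orth_variants.json with no augmentation at all:
augmenter = null
```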

Hello! I am currently using v1.11.0a5 and I really like the UI to annotate spans (spans.manual).
I tried to use the review recipe for a dataset created with spans.manual, but it does not seem to be supported. Could you please confirm that?

Annotation using spans.manual:

Review (tried with and without --view-id spans_manual):

Thanks

Hello,

I was previously using prodigy-1.10.7, no complaints, and found it super useful!

I recently updated to nightly prodigy-1.11.0a6 and am having issues running prodigy train in my command line. Even when running:

prodigy train --help

I get the following:
argparse.ArgumentError: argument -t/--senter: conflicting option string: -t

I'm new to Prodigy, so it could be something I have done – but I wasn't having any issues until the update, and I don't think I'm setting any -t option.

Thanks in advance for any help!

Ah, interesting! It should be supported (at least that's what's intended), but since it's a new interface, it's possible that we forgot to add or update something related to the review UI. I'll look into this!

Sorry, this was a bug in the very latest nightly build! Also see here: ArgumentError for prodigy train on v1.11.0a6 - #3 by ines Just building new wheels and will release a new nightly with the fix shortly :slightly_smiling_face: Edit: Done and fixed in v1.11.0a7!

Just applied for the nightly program! :crossed_fingers:

I'm curious if there are any thoughts about integration with Streamlit? We really like building workflow-specific tools in Streamlit and would be interested in embedding parts of Prodigy in Streamlit reports – i.e., we'd pay for a Prodigy module that provided Streamlit components.

Hi! I was trying the spans.manual recipe in v1.11.0a7 and it shows the ner.manual recipe instead. I had to go back to v1.11.0a5 to get the spans.manual recipe to work.

Ah, it's interesting you mention this because I've actually been thinking about this for a while! I've been meaning to look into the custom components again now that the feature is more stable. One aspect that will make the Prodigy components a bit more involved is that Prodigy is inherently stateful and a lot of the most relevant logic happens on the back-end. But I think we can come up with a solution. Definitely good to know that you'd be interested in the feature so I can reach out once there's something to beta test :smiley:

Thanks for the report, that's strange! I think there must have been a problem with compiling the nightly. Just triggered a new build and tested it locally and it fixed the issue. Will release v1.11.0a8 shortly!

Hello,

A question about expected behaviour and updating existing models.

Prior to using the nightly build I was using prodigy ner.teach method to annotate "edge cases" and found this super useful. I then went on to use prodigy train with a base model and it updated an existing spacy model I was using. The result was that I got most of the entities from the spacy model but with more of the entities I cared about (and more accurate tagging of them). It worked great.

Using the nightly version, I changed my approach to the config-based training, and I have effectively lost all of the benefits of using an existing spaCy model, with my new model severely overfitting to these edge-case examples. Is this expected? Can you advise me on what I might focus on changing in the config to more closely replicate the results from the original prodigy train with a base model?

Thanks for all your work!

At the moment, we do expect updating from binary annotations to produce worse results with the nightly – sorry about that! There are a few small changes we still need to get into spaCy to allow updating the components with all the constraints defined by the binary answers. This is most likely the effect you're seeing here, and it's the main update we still need before we can release the stable version of Prodigy for spaCy v3 :slightly_smiling_face:

Thanks for the response! I am excited to play with it when it's ready.

Is it possible to use a serialized DocBin from spaCy 3 as the source for the recipes, or do I need to write a custom loader for Prodigy to read .spacy files?

I hadn't considered this yet, but it's a good idea! :smiley: I'll put this on my list of enhancements for the upcoming version. In the meantime, it should hopefully be pretty straightforward to implement your own loader for this.

For the built-in loader, we might have to make this a separate command that outputs the JSON you can pipe forward, or come up with some special syntax so you can specify which annotations to extract from the Doc objects. For example, in ner.manual you'll (likely) want the doc.ents to be added to the "spans", but in another recipe, you might want to use the part-of-speech tags instead, etc.
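In the meantime, a custom loader along these lines should work – a minimal sketch, assuming the annotations you want are the doc.ents (the function name and the exact task fields here are illustrative; adjust them to what your recipe expects):

```python
import spacy
import srsly  # ships with spaCy/Prodigy
from spacy.tokens import DocBin


def docbin_to_tasks(path, nlp):
    """Yield Prodigy-style task dicts from a serialized .spacy DocBin."""
    doc_bin = DocBin().from_disk(path)
    for doc in doc_bin.get_docs(nlp.vocab):
        yield {
            "text": doc.text,
            "spans": [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ],
        }


# Example: dump tasks as JSONL so they can be piped into a recipe
# (srsly.write_jsonl with "-" prints to stdout):
# srsly.write_jsonl("-", docbin_to_tasks("./train.spacy", spacy.blank("en")))
```

The resulting JSONL stream can then be used as the source for a recipe like ner.manual or textcat.manual.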

I trained a custom tagger model with spaCy 3.0.5 and Prodigy 1.11.0a8, using the same dataset in JSON format that I used for training a custom model with spaCy v2. Although token.tag_ returns tags, token.pos_ returns nothing:

```python
doc = nlp("I am hungry.")
print([token.tag_ for token in doc])
# ['PRP', 'VBP', 'JJ', '.']

print([token.pos_ for token in doc])
# ['', '', '', '']
```

How can I get POS labels?

Thanks in advance.
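For context on the mechanism (an editorial sketch, not confirmed for this particular model): in spaCy v3, token.pos_ is only filled in by a component that sets it, such as a morphologizer or an attribute_ruler with a tag map; the tagger alone only sets token.tag_. A minimal illustration of mapping fine-grained tags to coarse POS with an attribute_ruler (the tag map below is deliberately tiny and illustrative):

```python
import spacy

# Blank pipeline with only an attribute_ruler, to demonstrate the mapping
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Map fine-grained tags (token.tag_) to coarse-grained POS (token.pos_)
ruler.load_from_tag_map({
    "PRP": {"POS": "PRON"},
    "VBP": {"POS": "VERB"},
    "JJ": {"POS": "ADJ"},
    ".": {"POS": "PUNCT"},
})

doc = nlp.make_doc("I am hungry .")
# Simulate what a trained tagger would assign
for token, tag in zip(doc, ["PRP", "VBP", "JJ", "."]):
    token.tag_ = tag
doc = ruler(doc)
print([token.pos_ for token in doc])
```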

Prodigy is such a friendly and well-thought-out tool. I'd like to try the nightly version, as I want to use spaCy v3. Could you please advise me on how to get the download link? I signed up for the program but have not received an automated email yet.

I have a spaCy DocBin of annotated sentence boundaries. How do I use that corpus as source data for textcat or ner annotation in Prodigy? Do I need to write a custom loader for spaCy 3 DocBin objects saved on disk?

Hi! Did you succeed in using en_vectors_web_lg? It seems that it is no longer available for spaCy v3.