Support for spaCy v2.1

(Matthew Honnibal) #2

Probably early next week. We’ve updated the code for it, and are doing some manual testing to make sure we don’t need to tweak any of the active learning heuristics for the new models, in case things changed.

We also wanted to make sure things were fully stable. With Prodigy it’s less convenient for users to install updates than it is for spaCy and other open-source libraries, so we’re a bit more cautious. v2.1 has been pretty well tested because it was on nightly for so long, but we still want to make sure any problems surface before we ask everyone to download a new Prodigy update and retrain all their models.


@honnibal May I just check with this - we’re currently using spacy pretrain on the prebuilt spacy models to prepare for compatibility, and using prodigy’s ner.match recipe to build a training dataset. Should we expect these/any other artefacts to break with the new update, or will we only need to retrain the models themselves?

(Ines Montani) #4

It’s really only the models :slightly_smiling_face:

In theory there is a possibility that the tokenization can differ for very specific edge cases. But it’s extremely unlikely that this would affect any of the entity spans you’ve annotated – for this to happen, the character offsets of the entities would have to not map to valid token boundaries anymore. But this is also something you can verify pretty easily yourself: for every span you’ve annotated in a document, Doc.char_span needs to succeed.
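A minimal sketch of that check, assuming your annotations are in Prodigy's task format (a dict with a "text" and a "spans" list holding "start" and "end" character offsets) and that `nlp` is the spaCy v2.1 pipeline you plan to train with. The function name `find_misaligned_spans` is made up for illustration.

```python
def find_misaligned_spans(nlp, examples):
    """Return (text, span) pairs whose character offsets no longer
    map to token boundaries under the tokenizer of `nlp`."""
    misaligned = []
    for eg in examples:
        # make_doc only runs the tokenizer, so no trained model is needed
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            # Doc.char_span returns None when the offsets don't align with tokens
            if doc.char_span(span["start"], span["end"]) is None:
                misaligned.append((eg["text"], span))
    return misaligned

# Hypothetical usage (requires spaCy; the example task is invented):
#   import spacy
#   nlp = spacy.blank("en")
#   tasks = [{"text": "Red Bumblebee flies", "spans": [{"start": 0, "end": 13}]}]
#   print(find_misaligned_spans(nlp, tasks))
```

If the returned list is empty, every annotated span still lines up with the new tokenization.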


Excellent! That makes sense; thanks for the clarification. Looking forward to hearing about progress!


Are there any updates on this? I just updated to spacy 2.1 and get the following error for “ner.teach”:

“ImportError: cannot import name _cleanup”

Is this due to spacy 2.1?

(Matthew Honnibal) #7

@BLP Yes, that’s due to spaCy v2.1.

We have a build of Prodigy that works with v2.1, but there’s one or two features we’d still like to add, especially pretraining support in the recipes. We also want to keep testing, as we want to make sure we give everyone a smooth experience.

Actually it would be useful to have some external testers as well. If you want to try it out, send us an email?

(Nicolai Bjerre Pedersen) #8

I sent you an email the other day regarding testing. I’ll be happy to start testing the new version. I am streaming data from Google Firestore and will probably use that for saving the annotations as well. I’ll be using Prodigy for textcat today and for a parser for custom semantics soon (somehow).

(Ronaldo V ) #10

We have a build of Prodigy that works with v2.1

is it possible to get the working version?

I’ve just started with Prodigy, and I’ve been working with spaCy for a year (successfully!). Downgrading spaCy breaks my scripts, and in the current environment it makes no sense to integrate Prodigy now, only to redo the work when a new version is released.

I’m stuck between two stools…

(Ronaldo V ) #11

What’s up with testing the Prodigy build that works with spaCy 2.1?
I already wrote to contact but didn’t get an answer…

I’ve tried to install against 2.1.3, but it crashes with thinc, so I stopped “researching”.

(Matthew Honnibal) #12

Hi Ronaldo,

We didn’t receive your email about this. Apologies for that — if it’s still relevant, perhaps you could resend?

We’ve been testing the new version carefully as once it’s released people will need to retrain their models to upgrade, which is inconvenient. You can always export your Prodigy annotations and use them to train a spaCy v2.1 model, so it shouldn’t affect your total workflow to be using the current version of Prodigy. You shouldn’t have Prodigy in your production runtime, usually: there’s no reason to be running the annotation from the same environment you’re using to run your models in production.
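As a rough sketch of that export route: the snippet below assumes the annotations were dumped to JSONL with `prodigy db-out`, and that each accepted task carries a complete set of entity spans (as with manual annotation). Binary accept/reject decisions from recipes like ner.teach or ner.match only say something about a single span, so treating them as complete is an assumption you would need to check for your data. The function name `prodigy_to_spacy` is hypothetical.

```python
import json

def prodigy_to_spacy(jsonl_lines):
    """Convert exported Prodigy NER tasks into spaCy v2-style training
    tuples: (text, {"entities": [(start, end, label), ...]})."""
    train_data = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored tasks
        entities = [(s["start"], s["end"], s["label"])
                    for s in eg.get("spans", [])]
        train_data.append((eg["text"], {"entities": entities}))
    return train_data

# Hypothetical usage ("annotations.jsonl" is a placeholder file name):
#   with open("annotations.jsonl", encoding="utf8") as f:
#       TRAIN_DATA = prodigy_to_spacy(f)
```

The resulting list can then be fed to a training loop in a separate spaCy v2.1 environment, independent of the environment Prodigy runs in.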

We should be pushing a new patch release of spaCy today that fixes a couple of bugs. Once that’s out, I think the current build we have of Prodigy could be considered a release candidate. If there are no further problems found we’ll go ahead and make the v1.8 release. However, please do be patient.

(Ronaldo V ) #13

That sounds good.
I’m developing a system that isn’t in production yet. The NLP part is integrated, and the annotations are used in a higher-level evaluation process. In spaCy we use, for example, the EntityRuler, and are otherwise set on 2.1. If we go back to 2.0, there are problems with the current scripts. So it isn’t productive to adopt Prodigy 1.7 and debug against it when an upgrade is coming in the near future.
So I don’t care about problems with 2.0 models. I need to integrate the frontend functionally, and an RC should be sufficient.
How do I get the RC / v1.8? My licence was bought by my company.

(TW) #14

+1, I would love to join the beta for my current scenario of trying to use ner.match (EDIT: also ner.teach :sob:):

Without spaCy v2.1, I’m getting: ValueError: [T002] Pattern length (10) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.

With spaCy v2.1, I’m getting: ImportError: cannot import name _cleanup

(Ines Montani) #15

Just out of curiosity, what types of patterns do you have that are 10+ tokens long? While there can always be edge cases (like trying to match a span with lots of punctuation etc. that ends up as lots of tokens), you usually don’t want to be matching sequences this long when annotating in Prodigy. Keep in mind that phrase patterns (e.g. "pattern": "some string") really only return exact string matches. So unless your data really contains a lot of mentions of those exact strings, the pattern likely won’t be that useful, either.

(TW) #16

Yes, I am looking for exact pattern matches. In my case, the entities I’m looking for overlap a lot with generic, non-entity terms (imagine an entity Red Bumblebee, which is the same as the generic term red bumblebee), and the texts aren’t very long, so there isn’t much context for the model to use when deciding between entity and generic term. This is just my elementary understanding of what’s possible, though, so I will try your suggestion. At least that way I can get started and see how the results look :slight_smile:

(Ines Montani) #17

@trevorwelch Yes, this makes sense. So it looks like you might just have a few outliers in your patterns then? The pattern length limit only applies if the string you want to match consists of 10 or more tokens in total. If those examples are important for your use case, you can always add annotations for them later – but working with shorter patterns only will probably still let you cover the most frequent entities.
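One way to set those outliers aside is to split the patterns file before running the recipe. This is only a sketch: it assumes a JSONL patterns file where "pattern" is either a string or a list of token dicts, and it counts tokens by whitespace by default, which can undercount compared to spaCy's tokenizer (punctuation gets split off into extra tokens). For an accurate count you could pass `tokenize=nlp.make_doc` instead. The function name and the `MAX_TOKENS` constant are invented for illustration, with the value taken from the T002 error message quoted above.

```python
import json

MAX_TOKENS = 10  # limit quoted in the T002 error message

def split_patterns_by_length(jsonl_lines, tokenize=str.split,
                             max_tokens=MAX_TOKENS):
    """Split pattern entries into (keep, too_long) by token count."""
    keep, too_long = [], []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        pattern = entry["pattern"]
        if isinstance(pattern, str):
            # phrase pattern: count the tokens in the string
            n_tokens = len(tokenize(pattern))
        else:
            # token pattern: one dict per token
            n_tokens = len(pattern)
        if n_tokens >= max_tokens:
            too_long.append(entry)
        else:
            keep.append(entry)
    return keep, too_long
```

The entries in `too_long` can be kept aside and annotated manually later, while `keep` is written back out as the patterns file for ner.match or ner.teach.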

(Martin Galese) #19

Any updates here? Any beta that could be made available?


(Olivier) #20

I have just purchased Prodigy, but unfortunately it forces me to go back to spaCy 2.0.18 when I have been working with 2.1.3 for several weeks.
I was expecting it to be on the same version as spaCy.
What should I do?

(Ines Montani) #21

We’re hoping to have the release ready by the end of this week! If you want to test the beta wheels before that, feel free to send us an email.

As Matt mentioned above, the new spaCy version is very new and backwards-incompatible with the previous version when it comes to the models and matcher engine (both important features for Prodigy). We wanted to get spaCy v2.1 out as early as possible to give users enough time to make sure they’re able to upgrade their applications, before also asking all our Prodigy users to retrain all of their models. I also just shared some more background here:
