When do you expect support for spaCy v2.1?
Probably early next week. We’ve updated the code for it, and are doing some manual testing to make sure we don’t need to tweak any of the active learning heuristics for the new models, in case things changed.
We also wanted to make sure things were fully stable. With Prodigy it’s less convenient for users to install updates than it is for spaCy and other open-source libraries, so we’re a bit more cautious. v2.1 has been pretty well tested because it was on nightly for so long, but we still want to make sure any problems surface before we ask everyone to download a new Prodigy update and retrain all their models.
@honnibal May I just check on this: we’re currently using spacy pretrain on the prebuilt spaCy models to prepare for compatibility, and using Prodigy’s ner.match recipe to build a training dataset. Should we expect these or any other artefacts to break with the new update, or will we only need to retrain the models themselves?
It's really only the models.
In theory there is a possibility that the tokenization can differ for very specific edge cases. But it's extremely unlikely that this would affect any of the entity spans you've annotated: for this to happen, the character offsets of the entities would have to not map to valid token boundaries anymore. This is also something you can verify pretty easily yourself: for every span you've annotated in a document, calling Doc.char_span with its start and end character offsets needs to succeed, that is, return a span rather than None.
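The check described above can be sketched in a few lines. This is only a minimal sketch, not something shipped with Prodigy or spaCy: the helper name find_misaligned_spans is hypothetical, and it assumes each example is a dict in the usual Prodigy task format, with a "text" string and a list of "spans" carrying "start" and "end" character offsets. The nlp argument is whatever pipeline you load under the new spaCy version.

```python
def find_misaligned_spans(nlp, examples):
    """Return (text, span) pairs whose offsets no longer match token boundaries.

    nlp: a loaded spaCy pipeline (only its tokenizer is used here).
    examples: iterable of dicts with "text" and optional "spans"
        (assumed Prodigy-style task format).
    """
    misaligned = []
    for eg in examples:
        # make_doc only tokenizes, so no trained components are needed.
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            # Doc.char_span returns None when the character offsets
            # do not line up with token boundaries.
            if doc.char_span(span["start"], span["end"]) is None:
                misaligned.append((eg["text"], span))
    return misaligned
```

If the returned list is empty for your whole dataset, the existing annotations still line up with the new tokenization and only the models need retraining.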
Excellent! That makes sense; thanks for the clarification. Looking forward to hearing about progress!
Are there any updates on this? I just updated to spacy 2.1 and get the following error for “ner.teach”:
“ImportError: cannot import name _cleanup”
Is this due to spacy 2.1?
@BLP Yes, that’s due to spaCy v2.1.
We have a build of Prodigy that works with v2.1, but there’s one or two features we’d still like to add, especially pretraining support in the recipes. We also want to keep testing, as we want to make sure we give everyone a smooth experience.
Actually, it would be useful to have some external testers as well. If you want to try it out, send us an email at email@example.com.
I sent you an email the other day regarding testing from firstname.lastname@example.org. I’ll be happy to start testing the new version. I am streaming data from Google Firestore and will probably use that for saving the annotations as well. I’ll be using Prodigy for textcat today and for a parser for custom semantics soon (somehow).
“We have a build of Prodigy that works with v2.1”
Is it possible to get the working version?
I’ve just started with Prodigy, and I’ve been working with spaCy for a year (successfully!). Downgrading spaCy causes issues with my scripts, and in my current environment it makes no sense to integrate Prodigy now, only to do the work again when a new version is released.
I’m caught between two stools…
What’s the status of testing the Prodigy build that works with spaCy 2.1?
I already wrote to the contact address but didn’t get an answer…
I’ve tried to install against 2.1.3, but it crashes with thinc, so I stopped “researching”.
We didn’t receive your email about this. Apologies for that. If it’s still relevant, perhaps you could resend?
We’ve been testing the new version carefully because, once it’s released, people will need to retrain their models to upgrade, which is inconvenient. You can always export your Prodigy annotations and use them to train a spaCy v2.1 model, so using the current version of Prodigy shouldn’t affect your overall workflow. Usually you shouldn’t have Prodigy in your production runtime: there’s no reason to be running the annotation from the same environment you’re using to run your models in production.
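As a rough illustration of that export route for NER: the prodigy db-out command writes a dataset as JSONL, and the records can be reshaped into the (text, {"entities": [...]}) tuples that spaCy v2 training loops accept. The sketch below assumes each record has a "text", an "answer" and a list of "spans" with "start", "end" and "label"; the function name to_spacy_train_data is made up for this example. It keeps accepted records only, which suits annotations where each accepted example carries all of its entities (for example from manual annotation). Binary accept/reject decisions from ner.teach would need different handling.

```python
import json


def to_spacy_train_data(jsonl_lines):
    """Convert exported annotation records into spaCy v2-style training tuples."""
    train_data = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        # Skip rejected or ignored examples (assumes an "answer" field).
        if record.get("answer") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        train_data.append((record["text"], {"entities": entities}))
    return train_data
```

With an export saved to a file such as annotations.jsonl (a placeholder name), passing the open file object to to_spacy_train_data gives a list you can feed to a training script in the spaCy v2.1 environment, with no Prodigy installation needed there.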
We should be pushing a new patch release of spaCy today that fixes a couple of bugs. Once that’s out, I think the current build we have of Prodigy could be considered a release candidate. If there are no further problems found we’ll go ahead and make the v1.8 release. However, please do be patient.
That sounds good.
I’m developing a system that isn’t in production yet. The NLP part is integrated, and the annotations are used in a higher-level evaluation process. In spaCy we use, for example, EntityRuler, and we are otherwise committed to 2.1. If we go back to 2.0, there are problems with the current scripts. So it isn’t worthwhile to adopt Prodigy 1.7 and debug against it when an upgrade is coming in the near future.
So I don’t care about problems with 2.0 models. I have to integrate the frontend functionally, and an RC should be sufficient.
How do I get the RC / v1.8? My licence was bought by my company.
+1, I would love to join the beta for my current scenario of trying to use ner.match.
EDIT: also
Without spaCy v2.1, I’m getting:
ValueError: [T002] Pattern length (10) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.
With spaCy v2.1, I’m getting:
ImportError: cannot import name _cleanup
Just out of curiosity, what types of patterns do you have that are 10+ tokens long? While there can always be edge cases (like trying to match a span with lots of punctuation etc. that ends up as lots of tokens), you usually don't want to be matching sequences this long when annotating in Prodigy. Keep in mind that phrase patterns (e.g. "pattern": "some string") really only return exact string matches. So unless your data really contains a lot of mentions of those exact strings, the pattern likely won't be that useful, either.
Yes, it is the case that I’m looking for exact pattern matches. In my case, the entities I’m looking for have a lot of overlap with generic, non-entity terms (imagine an entity Red Bumblebee which is the same as a generic term red bumblebee), and the text isn’t very long, so there is not much context for the model to determine whether something is an entity or generic. This is my elementary understanding of what’s possible though, so I will try your suggestion; at least that way I can get started and see how the results look.
@trevorwelch Yes, this makes sense. So it looks like you might just have a few outliers in your patterns then? The pattern length limit only applies if the string you want to match consists of 10 or more tokens in total. If those examples are important for your use case, you can always add annotations for them later, but working with shorter patterns only will probably still let you cover the most frequent entities.
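One way to set the outliers aside before running ner.match is to split the patterns by token count. This is a minimal sketch under a few assumptions: patterns are dicts like {"label": ..., "pattern": ...} as in a patterns JSONL file, the helper name split_patterns_by_length and the tokenize argument are hypothetical, and token-based patterns (lists) are left alone on the assumption that the limit only concerns phrase patterns.

```python
def split_patterns_by_length(patterns, tokenize, max_length=10):
    """Separate phrase patterns under the length limit from those at or over it.

    patterns: dicts like {"label": "X", "pattern": "some string"}.
    tokenize: callable returning a list of tokens for a string
        (assumption: in practice, wrap your spaCy tokenizer).
    max_length: phrase patterns with this many tokens or more are set
        aside, mirroring the ">= max_length (10)" wording of the error.
    """
    usable, too_long = [], []
    for entry in patterns:
        pattern = entry["pattern"]
        # Only string (phrase) patterns are checked against the limit.
        if isinstance(pattern, str) and len(tokenize(pattern)) >= max_length:
            too_long.append(entry)
        else:
            usable.append(entry)
    return usable, too_long
```

For realistic counts, tokenize should use the same tokenizer as the model, for example a small function that returns the tokens of nlp.make_doc(text). The too_long list then shows which entities would need manual annotation later.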
Any updates here? Any beta that could be made available?
I have just purchased Prodi.gy, but unfortunately it forces me to go back to spaCy 2.0.18 when I have been working with 2.1.3 for several weeks.
I was expecting Prodi.gy to be on the same version as spaCy.
What should I do?