In theory, the tokenization can differ for very specific edge cases. But it's extremely unlikely that this would affect any of the entity spans you've annotated – for this to happen, the character offsets of the entities would have to no longer map to valid token boundaries. This is also something you can verify pretty easily yourself: for every span you've annotated in a document, Doc.char_span needs to succeed.
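As a quick sketch of that check (assuming spaCy is installed and using a blank English pipeline as a stand-in for your own), Doc.char_span returns None whenever the character offsets don't line up with token boundaries:

```python
import spacy

nlp = spacy.blank("en")  # stand-in: use your own pipeline here

def verify_spans(text, spans):
    """Return the annotated (start, end, label) spans whose character
    offsets do NOT map to valid token boundaries in the tokenized text."""
    doc = nlp(text)
    bad = []
    for start, end, label in spans:
        # Doc.char_span returns None if the offsets don't align with tokens
        if doc.char_span(start, end, label=label) is None:
            bad.append((start, end, label))
    return bad

# Offsets that align with token boundaries pass the check
print(verify_spans("Apple is a company", [(0, 5, "ORG")]))  # []
# Offsets that cut into the middle of a token are flagged
print(verify_spans("Apple is a company", [(0, 3, "ORG")]))  # [(0, 3, "ORG")]
```

Running this over every annotated document would confirm whether any of your spans are affected.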
We have a build of Prodigy that works with v2.1, but there are one or two features we'd still like to add, especially pretraining support in the recipes. We also want to keep testing, as we want to make sure we give everyone a smooth experience.
Actually, it would be useful to have some external testers as well. If you want to try it out, send us an email: email@example.com
I sent you an email the other day about testing, from firstname.lastname@example.org. I'll be happy to start testing the new version. I'm streaming data from Google Firestore and will probably use that for saving the annotations as well. I'll be using Prodigy for textcat today, and soon for a parser with custom semantics (somehow).
I've just started with Prodigy, and I've been working with spaCy for a year (successfully!). I ran into issues with my scripts when downgrading spaCy, and in my current environment it makes no sense to integrate Prodigy only to redo that work when a new version is released.
We didn’t receive your email about this. Apologies for that — if it’s still relevant, perhaps you could resend?
We've been testing the new version carefully because once it's released, people will need to retrain their models to upgrade, which is inconvenient. You can always export your Prodigy annotations and use them to train a spaCy v2.1 model, so using the current version of Prodigy shouldn't affect your overall workflow. You usually shouldn't have Prodigy in your production runtime anyway: there's no reason to run annotation in the same environment where your models run in production.
We should be pushing a new patch release of spaCy today that fixes a couple of bugs. Once that’s out, I think the current build we have of Prodigy could be considered a release candidate. If there are no further problems found we’ll go ahead and make the v1.8 release. However, please do be patient.
That sounds good.
I'm developing a system that isn't in production yet. The NLP part is integrated, and the annotations are used in a higher-level evaluation process. In spaCy we use the EntityRuler, for example, and are otherwise pinned to 2.1. If we go back to 2.0, our current scripts break. So it's not practical to use Prodigy 1.7 and debug against 2.0 when an upgrade is coming in the near future.
So I don't care about problems with 2.0 models. I need to integrate the frontend functionality, and an RC should be sufficient.
How do I get the RC / v1.8? My licence was bought by my company.
Just out of curiosity, what types of patterns do you have that are 10+ tokens long? While there can always be edge cases (like trying to match a span with lots of punctuation etc. that ends up as lots of tokens), you usually don't want to be matching sequences this long when annotating in Prodigy. Keep in mind that phrase patterns (e.g. "pattern": "some string") really only return exact string matches. So unless your data really contains a lot of mentions of those exact strings, the pattern likely won't be that useful, either.
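If it helps, here's a minimal way to check how many tokens a phrase pattern string actually produces, assuming spaCy is installed (a blank English pipeline is used here as a stand-in for whatever pipeline your recipe loads; the 10-token threshold is the limit discussed in this thread):

```python
import spacy

nlp = spacy.blank("en")  # stand-in: use the same pipeline as your recipe

def pattern_length(string):
    """Number of tokens a phrase pattern string is split into."""
    return len(nlp(string))

# Short phrases are well under the limit
print(pattern_length("red bumblebee"))  # 2
# Ten or more tokens would hit the pattern length limit
print(pattern_length("the quick brown fox jumps over the lazy dog again"))  # 10
```

Running this over your patterns file would let you spot the outliers before loading them.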
Yes, it is the case that I'm looking for exact pattern matches. In my case, the entities I'm looking for overlap heavily with generic, non-entity terms (imagine an entity Red Bumblebee that is the same string as the generic term red bumblebee), and the text isn't very long, so there's not much context for the model to use to decide between entity and generic. That's my elementary understanding of what's possible, though, so I will try your suggestion; at least that way I can get started and see how the results look.
@trevorwelch Yes, this makes sense. So it looks like you might just have a few outliers in your patterns then? The pattern length limit only occurs if the string you want to match consists of 10 or more tokens in total. If those examples are important for your use case, you can always add annotations for them later – but working with shorter patterns only will probably still let you cover the most frequent entities.
I have just purchased Prodi.gy, but unfortunately it forces me to go back to spaCy 2.0.18 when I have been working with 2.1.3 for several weeks.
I was expecting Prodi.gy to be on the same version as spaCy.
What should I do?
We're hoping to have the release ready by the end of this week! If you want to test the beta wheels before that, feel free to send us an email.
As Matt mentioned above, the new spaCy version is very new and backwards-incompatible with the previous version when it comes to the models and matcher engine (both important features for Prodigy). We wanted to get spaCy v2.1 out as early as possible to give users enough time to make sure they're able to upgrade their applications, before also asking all our Prodigy users to retrain all of their models. I also just shared some more background here:
Btw, quick update in case you missed it: We released v1.8 yesterday, which introduces support for spaCy v2.1, including a bunch of other new features. See here for the full release notes: https://prodi.gy/docs/changelog