Ah, it's interesting you mention this because I've actually been thinking about it for a while! I've been meaning to look into the custom components again now that the feature is more stable. One aspect that will make the Prodigy components a bit more involved is that Prodigy is inherently stateful and a lot of the most relevant logic happens on the back-end. But I think we can come up with a solution. Definitely good to know that you'd be interested in the feature, so I can reach out once there's something to beta test.
Thanks for the report, that's strange! I think there must have been a problem with compiling the nightly. Just triggered a new build and tested it locally and it fixed the issue. Will release v1.11.0a8 shortly!
A question about expected behaviour and updating existing models.
Prior to using the nightly build I was using prodigy ner.teach method to annotate "edge cases" and found this super useful. I then went on to use prodigy train with a base model and it updated an existing spacy model I was using. The result was that I got most of the entities from the spacy model but with more of the entities I cared about (and more accurate tagging of them). It worked great.
Using the nightly version, I changed my approach to the config-based training and I have effectively lost all of the benefits of using an existing spacy model: my new model is severely overfitting to these edge case examples. Is this expected? Can you advise me on what I might focus on changing in the config to more closely replicate the results from the original prodigy train with a base model?
At the moment, we do expect updating from binary annotations to produce worse results with the nightly, sorry about that! There are a few small changes we still need to get into spaCy to allow updating the components with all constraints defined by the binary answers. This is most likely the effect you're seeing here, and it's the main update we still need in order to release the stable version of Prodigy for spaCy v3.
I hadn't considered this yet, but it's a good idea! I'll put this on my list of enhancements for the upcoming version. In the meantime, it should hopefully be pretty straightforward to implement your own loader for this.
For the built-in loader, we might have to make this a separate command that outputs the JSON you can pipe forward, or come up with some special syntax so you can specify which annotations to extract from the Doc objects. For example, in ner.manual you'll (likely) want the doc.ents to be added to the "spans", but in another recipe, you might want to use the part-of-speech tags instead, etc.
I trained a custom tagger model with Spacy 3.0.5 and Prodigy 1.11.0a8, using the same dataset in json format I used for training a custom model with Spacy V2. Although token.tag_ returns tags, token.pos_ returns nothing:
doc = nlp("I am hungry.")
print([token.tag_ for token in doc])
['PRP', 'VBP', 'JJ', '.']
doc = nlp("I am hungry.")
print([token.pos_ for token in doc])
['', '', '', '']
Prodigy is such a friendly and well-thought tool. I wish to try the nightly version as I want to use Spacy v3. Could you please advise me on how to get the download link? I signed up for the program but have not received an automated email yet.
I have a spacy docbin of annotated sentence boundaries. How do I use that corpus as source data to textcat or ner annotation in Prodigy? Do I need to write a custom loader for spacy 3 docbin objects saved on disk?
At the moment, yes – also see my comment above. It should be pretty straightforward – you can load DocBin from disk, get the Doc objects and then create dictionaries based on the data you need from them. (That's also what makes a generic .spacy loader a bit tricky: the Doc objects may contain various different annotations and Prodigy can't easily guess which ones you want to include in the data.)
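As a rough illustration of the approach described above, here is a minimal sketch of such a loader. It assumes you want the doc.ents as "spans" (as you might for ner.manual); the function names doc_to_task and docbin_stream are made up for this example and are not part of Prodigy or spaCy.

```python
from typing import Any, Dict, Iterator


def doc_to_task(doc: Any) -> Dict[str, Any]:
    """Convert one spaCy Doc into a Prodigy-style task dictionary.

    Assumption: the entities in doc.ents are the annotations to keep.
    Swap in other attributes (tags, sentence starts, ...) as needed.
    """
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    return {"text": doc.text, "spans": spans}


def docbin_stream(path: str, lang: str = "en") -> Iterator[Dict[str, Any]]:
    """Yield one task dictionary per Doc stored in a .spacy file."""
    # Imported lazily so doc_to_task can be used without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # only the vocab is needed to restore the Docs
    doc_bin = DocBin().from_disk(path)
    for doc in doc_bin.get_docs(nlp.vocab):
        yield doc_to_task(doc)
```

The dictionaries could then be written out as JSONL and passed to a recipe as the source, or yielded directly from a custom recipe. For textcat, the "text" key alone is probably enough.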
Is there a way to configure a custom tokenizer for prodigy data-to-spacy or prodigy train? With Spacy 2/Prodigy 1.10 we used to load a model, patch the tokenizer with additional rules, save and then use that as a base model in prodigy train. I'm not sure but I assume it was using the serialized tokenizer from that model.
With Spacy 3 we switched to an initialize callback to customize the tokenizer.
We create a base model using spacy assemble config-with-init.cfg base-model --code custom_tokenizer.py. But using that model in prodigy data-to-spacy -m base-model... fails because the callback code is not on the python path. If I remove the before_init from the config of the assembled base model our custom tokenizer is not used. I'm not sure if it is not serialized at all or it is not loaded from the base model.
Any idea how to make this work without having to package the callback code separately? Thanks.
I think I was wrong. prodigy data-to-spacy does use the customized serialized tokenizer created via spacy assemble. Now I can in turn do a spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy --code custom_tokenizer.py with the exported data. That works but I feel there must be a simpler way.
@sbrunk: If the custom callback is in [initialize], then you should only need it when you run spacy assemble and not when the model is loaded later because the settings are saved in the model.
If you need custom code every time the model is loaded (custom component architecture/factory, callback in nlp, etc.), then the easiest way is to package the model with spacy package and install it with pip. Then you can specify the model name with -b just like en_core_web_sm.
If you don't want to use spacy assemble, you can still do what you did for v2, where you modify the tokenizer in a loaded model and save it to disk with nlp.to_disk. It will only load correctly from a path/directory if there is no custom code required, otherwise you would still need to use spacy package and install it.
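For reference, a minimal sketch of modifying the tokenizer in a loaded model and saving it with nlp.to_disk might look like this. The special-case rule for "gimme" and the function name save_patched_model are only illustrative, and spacy.blank stands in for whatever base model you would actually load.

```python
def save_patched_model(output_dir: str, lang: str = "en") -> None:
    """Add a tokenizer special case and serialize the pipeline to disk."""
    # Imported lazily so this module can be imported without spaCy installed.
    import spacy
    from spacy.symbols import ORTH

    # Stand-in for spacy.load("your_base_model"); a blank pipeline keeps
    # the example self-contained.
    nlp = spacy.blank(lang)

    # Hypothetical rule: always split "gimme" into "gim" + "me".
    nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

    # The tokenizer rules are saved with the pipeline, so no extra code
    # is needed when the directory is loaded again later.
    nlp.to_disk(output_dir)
```

Since rules added this way are plain data rather than code, the saved directory should load with spacy.load(output_dir) and can be passed as the base model without --code.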
Thanks! And yes, that's no problem if you use separate virtual environments. (In general, we'd always recommend using virtual environments instead of installing things into the system Python. It makes it much easier to start over if you end up in a weird state, and you can run different versions of libraries for different projects.)