✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more

At the moment, yes – also see my comment above. It should be pretty straightforward – you can load DocBin from disk, get the Doc objects and then create dictionaries based on the data you need from them. (That's also what makes a generic .spacy loader a bit tricky: the Doc objects may contain various different annotations and Prodigy can't easily guess which ones you want to include in the data.)
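For example, a minimal loader along those lines could look like this (the file path and the attributes you extract are just placeholders):

```python
import spacy
from spacy.tokens import DocBin

# Use the same language/vocab as the pipeline that created the data
nlp = spacy.blank("en")

doc_bin = DocBin().from_disk("./train.spacy")

examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    # Pick whichever annotations you want to bring into Prodigy,
    # here the text plus any entity spans
    examples.append({
        "text": doc.text,
        "spans": [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ],
    })
```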

You can use the en_core_web_lg package instead :slightly_smiling_face: Alternatively, spaCy v3 also provides an init vectors command that lets you create your own vectors package (e.g. from any FastText vectors): Linguistic Features · spaCy Usage Documentation
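For example, something along these lines should work (the vectors file and output directory are placeholders):

```python
import subprocess
import sys

# Create a vectors-only pipeline from FastText vectors
subprocess.run([
    sys.executable, "-m", "spacy", "init", "vectors",
    "de",                   # language of the vectors
    "cc.de.300.vec.gz",     # FastText vectors file
    "./de_vectors_model",   # output directory
], check=True)
```

The resulting directory can then be loaded like any other pipeline.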

@SofieVL Any idea here? Your help would be much appreciated.

Hi @inceatakan, apologies - it looks like we missed your earlier question!

Since v3, the tagger only assigns the token.tag attribute, and not token.pos anymore. The mapping between the two is now done by the attribute_ruler. You can find some more details in the v3 migration guide and in the usage documentation.
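For example, with a trained pipeline like en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # includes both 'tagger' and 'attribute_ruler'

doc = nlp("She was reading a book.")
for token in doc:
    # tag_ is predicted by the tagger, pos_ is mapped from it by the attribute_ruler
    print(token.text, token.tag_, token.pos_)
```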

Is there a way to configure a custom tokenizer for prodigy data-to-spacy or prodigy train? With spaCy 2 / Prodigy 1.10, we used to load a model, patch the tokenizer with additional rules, save it, and then use that as a base model in prodigy train. I'm not sure, but I assume it was using the serialized tokenizer from that model.

With spaCy 3 we switched to an initialize callback to customize the tokenizer.
We create a base model using spacy assemble config-with-init.cfg base-model --code custom_tokenizer.py. But using that model in prodigy data-to-spacy -m base-model... fails because the callback code is not on the Python path. If I remove the before_init callback from the config of the assembled base model, our custom tokenizer is not used. I'm not sure whether it isn't serialized at all or just isn't loaded from the base model.
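For reference, custom_tokenizer.py registers the callback roughly like this (simplified sketch; the extra infix rule is just an example):

```python
import spacy
from spacy.util import compile_infix_regex

# Referenced from the config as:
#   [initialize.before_init]
#   @callbacks = "customize_tokenizer"
@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Example rule: also split tokens on forward slashes
        infixes = list(nlp.Defaults.infixes) + [r"/"]
        nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    return customize_tokenizer
```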

Any idea how to make this work without having to package the callback code separately? Thanks.

I think I was wrong. prodigy data-to-spacy does use the customized serialized tokenizer created via spacy assemble. Now I can in turn do a spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy --code custom_tokenizer.py with the exported data. That works but I feel there must be a simpler way.

@sbrunk: If the custom callback is in [initialize], then you should only need it when you run spacy assemble and not when the model is loaded later because the settings are saved in the model.

If you need custom code every time the model is loaded (custom component architecture/factory, callback in nlp, etc.), then the easiest way is to package the model with spacy package and install it with pip. Then you can specify the model name with -b just like en_core_web_sm.
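For example, the packaging step could look roughly like this (paths and names are placeholders; --code bundles the callback module into the package):

```python
import glob
import subprocess
import sys

# Build an installable package from the assembled pipeline directory
subprocess.run([
    sys.executable, "-m", "spacy", "package", "./base-model", "./packages",
    "--code", "custom_tokenizer.py", "--build", "wheel",
], check=True)

# Install the built wheel; the exact file name depends on the name/version in meta.json
wheel = glob.glob("./packages/*/dist/*.whl")[0]
subprocess.run([sys.executable, "-m", "pip", "install", wheel], check=True)
```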

If you don't want to use spacy assemble, you can still do what you did for v2, where you modify the tokenizer in a loaded model and save it to disk with nlp.to_disk. It will only load correctly from a path/directory if there is no custom code required; otherwise you would still need to use spacy package and install it.
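A minimal sketch of that approach (the extra infix rule is just an example):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Patch the tokenizer of the loaded pipeline, e.g. also split on forward slashes
infixes = list(nlp.Defaults.infixes) + [r"/"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# The tokenizer settings are serialized with the pipeline, so this directory
# loads back without any custom code
nlp.to_disk("./base-model")
```

The saved directory can then be used as the base model, e.g. prodigy data-to-spacy -m ./base-model ..., without a --code argument.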

Thanks a lot for your help @adriane. Right now the callback only changes the tokenizer so spacy package shouldn't be necessary (yet). Might make sense to try it nevertheless though.

Thank you very much @SofieVL !

Yours,
-Atakan

Hi, nice features you have there! Just wondering, is it possible to install both stable and nightly on the same machine?

Thanks! And yes, that's no problem if you use separate virtual environments. (In general, we'd always recommend using virtual environments instead of installing things into the system Python. It makes it much easier to start over if you end up in a weird state, and you can run different versions of libraries for different projects.)

Hi, thanks for offering access to the nightly version. Using the new spans.manual annotation UI, I run into problems annotating sub-word spans at the end of the last word: it is difficult to include the last word's final character in the span (the last character is missing in the annotation result). Thanks for your help in advance.

Hi Thomas,

Sorry to hear you're having trouble. I've tried the spans.manual recipe a bit myself, and haven't been able to identify any issues with spans near the end of the sentence. Can you provide a bit more detail on the issue you're running into - preferably some sample input, what you want to annotate, and what exactly goes wrong? (perhaps a screenshot could be useful too).

Thanks for your quick help! This is what my annotation problem looks like:

Screenshot 2021-05-25 160430

{"text":"Schifffahrtskauffrau","_input_hash":720001308,"_task_hash":-511597073,"spans":[
{"start":0,"end":20,"label":"label1"},
{"start":12,"end":19,"label":"label2"},
{"start":16,"end":19,"label":"label3"}],
"meta":{"pattern":""},"_session_id":null,"_view_id":"spans_manual","answer":"accept"}

In the result, label2 and label3 should also end at character 20 (which is difficult for me to select).
Thanks in advance!

Hm, when I use the -C flag for spans.manual it does work as I expect it to, also near the end of the sentence:

image

{"text":"Schifffahrtskauffrau","_input_hash":720001308,
"_task_hash":1641394602,"_session_id":null,"_view_id":"spans_manual",
"spans":[{"start":0,"end":20,"label":"l1"},
{"start":12,"end":20,"label":"l2"},
{"start":16,"end":20,"label":"l3"}],"answer":"accept"}

With this "character-level" annotation you really do need to be very careful how you annotate though, which is why we'd typically recommend doing token-based span annotations (in combination with a custom tokenizer if need be).

Thanks for your help! I love the new spans.manual UI. Maybe, I have to train a little bit to catch all characters as intended ... :slight_smile:

Which operating system and browser are you using? It's possible that there are combinations where the sensitivity of the text selection is different :thinking: (Just not 100% sure if this is something we can easily work around.)

I use Windows 10 with the latest Chrome browser.
When marking from left to right up to the last character, the last letter is almost always missing. If, after marking to the end of the last letter, I move the mouse slightly back to the left so that the cursor changes from the arrow to the text cursor, the last character is included as well (but that is quite tedious).

I also noticed that when many spans overlap, it's no longer possible to additionally annotate very small character ranges. To get all the desired annotations, you have to delete everything again, start with the smallest spans, and then move on to the longer ones.

Thanks again for introducing this powerful new feature - I hope the information is still helpful.

I've noticed that my earlier prodigy.json doesn't work anymore. Do you have documentation on how to set labels and colors for NER in the new nightly version?

Thanks, that's interesting! I think I might know what it could be: the existing span indicators and labels may interfere with the selection, because they're sort of "hidden" within it. And the effects of this can be subtly different, depending on the browser's/operating system's native select etc. I'll take a look at this, maybe there's something we can add to work around this! Glad to hear you like the feature and it's useful to you :blush:

What do you mean by "doesn't work anymore"? Is there an error, or do you just not see an effect from your changes? There shouldn't be any differences in any of the custom theme settings between v1.10 and v1.11. So if your previous settings are not reflected anymore, can you share the relevant parts of your prodigy.json?

It complains that the format is not right. This used to work earlier:

{"labels": ["abc", "xyz"],
"custom_theme": {"labels":
     {"abc": "#a1c9f4", "xyz": "#debb9b"}
    }
}