Pipeline for POS corrections and dep corrections

Hello! New Prodigy user here!

We have some large corpora of conversational speech we'd like to eventually parse with spaCy and correct with Prodigy. However, the nature of the corpora (child-produced speech) means some POS tags need to be hand-corrected as well, which will affect the dependency parse.

Our first thought is roughly the following:

  1. Use spaCy to tag POS
  2. Correct POS with Prodigy
  3. Use the gold-standard POS tags to inform spaCy's dependency parses
  4. Hand-correct the dependency parses

It seems like it'll cut down on effort if we can run the dependency parse "live" through Prodigy after the POS tagging is corrected, but I'm not sure what that would look like...

Any thoughts would be very welcome!

Two notes:

  • The spaCy tagger and parser components are completely separate (POS is not a feature for the dependency parser), so you can treat POS annotation and dependency annotation as independent tasks. And if you only need parses in the end, you could focus on the dependency annotation alone.

  • Be aware that the pos recipes default to showing you Token.pos (UPOS), while in most cases the model underneath is predicting Token.tag (fine-grained tags), for spaCy v2 and Prodigy v1.10, anyway. If you're fine-tuning an existing tagger like the one from en_core_web_sm, you should use the --fine-grained option to work directly with the tags that the tagger is predicting. See more info here: Retraining POS Tagger · Issue #6283 · explosion/spaCy · GitHub
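
To make the Token.pos vs. Token.tag distinction a bit more concrete, here's a rough stdlib-only sketch. The mapping is a small hand-picked sample of Penn Treebank style tags and their usual UPOS equivalents, not spaCy's actual tag map, and the helper names (upos_view, fine_tags_for) are made up for illustration. The point is just that several fine-grained tags collapse into one coarse label, so a label chosen at the UPOS level doesn't always tell you which fine-grained tag was meant.

```python
# Hand-picked sample of fine-grained tags and the coarse UPOS label each one
# usually corresponds to. Illustrative only: this is NOT spaCy's full tag map.
FINE_TO_UPOS = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "NNPS": "PROPN",
    "JJ": "ADJ",
    "JJR": "ADJ",
    "JJS": "ADJ",
    "CD": "NUM",
}


def upos_view(tagged_tokens):
    """Return (text, fine_tag, upos) triples for (text, fine_tag) pairs."""
    return [(text, tag, FINE_TO_UPOS[tag]) for text, tag in tagged_tokens]


def fine_tags_for(upos):
    """List the fine-grained tags in the sample that map to one coarse label."""
    return sorted(tag for tag, coarse in FINE_TO_UPOS.items() if coarse == upos)


# A made-up utterance with fine-grained tags, as a tagger might predict them.
utterance = [("two", "CD"), ("big", "JJ"), ("doggies", "NNS")]

for text, tag, upos in upos_view(utterance):
    print(f"{text:10} tag={tag:4} pos={upos}")

# Going the other way is ambiguous: one coarse label, several candidate tags.
print(fine_tags_for("NOUN"))
print(fine_tags_for("ADJ"))
```

So if the annotation UI shows only the coarse column, the annotations line up with Token.pos rather than with what the tagger is trained to predict, which is why working with the fine-grained tags is the safer route when updating an existing tagger.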

As mentioned in that comment, we're updating the tag recipes for the upcoming version of Prodigy that supports spaCy v3, where this should hopefully be more straightforward.