Pipeline for POS corrections and dep corrections

jmankewitz · March 31, 2021, 12:44am

Hello! New Prodigy user here!

We have some large corpora of conversational speech that we'd like to eventually parse with spaCy and correct with Prodigy. However, the nature of the corpora (child-produced speech) means some POS tags need to be hand-corrected as well, which will affect the dependency parse.

Our first thought is roughly the following:

1. Use spaCy to tag POS
2. Correct POS with Prodigy
3. Use the gold-standard POS tags to inform spaCy dependency parses
4. Hand-correct the dependency parses

It seems like it would cut down on effort if we could run the dependency parse "live" through Prodigy after the POS tagging is corrected, but I'm not sure what that would look like...

Any thoughts would be very welcome!

adriane · March 31, 2021, 5:54pm

Two notes:

1. The spaCy tagger and parser components are completely separate (POS is not a feature for the dependency parser), so you can treat POS annotation and dependency parse annotation as completely separate tasks. And if you only need parses in the end, you could focus on the dependency annotation.
2. Be aware that the pos recipes default to showing you Token.pos (UPOS), while the model underneath is predicting Token.tag (fine-grained tags) in most cases (for spaCy v2 and Prodigy v1.10, anyway). If you're fine-tuning an existing tagger such as the one from en_core_web_sm, you should use the --fine-grained option to work directly with the tags that the tagger is predicting. For more information, see "Retraining POS Tagger" (Issue #6283, explosion/spaCy on GitHub).
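To illustrate why the second note matters, here is a minimal sketch of how several fine-grained tags collapse into one UPOS label. The mapping below is a small, simplified subset written for illustration only; it is an assumption and not the tag map shipped with any particular model, and the helper names are hypothetical.

```python
from collections import defaultdict

# Simplified, illustrative subset of a Penn Treebank -> UPOS mapping.
# Assumption: a real model's mapping is larger and may depend on context.
FINE_TO_UPOS = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBG": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
}


def group_by_upos(mapping):
    """Invert the mapping: UPOS label -> sorted list of fine-grained tags."""
    groups = defaultdict(list)
    for fine, coarse in mapping.items():
        groups[coarse].append(fine)
    return {coarse: sorted(fines) for coarse, fines in groups.items()}


UPOS_GROUPS = group_by_upos(FINE_TO_UPOS)


def candidates_for(upos):
    """Return the fine-grained tags a UPOS label could stand for in this toy mapping."""
    return UPOS_GROUPS.get(upos, [])


if __name__ == "__main__":
    for upos in sorted(UPOS_GROUPS):
        print(upos, "<-", ", ".join(UPOS_GROUPS[upos]))
```

In this toy mapping a correction made at the UPOS level (say, VERB) does not say which fine-grained tag (VB, VBD, VBG or VBZ) the tagger should have predicted, which is why annotating the fine-grained tags directly is the safer choice when updating a tagger that predicts them.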

As mentioned in that comment, we're updating the tag recipes for the upcoming version of Prodigy that supports spaCy v3, where this should hopefully be more straightforward.
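Because POS and dependency annotation can be treated as separate tasks, one possible setup is to prepare a single source file and feed it to each annotation task independently. The sketch below builds JSONL lines with a "text" key and an optional "meta" dict, the usual shape for Prodigy input; the function name, the meta fields and the example utterances are made up for illustration.

```python
import json


def to_jsonl_lines(utterances, corpus_name):
    """Turn raw utterances into JSONL task lines, skipping empty ones.

    The "meta" fields (corpus, utt_id) are hypothetical bookkeeping values
    that make it easier to trace an annotation back to its source line.
    """
    lines = []
    for i, text in enumerate(utterances):
        text = text.strip()
        if not text:
            continue  # skip blank utterances
        task = {"text": text, "meta": {"corpus": corpus_name, "utt_id": i}}
        lines.append(json.dumps(task, ensure_ascii=False))
    return lines


# Made-up example utterances, not taken from any real corpus.
utterances = ["me want cookie", "   ", "doggie goed away"]
lines = to_jsonl_lines(utterances, "example_corpus")

if __name__ == "__main__":
    # Write one task per line; the file name is an arbitrary choice.
    with open("utterances.jsonl", "w", encoding="utf8") as f:
        f.write("\n".join(lines) + "\n")
```

The same file could then be passed as the source to a POS recipe and to a dependency recipe in separate sessions with separate datasets; check the recipe documentation for your Prodigy version for the exact arguments.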
