Colloquial pronouns not labeled as PRON

Currently I use spaCy only for lemmatization/parsing of colloquial language, without word vector capabilities. I have a problem with the POS tagging. My code

tokens = [tok.lemma_.lower().strip() for tok in doc if tok.pos_ != 'PRON']

recognizes “my”, “your” etc. as PRON, but not “mine”, “your’s”. My current hack is to replace all occurrences of “mine” with “my”, but it’s hardly elegant. (The word “mine” does not occur as “the explosive device” in my document.)

Any suggestions? Or should I just keep hardcoding?



If your hard-coded solution works well, why not :wink:
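
If you stick with hard-coding, one slightly tidier option is to leave the text alone and filter on a small set of extra forms next to the POS check. This is only a sketch: the set contents and the function name are made up for illustration, and it assumes the tokens expose lower_, lemma_ and pos_ the way spaCy tokens do.

```python
# Hypothetical set of pronoun forms the tagger misses; extend it for your data.
EXTRA_PRONOUNS = {"mine", "yours", "hers", "ours", "theirs"}


def content_lemmas(doc, extra_pronouns=EXTRA_PRONOUNS):
    """Return lowercased lemmas, skipping tagged and hard-coded pronouns."""
    return [
        tok.lemma_.lower().strip()
        for tok in doc
        # Drop anything tagged PRON, plus the forms listed above.
        if tok.pos_ != "PRON" and tok.lower_ not in extra_pronouns
    ]


# Usage with spaCy (assumes en_core_web_sm is installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# tokens = content_lemmas(nlp("That book is mine, not yours."))
```

Forms with an apostrophe such as “your’s” may be split into several tokens by the tokenizer, so check how they come out before adding them to the set.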

But you could also try and improve the POS tagger, specifically the PRON label, on your data using the pos.teach recipe:

prodigy pos.teach your_dataset en_core_web_sm ./your_data.jsonl --label PRON
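
The input file for that command is newline-delimited JSON with one "text" entry per line. A minimal sketch of writing it, assuming your sentences are already in a Python list (the helper name and the example sentences are made up):

```python
import json


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


# Usage (made-up sentences; pick ones with a good mix of pronouns):
# write_jsonl(["That one is mine.", "Is this yours?"], "./your_data.jsonl")
```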

Once you’ve labelled some examples, you can run pos.batch-train and see if it improves. Ideally, you also want to evaluate it against a representative set that includes a good mix of pronouns.
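
To check whether the tagger really got better at pronouns specifically, you can compare its tags against hand-checked ones before and after training. A rough sketch, assuming you have the gold and predicted tags as aligned lists of strings (the function name is made up, and this is not what the training recipe itself reports):

```python
def label_precision_recall(gold_tags, pred_tags, label="PRON"):
    """Precision and recall for a single label over aligned tag sequences."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted tags must be the same length")
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(g == label and p == label for g, p in pairs)  # correctly tagged
    fp = sum(g != label and p == label for g, p in pairs)  # tagged by mistake
    fn = sum(g == label and p != label for g, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Low recall on PRON would point to missed forms like “mine”, while low precision would mean other words are now being tagged as pronouns.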