Not understanding how pos.make-gold works

I’m trying to use pos.make-gold. Per the recommendations in the docs, I’m doing a subset of tags at a time. Here is an example sequence:
prodigy dataset test "Just testing"
prodigy pos.make-gold test en_core_web_sm test.jsonl -l NOUN

This works fine and I can add and remove nouns. My test.jsonl has only five entries and I go through all of them. I have two questions about next steps.

(1) pos.gold-to-spacy doesn’t output anything

When I do:
prodigy pos.gold-to-spacy test gold.jsonl
I get this message:
✨ Exported 0 examples (skipped 123 containing invalid spans)
gold.jsonl
and the resulting file is empty.

My input looks like this:
{"n": "18", "text": "This is text.", "pn": "1234"}
so I don’t know where the spans are coming from.

(2) How to combine with annotations for other labels

I now want to work on verbs instead of nouns, so I do this:
prodigy pos.make-gold test en_core_web_sm test.jsonl -l VERB
But the web app just says:
No tasks available.
so I can’t do anything.

Is there a way to tell pos.make-gold to go back to the beginning?

@ines, I’d appreciate if you’d get back to me on this. I’ve read through the documentation many times and haven’t been able to figure this out.

Hi and sorry for only getting to this now! I remember looking at this thread but there were a few things I wanted to check to see whether you were potentially running into a bug.

The main validation performed by pos.gold-to-spacy is making sure that the highlighted spans in the data each contain only a single token, since that's the unit for part-of-speech tags (if you set PRODIGY_LOGGING=basic, you should see the examples it's skipping).
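For example, using your command from above:

PRODIGY_LOGGING=basic prodigy pos.gold-to-spacy test gold.jsonl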

However, after looking at the code, I wonder if we have an off-by-one error in there, and whether if (token_start + 1) != span["token_end"]: should maybe be if token_start != span["token_end"]: instead. (If so, how did this ever work?! :thinking: Argh!)
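In context, that would look something like this. Only the condition itself is quoted from the recipe, the surrounding lines are just a sketch of the logic I'd expect around it:

token_start = span["token_start"]
# proposed fix, previously: if (token_start + 1) != span["token_end"]:
if token_start != span["token_end"]:
    # the span covers more than one token, so there's no single
    # POS tag to export, and the example is skipped as invalid
    continue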

The token indices assigned to the spans are inclusive, so a span for token 5 would have both its start and end token index set to 5 (see the example further down). I'll find a test dataset and try it out, but in the meantime, try modifying this line in the pos.gold-to-spacy recipe in recipes/pos.py. To find the location of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

If this ends up being the problem, sorry about that :woman_facepalming:
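For reference, with inclusive indices, a valid single-token span in the annotated data would look roughly like this (the fields and character offsets are based on your example text, so treat the exact values as illustrative):

{"text": "This is text.", "spans": [{"start": 8, "end": 12, "token_start": 2, "token_end": 2, "label": "NOUN"}]}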

Hmm, I'm surprised this is happening, because the exclude logic does take the task hashes, and thus the text plus the suggested annotations, into account. So if you annotate different labels and Prodigy suggests the same text with, say, VERB instead of NOUN, it should receive a different hash and shouldn't be skipped. And if the model predicted nouns in your texts, it will almost certainly predict verbs as well, so a lack of VERB suggestions is unlikely to be the explanation.
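If you want to sanity-check the hashing yourself, you can compare the hashes Prodigy assigns with prodigy.set_hashes. The spans here are made up, but the idea is the same:

from prodigy import set_hashes

noun = set_hashes({"text": "This is text.", "spans": [{"token_start": 2, "token_end": 2, "label": "NOUN"}]})
verb = set_hashes({"text": "This is text.", "spans": [{"token_start": 1, "token_end": 1, "label": "VERB"}]})
print(noun["_input_hash"] == verb["_input_hash"])  # True, same input text
print(noun["_task_hash"] == verb["_task_hash"])    # False, different suggested annotations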

You could try using a new dataset, or setting "auto_exclude_current": false in your prodigy.json. This will prevent Prodigy from checking incoming tasks against the current dataset and excluding "duplicates". If this ends up fixing it, it'd be cool if you could send me one example from each of the two sets, so I can debug this and check the hashing.
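For example, your prodigy.json would contain:

{
  "auto_exclude_current": false
}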

Thank you, I’ll work on these and get back to you!