Not understanding how pos.make-gold works

I’m trying to use pos.make-gold. Per the recommendations in the docs, I’m doing a subset of tags at a time. Here is an example sequence:
prodigy dataset test "Just testing"
prodigy pos.make-gold test en_core_web_sm test.jsonl -l NOUN

This works fine and I can add and remove nouns. My test.jsonl has only five entries and I go through all of them. I have two questions about next steps.

(1) pos.gold-to-spacy doesn’t output anything

When I do:
prodigy pos.gold-to-spacy test gold.jsonl
I get this message:
✨ Exported 0 examples (skipped 123 containing invalid spans)
gold.jsonl
and the resulting file is empty.

My input looks like this:
{"n": "18", "text": "This is text.", "pn": "1234"}
so I don’t know where the spans are coming from.

(2) How to combine with annotations for other labels

I now want to work on verbs instead of nouns, so I do this:
prodigy pos.make-gold test en_core_web_sm test.jsonl -l VERB
But the web app just says:
No tasks available.
so I can’t do anything.

Is there a way to tell pos.make-gold to go back to the beginning?

@ines, I’d appreciate if you’d get back to me on this. I’ve read through the documentation many times and haven’t been able to figure this out.

Hi and sorry for only getting to this now! I remember looking at this thread but there were a few things I wanted to check to see whether you were potentially running into a bug.

The main validation performed by pos.gold-to-spacy is making sure that the highlighted spans in the data each contain only a single token, since that's the unit for part-of-speech tags (if you set PRODIGY_LOGGING=basic, you should see the examples it's skipping).
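For example, using your command from above:

PRODIGY_LOGGING=basic prodigy pos.gold-to-spacy test gold.jsonl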

However, after looking at the code, I wonder if we have an off-by-one error in there, and whether if (token_start + 1) != span["token_end"]: should maybe be if token_start != span["token_end"]: instead. (If so, how did this ever work?! :thinking: Argh!)
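In context, that would look something like this. Only the condition itself is quoted from the recipe, the surrounding lines are just a sketch of the logic I'd expect around it:

token_start = span["token_start"]
# proposed fix, previously: if (token_start + 1) != span["token_end"]:
if token_start != span["token_end"]:
    # the span covers more than one token, so there's no single
    # POS tag to export, and the example is skipped as invalid
    continue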

The token indices assigned to the spans are inclusive, so a span for token 5 would have both its start and end token index set to 5 (see the example further down). I'll find a test dataset and try it out, but in the meantime, try modifying this line in the pos.gold-to-spacy recipe in recipes/pos.py. To find the location of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

If this ends up being the problem, sorry about that :woman_facepalming:
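For reference, with inclusive indices, a valid single-token span in the annotated data would look roughly like this (the fields and character offsets are based on your example text, so treat the exact values as illustrative):

{"text": "This is text.", "spans": [{"start": 8, "end": 12, "token_start": 2, "token_end": 2, "label": "NOUN"}]}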

Hmm, I'm surprised this is happening, because the exclude logic does take the task hashes, and thus the text plus the suggested annotations, into account. So if you annotate different labels and Prodigy suggests the same text with, say, VERB instead of NOUN, it should receive a different hash and shouldn't be skipped. And if the model predicted nouns in your texts, it will almost certainly predict verbs as well, so a lack of VERB suggestions is unlikely to be the explanation.
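If you want to sanity-check the hashing yourself, you can compare the hashes Prodigy assigns with prodigy.set_hashes. The spans here are made up, but the idea is the same:

from prodigy import set_hashes

noun = set_hashes({"text": "This is text.", "spans": [{"token_start": 2, "token_end": 2, "label": "NOUN"}]})
verb = set_hashes({"text": "This is text.", "spans": [{"token_start": 1, "token_end": 1, "label": "VERB"}]})
print(noun["_input_hash"] == verb["_input_hash"])  # True, same input text
print(noun["_task_hash"] == verb["_task_hash"])    # False, different suggested annotations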

You could try using a new dataset, or setting "auto_exclude_current": false in your prodigy.json. This will prevent Prodigy from checking incoming tasks against the current dataset and excluding "duplicates". If this ends up fixing it, it'd be cool if you could send me one example from each of the two sets, so I can debug this and check the hashing.
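For example, your prodigy.json would contain:

{
  "auto_exclude_current": false
}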

Thank you, I’ll work on these and get back to you!