Hi,
I wanted to try the ner.make-gold recipe on a dataset I’ve been building for some time (i.e. across several versions of Prodigy, including the beta). I’m getting a ValueError: Mismatched tokenization, telling me that Prodigy can’t find the span at the provided start / end indices (with a very nice and clear message, by the way).
The thing is that I don’t want to throw away this dataset :). I guess this problem won’t happen a lot, so maybe you could log a warning saying that the example will be discarded and then continue with the next example?
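To make the suggestion concrete, here is a rough sketch of the "warn and skip" behaviour I have in mind. It is not Prodigy's actual code: `filter_aligned` and `token_boundaries` are hypothetical names of my own, and plain whitespace splitting stands in for the real tokenizer.

```python
import logging
import re

log = logging.getLogger("make_gold_sketch")  # hypothetical logger name


def token_boundaries(text):
    """Return the character offsets where whitespace-delimited tokens start and end.

    Assumption: whitespace splitting is only a stand-in for the real tokenizer.
    """
    starts, ends = set(), set()
    for match in re.finditer(r"\S+", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def filter_aligned(examples):
    """Yield examples whose spans all fall on token boundaries; warn and skip the rest."""
    for eg in examples:
        starts, ends = token_boundaries(eg["text"])
        bad = [
            span for span in eg.get("spans", [])
            if span["start"] not in starts or span["end"] not in ends
        ]
        if bad:
            log.warning(
                "Skipping example with %d mismatched span(s): %r",
                len(bad), eg["text"][:50],
            )
            continue
        yield eg


examples = [
    {"text": "I like YouTube TV", "spans": [{"start": 7, "end": 17}]},  # aligned
    {"text": "I like YouTube TV", "spans": [{"start": 8, "end": 17}]},  # misaligned
]
kept = list(filter_aligned(examples))
print(len(kept))
```

With this approach the misaligned example is reported and dropped instead of stopping the whole session.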
As for the error itself, it’s quite strange, because when I process the sentence with spaCy I get the same start / end indices as those recorded in my dataset. Here is the example:
{
"text": " The upstart streaming service, which is primarily geared for sports fans, has an uphill climb against deep-pocketed competitors marketing cable alternatives to cord-cutters: YouTube TV, Hulu Live and Sony's PlayStation Vue.",
"spans": [
{
"answer": "accept",
"end": 185,
"input_hash": -2121127423,
"label": "PRODUCT",
"rank": 1,
"score": 0.3231683859,
"source": "core_web_sm",
"start": 175,
"text": "YouTube TV"
},
Did you add some preprocessing on the sentences, like stripping whitespace?
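For what it's worth, here is a quick check of why I suspect whitespace stripping. It is only a sketch in plain Python (no Prodigy or spaCy involved), and `span_matches` is a helper name I made up. The text in my example starts with a space, so the recorded offsets only line up as long as that leading space is kept.

```python
text = (
    " The upstart streaming service, which is primarily geared for sports fans, "
    "has an uphill climb against deep-pocketed competitors marketing cable "
    "alternatives to cord-cutters: YouTube TV, Hulu Live and Sony's PlayStation Vue."
)
span = {"start": 175, "end": 185, "text": "YouTube TV"}


def span_matches(text, span):
    """Check that the recorded offsets slice out exactly the recorded span text."""
    return text[span["start"]:span["end"]] == span["text"]


# Offsets as recorded in the dataset, against the original text (leading space kept).
original_ok = span_matches(text, span)

# Same offsets against the text after stripping whitespace.
stripped = text.strip()
stripped_ok = span_matches(stripped, span)

# Number of leading whitespace characters, i.e. how far the offsets would shift.
shift = len(text) - len(text.lstrip())

print(original_ok, stripped_ok, shift)  # True False 1
```

If the text is stripped somewhere along the way, the same offsets point one character too far to the right, which would explain the mismatch.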
Thomas