Misaligned entities only in train-curve

Hey @ines, I hope it's okay if I ask a follow-up in here.
I went ahead and added encoding="utf-8" directly in my site-packages, and the recipe runs now.
I get the following warning though:

C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\spacy\training\iob_utils.py:142: UserWarning: [W030] Some entities could not be aligned in the text "[16] II.3. Das Ausmaß der Zinsminderung richtet si..." with entities "[(121, 130, 'JUSTIZ'), (275, 289, 'JUSTIZ'), (291,...". Use spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities) to check the alignment. Misaligned entities ('-') will be ignored during training.
  entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
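For anyone hitting the same warning, here is a minimal sketch of the alignment check that W030 suggests, using spacy.training.offsets_to_biluo_tags. This isn't from the original thread: the text, the character offsets and the pipeline path are placeholders, so substitute the full text and entity offsets of the example that actually triggers the warning.

```python
# Minimal sketch of the check the W030 warning suggests. The text, offsets and
# pipeline path below are placeholders, not the real data from this thread.
import spacy
from spacy.training import offsets_to_biluo_tags

import functions  # noqa: F401  (assumption: this registers the custom tokenizer)

nlp = spacy.load("./final-model-t2v")  # the pipeline passed via -m

text = "Das Ausmaß der Zinsminderung richtet sich nach § 536 BGB."
entities = [(47, 56, "JUSTIZ")]  # placeholder (start, end, label) character offsets

doc = nlp.make_doc(text)
tags = offsets_to_biluo_tags(doc, entities)

# Tokens tagged "-" could not be aligned and would be ignored during training.
for token, tag in zip(doc, tags):
    print(f"{token.text!r:20} {tag}")
```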

What's weird to me about this is that, once again, I'm not getting this when running any other recipe. I used to get it during training when I wasn't using my custom tokenizer, but I'm providing the model and the callbacks to train-curve, and yet I still see this issue.

This is the command I'm using:

python -m prodigy train-curve --ner train_ner_citation -m .\final-model-t2v\ -F .\functions.py

Do you have any pointers as to why this is happening here but not with other recipes?

Hi! I hope it's okay I moved this to a separate thread – since it's a slightly different problem, this makes it a bit easier to keep track of the reports :slight_smile:

That's definitely strange – so when you run train instead of train-curve, it works fine and you don't see the warning? If you're using a custom tokenizer, I do wonder if the train-curve recipe somehow ends up not detecting/using it properly. Which is again slightly odd, because train-curve should generate the config the same way.

One option to check this would be to print config.to_str() in the recipe script to double-check that it looks correct and has your custom tokenizer.
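As an additional sanity check (not something from this thread, just a rough sketch assuming the paths from the command above), you can also load the pipeline you pass via -m and confirm that its resolved config still references the registered tokenizer callback:

```python
# Rough sketch: confirm the saved pipeline resolves to the custom tokenizer.
# Assumes functions.py registers the tokenizer callback and ./final-model-t2v
# is the pipeline passed via -m.
import spacy

import functions  # noqa: F401

nlp = spacy.load("./final-model-t2v")
print(nlp.config["nlp"]["tokenizer"])   # the [nlp.tokenizer] block of the config
print(type(nlp.tokenizer).__name__)     # the tokenizer class actually in use
```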

Another idea I had was that it could be related to the random split and different examples ending up in training vs. evaluation across the runs. But that doesn't make sense, because that's explicitly something we've taken care to avoid: train-curve will always evaluate on the same held-back evaluation set so the results stay comparable across different runs.

That's correct :man_shrugging:
I do think it's detecting the custom tokenizer, because it complains about not being able to find that callback if I don't provide -F functions.py.

Where can I find the recipes so I can edit them? Or do you mean I should write my own recipe to add config.to_str()?

You can run prodigy stats to find the location of your Prodigy installation, and then you can just edit recipes/train.py in there and add your own logging for debugging. So right after the train-curve recipe creates the config, you could add print(config.to_str()) to output the formatted version of the config, so you can double-check that it looks correct.
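For illustration, the debugging edit could look roughly like the snippet below; the surrounding recipe code isn't shown and the variable name is an assumption, since the exact structure of recipes/train.py depends on the Prodigy version.

```python
# Inside your local copy of recipes/train.py, in the train-curve recipe, right
# after the training config is created (the variable is assumed to be `config`;
# check the actual name in your version):
print(config.to_str())  # dumps the resolved config, including [nlp.tokenizer]
```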

I just wanted to follow your advice, but the issue is gone. I have no clue why, since nothing has changed, but I don't get the error anymore.

Thank you for your help!