Using data-to-spacy with a --base-model gives alignment warnings and complains that there is no sentencizer (I don't want to use one), and it splits into sentences anyway...
How can I go from Prodigy to CoNLL (BIO2) format (not BILUO)?
If you want to export your data for use with spaCy, spaCy's training format always expects sentences under the hood. You can see an example of this here: https://spacy.io/api/annotation#json-input
If your annotations don't align to the base model's tokenization and you don't need annotations in spaCy's training format, I'm not sure data-to-spacy is a good fit here? If you've collected annotations with a manual workflow like ner.manual or ner.correct, the data you can export with db-out gives you the "tokens" and the character offsets of each annotated span, as well as its start and end token IDs. So the start token of a span is B, all other tokens in the span are I, and all tokens outside of a span are O.
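Here's a minimal sketch of that conversion in Python, in case it helps. It assumes each exported example has a "tokens" list and a "spans" list where every span carries "token_start", "token_end" and "label", and that "token_end" is the ID of the last token in the span (inclusive). The function name prodigy_to_bio and the sample example are made up for illustration, so check them against your own db-out export.

```python
def prodigy_to_bio(example):
    """Return (token_text, tag) pairs for one exported NER example."""
    tokens = example["tokens"]
    tags = ["O"] * len(tokens)  # every token starts out as "outside"
    for span in example.get("spans", []):
        # Assumption: token_end is inclusive (ID of the span's last token).
        start, end = span["token_start"], span["token_end"]
        tags[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + span["label"]
    return [(token["text"], tag) for token, tag in zip(tokens, tags)]


# Hypothetical example in the shape described above.
example = {
    "text": "Apple hired Tim Cook",
    "tokens": [
        {"text": "Apple", "start": 0, "end": 5, "id": 0},
        {"text": "hired", "start": 6, "end": 11, "id": 1},
        {"text": "Tim", "start": 12, "end": 15, "id": 2},
        {"text": "Cook", "start": 16, "end": 20, "id": 3},
    ],
    "spans": [
        {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "ORG"},
        {"start": 12, "end": 20, "token_start": 2, "token_end": 3, "label": "PERSON"},
    ],
}

# One token per line, tab-separated, as in CoNLL-style files.
for text, tag in prodigy_to_bio(example):
    print(f"{text}\t{tag}")
```

To write a full CoNLL-style file, you'd load each line of the exported JSONL with json.loads, run it through the function and separate examples with a blank line. You probably also want to skip examples whose "answer" isn't "accept".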
If you only have character offsets, spaCy can also generate IOB tags out-of-the-box: https://prodi.gy/docs/named-entity-recognition#tip-offsets-biluo
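If what you end up with is BILUO tags (which is what the linked tip seems to cover, going by the URL), getting to BIO2 is just a relabelling: L becomes I and U becomes B. A rough sketch, where biluo_to_bio is a made-up helper name and not part of spaCy or Prodigy:

```python
def biluo_to_bio(tags):
    """Map BILUO tags to BIO2: L-X becomes I-X, U-X becomes B-X."""
    bio = []
    for tag in tags:
        if tag.startswith("L-"):
            bio.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            bio.append("B-" + tag[2:])
        else:
            # B-X, I-X and O are the same in both schemes. Anything else
            # (e.g. a "-" marking a misaligned token) is passed through as-is.
            bio.append(tag)
    return bio


print(biluo_to_bio(["U-ORG", "O", "B-PERSON", "L-PERSON"]))
```

Tags that mark misaligned tokens are left untouched here, so you'd still need to decide how to handle those before writing the file.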