Hi!
TL;DR: How do I convert .gold_conll (as seen in the table below) to .spacy or .jsonl?
#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001 0 6 a DT (NP(NP* - - - Speaker#1 * * (ARG1* -
bc/cctv/00/cctv_0001 0 7 special JJ * - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 8 edition NN *) - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 9 of IN (PP* - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 10 Across NNP (NP* - - - Speaker#1 (ORG* * * -
bc/cctv/00/cctv_0001 0 11 China NNP *))))))) - - - Speaker#1 *) *) *) -
#end document
My plan is to work with the OntoNotes data in spaCy and Prodigy. I have retrieved the OntoNotes dataset, but I am struggling to convert it to a format that spaCy and Prodigy can use (.spacy or .jsonl). The format is similar, but not identical, to two of the sample data formats that can be converted with the python -m spacy convert CLI command (ner-token-per-line.iob and ner-token-per-line-conll2003.iob). I have tried --converter auto, conllu, and conll. Above is a table of what my data looks like, and this is where I have retrieved it from. I have been unable to find the entire dataset anywhere else, and none of the Hugging Face datasets I have seen offer 1) all the data and 2) the right format.
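For reference, here is a rough sketch of the kind of conversion I have in mind, in case it helps clarify the question. It reads the whitespace-separated columns, takes the word from column 3 and the named-entity bracket from column 10 (0-based; this column position is an assumption based on the sample above and the usual CoNLL-2012 layout), and turns the bracket notation into BIO tags, one JSON record per sentence. The function name `parse_gold_conll` and the output keys `tokens` and `ner` are made up for illustration and are not part of spaCy or Prodigy.

```python
import json

# Assumption: 0-based index of the named-entity column in the .gold_conll layout.
NER_COL = 10

def parse_gold_conll(lines):
    """Yield one dict per sentence with its words and BIO entity tags."""
    words, tags = [], []
    open_label = None  # label of the entity that is currently open, if any
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            # A blank line or a #begin/#end marker closes the current sentence.
            if words:
                yield {"tokens": words, "ner": tags}
            words, tags, open_label = [], [], None
            continue
        cols = line.split()
        word, ner = cols[3], cols[NER_COL]
        if ner.startswith("("):
            # "(ORG*" opens an entity; "(GPE)" opens and closes on one token.
            open_label = ner.strip("()*")
            tags.append("B-" + open_label)
        elif open_label is not None:
            tags.append("I-" + open_label)
        else:
            tags.append("O")
        if ner.endswith(")"):
            open_label = None
        words.append(word)
    if words:
        yield {"tokens": words, "ner": tags}

SAMPLE = """\
#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001 0 9 of IN (PP* - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 10 Across NNP (NP* - - - Speaker#1 (ORG* * * -
bc/cctv/00/cctv_0001 0 11 China NNP *))))))) - - - Speaker#1 *) *) *) -
#end document
"""

# Print one JSON object per sentence (JSONL).
for record in parse_gold_conll(SAMPLE.splitlines()):
    print(json.dumps(record))
```

The sketch only handles the entity column and ignores the parse, predicate, and coreference columns. The BIO tags could then be written out in a one-token-per-line IOB file for spacy convert, or used to build Doc objects directly, but I have not verified either route on the full dataset.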