Converting OntoNotes (.gold_conll file format) to DocBins (.spacy) or .jsonl files

Hi!

TL;DR: How do I convert .gold_conll (as seen in the table below) to .spacy or .jsonl?

#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001   0    6               a    DT      (NP(NP*        -    -   -   Speaker#1       *             *    (ARG1*   -
bc/cctv/00/cctv_0001   0    7         special    JJ            *        -    -   -   Speaker#1       *             *         *   -
bc/cctv/00/cctv_0001   0    8         edition    NN            *)       -    -   -   Speaker#1       *             *         *   -
bc/cctv/00/cctv_0001   0    9              of    IN         (PP*        -    -   -   Speaker#1       *             *         *   -
bc/cctv/00/cctv_0001   0   10          Across   NNP         (NP*        -    -   -   Speaker#1   (ORG*             *         *   -
bc/cctv/00/cctv_0001   0   11           China   NNP      *)))))))       -    -   -   Speaker#1       *)            *)        *)  -

#end document

My plan is to work with the OntoNotes data in spaCy and Prodigy. I have retrieved the OntoNotes dataset, but I am struggling to convert it to a format that spaCy and Prodigy can use (.spacy or .jsonl). The format is similar, but not identical, to two of the sample data formats (ner-token-per-line.iob and ner-token-per-line-conll2003.iob) that can be converted with the python -m spacy convert CLI command. I have tried --converter auto, conllu, and conll. Above is a table of what my data looks like, and this is where I retrieved it from. I have been unable to find the entire dataset anywhere else, and none of the HuggingFace datasets offer 1) all the data and 2) the right format.
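For reference, here is a minimal sketch of how the named-entity column in rows like the ones above could be decoded into token-level spans using only the Python standard library. It assumes the CoNLL-2012 column layout that the sample rows appear to follow (word in field 3, entity brackets in field 10, counting from 0) and that entity brackets are not nested. The function name `read_ner_spans` and the constants are hypothetical, and sentence boundaries are ignored, so this is a starting point and not a complete converter.

```python
# Assumed column positions (0-based), based on the sample rows in this post.
WORD_COL = 3   # the token text
NER_COL = 10   # bracketed named-entity annotation, e.g. "(ORG*", "*", "*)"


def read_ner_spans(lines):
    """Return (words, spans); spans are (start, end, label) with end exclusive."""
    words = []
    spans = []
    open_start = None
    open_label = None
    for line in lines:
        line = line.strip()
        # Skip blank lines and the "#begin document" / "#end document" markers.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        words.append(fields[WORD_COL])
        tag = fields[NER_COL]
        index = len(words) - 1
        # "(ORG*" or "(ORG)" opens an entity at this token.
        if tag.startswith("("):
            open_start = index
            open_label = tag.strip("()*")
        # "*)" or "(ORG)" closes the entity that is currently open.
        if tag.endswith(")") and open_label is not None:
            spans.append((open_start, index + 1, open_label))
            open_start = None
            open_label = None
    return words, spans


sample = """\
#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001   0    9       of    IN     (PP*   -   -   -   Speaker#1      *    *    *   -
bc/cctv/00/cctv_0001   0   10   Across   NNP     (NP*   -   -   -   Speaker#1  (ORG*    *    *   -
bc/cctv/00/cctv_0001   0   11    China   NNP *)))))))   -   -   -   Speaker#1      *)   *)   *)  -

#end document
"""

words, spans = read_ner_spans(sample.splitlines())
print(words, spans)
```

The resulting words and spans could then be turned into spaCy objects with `Doc(vocab, words=words)` and `Span(doc, start, end, label=label)`, and the docs collected in a `DocBin` that is written out with `to_disk`.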

Hi @emiltj!

Thanks for posting your question on spaCy GitHub discussions.

Here's the link for others who find this post: