Converting OntoNotes (.gold_conll file format) to DocBins (.spacy) or .jsonl files.

emiltj · February 3, 2023, 10:32am

Hi!

TL;DR: How do I convert .gold_conll (as seen in the table below) to .spacy or .jsonl?

#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001   0    6               a    DT      (NP(NP*        -    -   -   Speaker#1       *             *    (ARG1*   -
bc/cctv/00/cctv_0001   0    7         special    JJ            *        -    -   -   Speaker#1       *             *         *   -
bc/cctv/00/cctv_0001   0    8         edition    NN            *)       -    -   -   Speaker#1       *             *         *   -
bc/cctv/00/cctv_0001   0    9              of    IN         (PP*        -    -   -   Speaker#1       *             *         *   -
bc/cctv/00/cctv_0001   0   10          Across   NNP         (NP*        -    -   -   Speaker#1   (ORG*             *         *   -
bc/cctv/00/cctv_0001   0   11           China   NNP      *)))))))       -    -   -   Speaker#1       *)            *)        *)  -

#end document

My plan is to work with the OntoNotes data in spaCy and Prodigy. I have retrieved the OntoNotes dataset, but I am struggling to convert it to a format that spaCy or Prodigy can read (.spacy or .jsonl). The format is similar, but not identical, to two of the sample data formats that can be converted with the python -m spacy convert CLI command (ner-token-per-line.iob and ner-token-per-line-conll2003.iob). I have tried --converter auto, conllu, and conll, without success. Above is a table of what my data looks like, and this is where I have retrieved it from. I have been unable to find the entire dataset anywhere else, and none of the HuggingFace datasets I have seen provide 1) all the data and 2) the right format.

ryanwesslen · February 3, 2023, 1:00pm

hi @emiltj!

Thanks for posting your question on spaCy GitHub discussions.

Here's the link for others who find this post:

Converting Ontonotes ( .gold_conll file format) to DocBins (.spacy) or .jsonl files.
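
For anyone who wants a starting point right here: below is a minimal sketch, not taken from the linked discussion, of how the named-entity column could be pulled out of .gold_conll lines by hand. It assumes the column layout of the sample above (token text in the 4th whitespace-separated column, entity brackets such as (ORG* ... *) in the 11th), that entities are not nested, and that joining tokens with single spaces is acceptable. The function names (read_sentences, sentence_to_record, write_jsonl, records_to_docbin) are made up for illustration. The records follow the text/tokens/spans layout Prodigy uses for NER tasks; spaCy is only imported inside the DocBin helper, so the JSONL part runs without it.

```python
import json

WORD_COL = 3   # assumption: 4th column holds the token text
NER_COL = 10   # assumption: 11th column holds the entity brackets


def read_sentences(lines):
    """Yield one list of column-split rows per sentence."""
    rows = []
    for line in lines:
        line = line.strip()
        # Blank lines and "#begin/#end document" lines act as boundaries.
        if not line or line.startswith("#"):
            if rows:
                yield rows
                rows = []
            continue
        rows.append(line.split())
    if rows:
        yield rows


def sentence_to_record(rows):
    """Turn one sentence into a dict with text, tokens and entity spans."""
    tokens, spans, offset, open_ent = [], [], 0, None
    for i, cols in enumerate(rows):
        word, tag = cols[WORD_COL], cols[NER_COL]
        end = offset + len(word)
        tokens.append({"text": word, "start": offset, "end": end,
                       "id": i, "ws": i < len(rows) - 1})
        if tag.startswith("("):          # e.g. "(ORG*" or "(GPE)"
            open_ent = {"start": offset, "label": tag.strip("()*"),
                        "token_start": i}
        if tag.endswith(")") and open_ent is not None:
            open_ent.update(end=end, token_end=i)  # token_end is inclusive
            spans.append(open_ent)
            open_ent = None
        offset = end + 1                 # one space between tokens
    text = " ".join(t["text"] for t in tokens)
    return {"text": text, "tokens": tokens, "spans": spans}


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def records_to_docbin(records, path):
    """Optional: save the same records as a .spacy file (needs spaCy v3)."""
    import spacy
    from spacy.tokens import Doc, DocBin, Span

    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for rec in records:
        words = [t["text"] for t in rec["tokens"]]
        spaces = [t["ws"] for t in rec["tokens"]]
        doc = Doc(nlp.vocab, words=words, spaces=spaces)
        doc.ents = [Span(doc, s["token_start"], s["token_end"] + 1,
                         label=s["label"]) for s in rec["spans"]]
        doc_bin.add(doc)
    doc_bin.to_disk(path)
```

To use it, open a .gold_conll file, pass the file object to read_sentences, map sentence_to_record over the result, and hand the list to write_jsonl or records_to_docbin. This only covers the entity column; the POS, parse and coreference columns are ignored.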