Prodigy JSONL (or spaCy Doc) to CoNLL 2003

Hello everyone,

I want to convert from either a Prodigy JSONL sample (or, better yet, a spaCy Doc) to a CoNLL 2003 sample. In the CoNLL 2003 format documentation, I see there are 4 columns or items: "The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag". If my understanding is correct, I could obtain these items as follows:

  • 1st item: The original word.
  • 2nd item: Could be obtained from "tag_" Token attribute.
  • 3rd item: [UNKNOWN]
  • 4th item: Could be obtained from "offsets_to_biluo_tags" (formerly "biluo_tags_from_offsets", as shown here)

NOTE: BTW, the previous implementation (as well as my other attempts) is available here.

With this preliminary understanding, I have the following questions (one per item):

  • 2nd item: Does the "tag_" Token attribute hold the same format as the one used in CoNLL 2003?
  • 3rd item: Where can I get this data from?
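For context, this is roughly how I'm sketching the conversion so far (the chunk column is deliberately left as a placeholder, since that's the open question; with a blank pipeline, "tag_" will also be empty):

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # stand-in; a trained pipeline would fill token.tag_
doc = nlp("Ekeus heads for Baghdad.")

# entity offsets as (start_char, end_char, label), Prodigy-span style
entities = [(0, 5, "PER"), (16, 23, "LOC")]

ner_tags = offsets_to_biluo_tags(doc, entities)  # 4th item, in BILUO scheme
rows = [
    # word, POS (empty on a blank pipeline), chunk (unknown), NER
    (token.text, token.tag_ or "_", "?", ner)
    for token, ner in zip(doc, ner_tags)
]
```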

Hope you can help me out with those queries.

Thank you!

I've never really worked with BILUO, but have you seen this offsets_to_biluo_tags helper function? I could be wrong, but I think it does what you need.

This function needs a Doc as input, but you can get that from Prodigy by converting to spaCy first. When you convert your data to the spaCy format via data-to-spacy, you'll create a serialized DocBin object, which is a collection of spaCy documents.

To load the docs, you'll want to run something like:

import spacy
from spacy.tokens import DocBin

# load the pipeline whose vocab will be used to deserialize the docs
nlp = spacy.load("whatever_model_you_want_to_use")

# read the serialized DocBin produced by data-to-spacy
doc_bin = DocBin().from_disk("./data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
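In case it helps to sanity-check that snippet, here's a self-contained round trip, using a blank pipeline in place of a trained one and writing the DocBin ourselves instead of via data-to-spacy:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # stand-in for whatever trained pipeline you use
doc = nlp("Ekeus heads for Baghdad.")

# serialize a DocBin to disk, as data-to-spacy would
DocBin(docs=[doc]).to_disk("./data.spacy")

# ...and load it back
doc_bin = DocBin().from_disk("./data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
```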

Hello @koaning ,

Looks like I am already using offsets_to_biluo_tags for the 4th item (I have just updated the problem description, BTW), but my interest is in the 3rd item: the "syntactic chunk tag". An example of what the CoNLL 2003 format looks like:

1st item     2nd item    3rd item    4th item
U.N.         NNP         I-NP        I-ORG 
official     NN          I-NP        O 
Ekeus        NNP         I-NP        I-PER 
heads        VBZ         I-VP        O 
for          IN          I-PP        O 
Baghdad      NNP         I-NP        I-LOC 
.            .           O           O 

And a nice explanation of chunking is available here.

Hope it gets clearer now, thanks!

If you're only interested in using spaCy: you can leave that column empty with a placeholder like _. The spaCy converter ignores this column.
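To make that concrete, here's a minimal sketch (the helper name is my own, not a spaCy API) that writes CoNLL 2003-style lines with the chunk column stubbed out:

```python
def to_conll_lines(rows, chunk_placeholder="_"):
    # rows: (word, pos_tag, ner_tag) triples; the chunk column is filled
    # with a placeholder value that the spaCy converter will ignore
    return [f"{word} {pos} {chunk_placeholder} {ner}" for word, pos, ner in rows]

lines = to_conll_lines([("Ekeus", "NNP", "I-PER"), ("heads", "VBZ", "O")])
```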

Before diving in further, one thing might be worth checking: have you seen this project?

I cannot judge how well it might fit your needs, but it does seem to be a maintained project.

Hello @koaning , and thanks for following up on this case.

It looks like I have stumbled upon a huge challenge in getting my project solved... But first, let me reply to some of your requests / comments:

Thanks for the suggestion. The thing is that I am converting FROM spaCy TO "something else" (CoNLL 2003), and the destination format makes that "syntactic chunk tag" MANDATORY :confused:.

Yes, I have. In fact, I think I mentioned it before (check my first objection in that Stack Overflow case). But even if I succeeded in implementing a conversion, I have just found out that Spark NLP requires data in CoNLL 2003 format to train a NER model (which is what I am after in the end), and what spacy_conll gives me is the CoNLL-U format, which is different and is used for a completely different task.

And now, "the main course": sadly, it seems that "phrase chunking" (the broader name for that missing 3rd item in the CoNLL 2003 format) is an unsolved problem in Computer Science, so in ANY case (or library used), the conversion could contain errors, which could, in turn, induce further errors when the Spark NLP model gets trained. Specifically for spaCy, I found some implementations which look more or less like this, but that chunking is still not in CoNLL 2003 format: to carry on with the conversion, I would have to implement a set of rules to do so, which implicitly involves solving the "phrase chunking" problem I mentioned earlier.
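For what it's worth, the rule set is mechanical once you have chunk spans from somewhere (e.g. spaCy's doc.noun_chunks, which only yields NP chunks). Here's a rough sketch of spans-to-IOB1 tagging (the function name is mine, and I'm assuming sorted, non-overlapping spans):

```python
def chunk_spans_to_iob1(n_tokens, chunk_spans, label="NP"):
    # chunk_spans: sorted, non-overlapping (start, end) token spans,
    # e.g. [(0, 2), (3, 4)] -- end index is exclusive
    tags = ["O"] * n_tokens
    prev_end = None
    for start, end in chunk_spans:
        for i in range(start, end):
            tags[i] = f"I-{label}"
        # IOB1 (as used in CoNLL 2003) uses B- only when a chunk
        # immediately follows another chunk of the same type
        if prev_end == start:
            tags[start] = f"B-{label}"
        prev_end = end
    return tags
```

This only covers NP chunks, of course; CoNLL 2003 also has VP, PP, etc., which spaCy does not expose directly, and that is exactly the gap I am describing.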

The recommended solution for this issue, therefore? Tagging manually in CoNLL 2003 format. And thus my spaCy data cannot be used any longer :face_with_head_bandage:...

Anyhow, at least I learned about some new topics :sweat_smile:.

Thanks for your support, if you have any idea, please let me know!
