I want to convert either a Prodigy JSONL sample (or, better yet, a spaCy Doc) to a CoNLL 2003 sample. In the CoNLL 2003 format documentation, I see there are 4 columns or items: "The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag". If my understanding is correct, I could obtain these items as follows:
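Something like this rough sketch, I think (the sentence and the entity character offsets are hand-written here just for illustration; with a real pipeline the entities would come from doc.ents):

```python
# Sketch: building CoNLL-2003-style columns from a spaCy Doc.
# Assumes spaCy v3; sentence and entity offsets are made up for illustration.
import spacy
from spacy.tokens import Doc
from spacy.training import biluo_to_iob, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."])

# Column 4: entity tags from character offsets (BILUO, then folded to IOB)
offsets = [(0, 4, "ORG"), (14, 19, "PER"), (30, 37, "LOC")]
ent_tags = biluo_to_iob(offsets_to_biluo_tags(doc, offsets))

for token, ent_tag in zip(doc, ent_tags):
    # Column 1: token.text
    # Column 2: token.tag_ (filled in once a trained tagger has run)
    # Column 3: the syntactic chunk tag -- this is the part I am missing
    print(token.text, token.tag_ or "-", "?", ent_tag)
```

That covers columns 1, 2 and 4; the 3rd column is the open question below.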
I've never really worked with BILUO, but have you seen this offsets_to_biluo_tags helper function? I could be wrong, but I think it does what you need.
This function needs a doc as input, though, but you can get that from Prodigy by first converting to spaCy. When you convert your data to the spaCy format via data-to-spacy you'll create a serialized DocBin object, which is a collection of spaCy documents.
To load the docs, you'll want to run something like:
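Roughly this (the on-disk path is a placeholder for whatever data-to-spacy produced; the example round-trips through bytes just so it runs standalone):

```python
# Sketch: loading the docs from a serialized DocBin.
# Normally you'd read the file data-to-spacy wrote, e.g.:
#   doc_bin = DocBin().from_disk("./train.spacy")   # path is a placeholder
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")  # provides the vocab for deserialization

# Self-contained stand-in for the on-disk file: serialize and reload in memory.
doc_bin = DocBin(docs=[Doc(nlp.vocab, words=["Hello", "world"])])
doc_bin = DocBin().from_bytes(doc_bin.to_bytes())

docs = list(doc_bin.get_docs(nlp.vocab))
```

From there you can iterate over `docs` and pass each one to the BILUO helper.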
Looks like I am already using offsets_to_biluo_tags for the 4th item (I have just updated the problem description, BTW), but my interest is in the 3rd item: the "syntactic chunk tag". An explanatory example of what the CoNLL 2003 format looks like:
1st item   2nd item   3rd item   4th item
U.N.       NNP        I-NP       I-ORG
official   NN         I-NP       O
Ekeus      NNP        I-NP       I-PER
heads      VBZ        I-VP       O
for        IN         I-PP       O
Baghdad    NNP        I-NP       I-LOC
.          .          O          O
And a nice explanation about chunking is available here.
Hello @koaning , and thanks for the follow-up of this case.
It looks like I stumbled upon a huge challenge in getting my project solved... But first, let me reply to some of your requests / comments:
Thanks for the suggestion. The thing is that I am converting FROM spaCy TO "something else" (CoNLL 2003), and the destination format makes that "syntactic chunking tag" MANDATORY.
Yes I have. In fact, I think I mentioned it before (check my first objection in that Stack Overflow case), but even if I succeeded in implementing a conversion, I have just found out that Spark NLP requires data to be in CoNLL 2003 format to train a NER model (which is what I am looking for in the end), whereas what spacy_conll produces is the "CoNLL-U" format, which is different and is used for a completely different task.
And now, "the main course": sadly, it seems that "phrase chunking" (the broader name for that missing 3rd item in the CoNLL 2003 format) is not a fully solved problem, so with ANY library the conversion could contain errors, which could in turn induce more errors when the Spark NLP model gets trained. Specifically for spaCy, I found some implementations which look more or less like this, but that chunking is still not in CoNLL 2003 format: to complete the conversion I would have to implement a set of rules myself, which implicitly means solving that "phrase chunking" problem I mentioned earlier.
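To illustrate the gap, here is a sketch of the kind of approximation those implementations boil down to: deriving NP chunk tags from spaCy's noun_chunks. The parse annotations below are hand-written so the snippet runs without a trained model; with a real pipeline you would simply do nlp(text):

```python
# Sketch: approximating CoNLL column 3 from spaCy noun phrases.
# The pos/heads/deps are hand-annotated stand-ins for a parser's output.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."],
    pos=["PROPN", "NOUN", "PROPN", "VERB", "ADP", "PROPN", "PUNCT"],
    heads=[2, 2, 3, 3, 3, 4, 3],
    deps=["compound", "compound", "nsubj", "ROOT", "prep", "pobj", "punct"],
)

# IOB chunk tags from noun phrases only. VP/PP chunks ("heads" -> I-VP,
# "for" -> I-PP in real CoNLL 2003) stay "O", which is exactly why this
# is an approximation rather than a faithful conversion.
chunk_tags = ["O"] * len(doc)
for chunk in doc.noun_chunks:
    chunk_tags[chunk.start] = "B-NP"
    for i in range(chunk.start + 1, chunk.end):
        chunk_tags[i] = "I-NP"
```

So the noun phrases come out right, but every verb and preposition chunk is silently dropped.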
The recommended solution for this issue, therefore? Tagging manually in CoNLL 2003 format. And thus my spaCy data cannot be used any longer...
Anyhow, at least I learnt some new topics.
Thanks for your support, if you have any idea, please let me know!