Ner format to CONLL

JoaoMVR · January 24, 2019, 2:37pm

I’ve just acquired Prodigy to work on a manual NER tagging task. After manually tagging a document, the tool exports the results in a JSON format, the standard spaCy format. We would like to have this data tagged in the CoNLL format, the column format, like so:

John PERSON
works O
for O
Microsoft ORGANIZATION

Is there an option to do this or should we opt to post-process the json file in order to do so?

Thanks in advance

ines · January 24, 2019, 4:07pm

Hi! I’d recommend writing your own converter, yes. spaCy actually ships with a biluo_tags_from_offsets helper that takes a text and character offsets and returns the BILUO entity labels. So this might be helpful?

You can also interact with Prodigy’s database directly from Python, so you’ll be able to skip the whole exporting/importing/exporting part.

Here’s an example (untested, but something along those lines should work):

from prodigy.components.db import connect
from spacy.gold import biluo_tags_from_offsets
from spacy.lang.en import English   # or whichever language tokenizer you need

nlp = English()

db = connect()  # uses settings from your prodigy.json
examples = db.get_dataset('your_dataset')  # load the annotations

for eg in examples:
    doc = nlp(eg['text'])
    entities = [(span['start'], span['end'], span['label'])
                for span in eg['spans']]
    tags = biluo_tags_from_offsets(doc, entities)
    # do something with the tags here
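
To get from those tags to the column format in the question, something like the following might work as a starting point. It's a minimal sketch: to_conll_lines is a hypothetical helper name (not part of spaCy or Prodigy), and mapping the "-" tag (which marks tokens that don't line up with an annotated span) to O is an assumption you may want to handle differently. It uses plain lists of strings so it runs without spaCy; inside the loop above you'd pass [t.text for t in doc] and tags.

```python
def to_conll_lines(tokens, biluo_tags):
    """Pair each token with its entity label, one "token LABEL" string per token."""
    lines = []
    for token, tag in zip(tokens, biluo_tags):
        # BILUO tags look like "U-PERSON" or "B-ORG"; "O" means outside any entity.
        # Assumption: "-" (misaligned token) is written out as "O" as well.
        if tag in ("O", "-", ""):
            label = "O"
        else:
            label = tag.split("-", 1)[1]  # drop the B/I/L/U prefix, keep the label
        lines.append(f"{token} {label}")
    return lines


# Hand-written tokens and tags matching the example in the question
tokens = ["John", "works", "for", "Microsoft"]
tags = ["U-PERSON", "O", "O", "U-ORGANIZATION"]
print("\n".join(to_conll_lines(tokens, tags)))
```

If you need the positional prefixes in the output (for example for CoNLL 2003 style IOB tags), you'd keep the prefix instead of dropping it.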

JoaoMVR · January 24, 2019, 4:32pm

Works perfectly, thanks a lot!

bjornvandijkman · June 4, 2019, 9:43am

I have tried to do this for my JSONL file. However, I’m quite new to programming. The following script almost works, but it seems to go over the same line in the JSONL file 8 times. Can anyone point out what I’m doing wrong here? The variable result holds my data.

extended_tags = []
extended_entities = []
extended_token = []

for i in range(len(result)):
    data = result[i]
    for d in data:
        doc = nlp(data['text'])
        for token in doc:
            extended_token.append(token)
        entities = [(span['start'], span['end'], span['label'])
                    for span in data['spans']]
        tags = biluo_tags_from_offsets(doc, entities)
        extended_tags.extend(tags)
        extended_entities.extend(entities)

ines · June 4, 2019, 9:52am

@bjornvandijkman I think the indentation got a bit messed up when you copy-pasted the code over. Could you update that when you have a second? Otherwise, it’s a bit difficult to follow because it’s unclear what’s in which block. Also, what does your result look like?

bjornvandijkman · June 4, 2019, 10:06am

I updated the code. The result is a dataset created using ner.manual and loaded into examples as you indicated. The only other thing I did was remove the examples where I did not accept the annotation, using the following code:

# Only keep the accepted answers, as the rejected ones have no span
result = []
for i in examples:
    if i['answer'] == "accept":
        result.append(i)

Format looks as follows for two of the lines:

text, _input_hash, _task_hash, tokens, _session_id, _view_id, spans, answer
text, _input_hash, _task_hash, tokens, _session_id, _view_id, spans, answer

ines · June 4, 2019, 10:16am

Thanks! I think the problem is this: for d in data:. In the outer loop, you’re going over each example in your dataset, which is a dictionary and which you’re storing in the variable data in your code.

However, you then also iterate over data, parsing the text and creating the tags each time. So instead of doing it once per example, you’re doing it once for each key in your example dict. Iterating over a dict in Python is perfectly valid, and what happens is that you iterate over its keys. For instance:

data = {"text": "hello", "meta": "world"}
for d in data:
    print(d)

This will print text and meta. Your example dict happens to have 8 keys, so your loop runs 8 times per example.
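
You can check the count yourself with a dict that has the same 8 keys as your examples. The values below are made-up placeholders, only the key names come from your data:

```python
# One example with the same 8 keys as the dataset (values are placeholders)
example = {
    "text": "John works for Microsoft",
    "_input_hash": 0,
    "_task_hash": 0,
    "tokens": [],
    "_session_id": None,
    "_view_id": "ner_manual",
    "spans": [],
    "answer": "accept",
}

runs = 0
for key in example:  # iterating a dict yields its keys
    runs += 1        # this is where the parsing and tagging got repeated

print(runs)  # 8
```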

TL;DR: Remove the for d in data:, you don’t need that.

Btw, another tip: If you have a list (like your examples in result), you can also just iterate over its elements in the for loop, instead of indexing into it. For example, instead of this:

for i in range(len(result)):
    data = result[i]

… you can write this:

for data in result:
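
Putting both changes together, the loop could look roughly like this. It's a sketch, not tested against your data: collect_tags, make_doc and tag_fn are names made up here so the example runs without spaCy installed, and it also folds in your accept filter. In your script you'd pass nlp and biluo_tags_from_offsets instead of the stand-ins at the bottom.

```python
def collect_tags(examples, make_doc, tag_fn):
    """Tokenize and tag each accepted example exactly once."""
    all_tokens, all_tags = [], []
    for data in examples:  # iterate over the list directly, no indexing
        if data.get("answer") != "accept":
            continue  # rejected examples are skipped
        doc = make_doc(data["text"])
        entities = [(span["start"], span["end"], span["label"])
                    for span in data.get("spans", [])]
        all_tokens.extend(doc)  # one entry per token
        all_tags.extend(tag_fn(doc, entities))
    return all_tokens, all_tags


# Stand-ins for illustration only: whitespace tokenizer and an all-"O" tagger
demo = [
    {"text": "John works for Microsoft", "spans": [], "answer": "accept"},
    {"text": "skipped", "answer": "reject"},
]
tokens, tags = collect_tags(demo, str.split, lambda doc, ents: ["O"] * len(doc))
print(len(tokens), len(tags))  # 4 4
```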

bjornvandijkman · June 4, 2019, 10:24am

Thank you for being so patient with me, and thanks for the help! The support section here is truly amazing.
