I’ve just acquired Prodigy to work on a manual NER tagging task. After manually tagging a document, the tool exports the results to JSON, the standard spaCy format. We would like to have this data tagged in the CoNLL format, a column format like so:
John PERSON
works O
for O
Microsoft ORGANIZATION
Is there an option to do this or should we opt to post-process the json file in order to do so?
Hi! I’d recommend writing your own converter, yes. spaCy actually ships with a biluo_tags_from_offsets helper that takes a text and character offsets and returns the BILUO entity labels. So this might be helpful?
You can also interact with Prodigy’s database directly from Python, so you’ll be able to skip the whole exporting/importing/exporting part.
Here’s an example (untested, but something along those lines should work):
from prodigy.components.db import connect
# spaCy v2 import; in spaCy v3 the helper is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
from spacy.lang.en import English  # or whichever language tokenizer you need

nlp = English()
db = connect()  # uses settings from your prodigy.json
examples = db.get_dataset('your_dataset')  # load the annotations
for eg in examples:
    doc = nlp(eg['text'])
    entities = [(span['start'], span['end'], span['label'])
                for span in eg['spans']]
    tags = biluo_tags_from_offsets(doc, entities)
    # do something with the tags here
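For the "do something with the tags" step, a minimal sketch of producing the column format from the question could look like this. The helper name biluo_to_conll is made up for illustration and is not part of spaCy or Prodigy. It also assumes you want the bare label without the B/I/L/U prefix, and that tokens spaCy marks with "-" (no alignment with the character offsets) should be written as O.

```python
def biluo_to_conll(tokens, tags):
    """Hypothetical helper: pair each token with its label, one per line.

    tokens: list of token strings (e.g. [t.text for t in doc])
    tags:   list of BILUO tags of the same length (e.g. "U-PERSON", "O")
    """
    lines = []
    for token, tag in zip(tokens, tags):
        if tag in ("O", "-"):
            # Assumption: misaligned tokens ("-") are treated as outside.
            label = "O"
        else:
            # Drop the BILUO prefix: "U-PERSON" becomes "PERSON".
            label = tag.partition("-")[2]
        lines.append(f"{token} {label}")
    return "\n".join(lines)


print(biluo_to_conll(
    ["John", "works", "for", "Microsoft"],
    ["U-PERSON", "O", "O", "U-ORGANIZATION"],
))
# John PERSON
# works O
# for O
# Microsoft ORGANIZATION
```

If you would rather keep the prefixes (as many CoNLL-style files do), you could write the tag as-is instead of stripping it.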
I have tried to do this for my jsonl file. However, I’m quite new to programming. The following script almost works, but it seems to go over the same line in the json file 8 times. Can anyone point out to me what I’m doing wrong here? In the code below, result holds my data.
extended_tags = []
extended_entities = []
extended_token = []
for i in range(len(result)):
    data = result[i]
    for d in data:
        doc = nlp(data['text'])
        for token in doc:
            extended_token.append(token)
        entities = [(span['start'], span['end'], span['label'])
                    for span in data['spans']]
        tags = biluo_tags_from_offsets(doc, entities)
        extended_tags.extend(tags)
        extended_entities.extend(entities)
@bjornvandijkman I think the indentation got a bit messed up when you copy-pasted the code over. Could you update that when you have a second? Otherwise, it’s a bit difficult to follow because it’s unclear what’s in which block. Also, what does your result look like?
I updated the code. The result is a dataset created using ner.manual and loaded into examples as you indicated. The only other thing I did was remove the examples where I did not accept the annotation, using the following code:
# Only keep the accepted answers, as the rejected ones have no span
result = []
for i in examples:
    if i['answer'] == "accept":
        result.append(i)
Format looks as follows for two of the lines:
text, _input_hash, _task_hash, tokens, _session_id, _view_id, spans, answer (the same eight keys for both lines)
Thanks! I think the problem is this: for d in data:. In the outer loop, you’re going over each example in your dataset, which is a dictionary and which you’re storing in the variable data in your code.
However, you then also iterate over data, parsing the text and creating the tags on every pass. So instead of doing it once per example, you’re doing it once for each key in your example dict. Iterating over a dict in Python is perfectly valid, and what happens is that you iterate over its keys. For instance:
data = {"text": "hello", "meta": "world"}
for d in data:
    print(d)
This will print text and meta. Your example dict happens to have 8 keys, so your loop runs 8 times per example.
TL;DR: Remove the for d in data:, you don’t need that.
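To see the effect in isolation, here is a small sketch that only counts how often the loop body runs. The example dict is a made-up stand-in with the same eight keys as your data (the values are placeholders), and the two function names are hypothetical.

```python
# Stand-in for one annotated example; values are placeholders.
example = {
    "text": "John works for Microsoft",
    "_input_hash": 0,
    "_task_hash": 0,
    "tokens": [],
    "_session_id": None,
    "_view_id": "ner_manual",
    "spans": [],
    "answer": "accept",
}
result = [example]


def count_with_inner_loop(examples):
    """Mirrors the original script: the body runs once per dict key."""
    runs = 0
    for i in range(len(examples)):
        data = examples[i]
        for d in data:  # iterates over the eight keys of the dict
            runs += 1
    return runs


def count_without_inner_loop(examples):
    """Same outer loop with the inner loop removed: once per example."""
    runs = 0
    for i in range(len(examples)):
        data = examples[i]
        runs += 1
    return runs


print(count_with_inner_loop(result))     # 8
print(count_without_inner_loop(result))  # 1
```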
Btw, another tip: If you have a list (like your examples in result), you can also just iterate over its elements in the for loop, instead of indexing into it. For example, instead of this: