combining two annotated datasets

Hi,

I have two annotated datasets from two different annotation tools:

  • one has been converted into your accepted JSON input format for training
  • the other comes from Prodigy annotations

I'm trying to convert the Prodigy output into the JSON input format, like so, but the character offsets are misaligned:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank('en')


def get_entity_tuples(spans: list):
    entity_tuples = []
    print(spans)
    for span in spans:
        entity_tuples.append((span.get('start'), span.get('end'), span.get('label')))
    return entity_tuples


json_input = []
for page_id, page in enumerate(accepted_annot):
    file_name = page.get('meta').get('filename', None)
    raw_string = page.get('text')
    doc = nlp(raw_string)
    offsets = get_entity_tuples(page.get('spans',[]))
    ner = biluo_tags_from_offsets(doc, offsets)
    tokens = []
    for token in doc:
        new_token = {"id": token.i, "orth": token.text, "ner": ner[token.i]}
        tokens.append(new_token)
    sentences = {"id": 0, "tokens": tokens}
    paragraph = {"raw": raw_string, "sentences": sentences}
    result = {"id": str(file_name), "paragraphs": [paragraph]}
    json_input.append(result)

E.g.

{'text': '11th', 'start': 1965, 'end': 1969, 'id': 433} # Prodigy output

doc[433]
'11th'  # correct

# But...
doc[433].idx
2015  # not 1965

Please can you advise where I'm going wrong, and whether there is a better way of combining the two sets of training data?

Thank you

Anna

Hi! Do you have overlaps between your two datasets, i.e. annotations on the same texts present in both? If not, the simplest solution might be to use data-to-spacy to convert your Prodigy annotations to JSON, and then concatenate the data so you have one large JSON dataset to train from.
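For instance, once both sets are in the training JSON format, merging them is just a matter of concatenating the two lists (the file names here are only placeholders for wherever your converted data lives):

import json

with open("dataset_a.json", encoding="utf8") as f1, open("dataset_b.json", encoding="utf8") as f2:
    # both files contain a list of documents in spaCy's training JSON format
    combined = json.load(f1) + json.load(f2)

with open("combined_training.json", "w", encoding="utf8") as f_out:
    json.dump(combined, f_out, indent=2)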

Hi @ines,

No overlaps in the data, so they can just be appended together.

I will look at data-to-spacy. Trying to do it manually keeps giving me various errors, and the offsets don't align.

I figured that the character offsets in the spans don't account for spaces. Instead, I've tried using token offsets, but these don't align with the doc either, as shown below...

It looks like the tokenization is different between what comes out of Prodigy and what I get when I run nlp() on the text to create the doc for biluo_tags_from_offsets:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank('en')


def get_entity_tuples(doc, spans: list):
    entity_tuples = []
    print(spans)
    for span in spans:
        token_start = span.get('token_start')
        token_end = span.get('token_end')
        print(doc)
        print('doc len: ', len(doc))
        print('span: ', span)
        print('token_start: ', str(token_start))
        print('token_end: ', str(token_end))
        print('doc[token_start]: ', str(doc[token_start]))
        print('doc[token_end]: ', str(doc[token_end]))
        token_start_char_start = doc[token_start].idx
        token_end_char_end = doc[token_end].idx + len(doc[token_end])
        entity_tuples.append((token_start_char_start, token_end_char_end, span.get('label')))
    return entity_tuples


json_input = []
for page_id, page in enumerate(accepted_annot):
    file_name = page.get('meta').get('filename', None)
    raw_string = page.get('text')
    doc = nlp(raw_string)
    offsets = get_entity_tuples(doc, page.get('spans',[]))
    ner = biluo_tags_from_offsets(doc, offsets)
    tokens = []
    for token in doc:
        new_token = {"id": token.i, "orth": token.text, "ner": ner[token.i]}
        tokens.append(new_token)
    sentences = {"id": 0, "tokens": tokens}
    paragraph = {"raw": raw_string, "sentences": sentences}
    result = {"id": str(file_name), "paragraphs": [paragraph]}
    json_input.append(result)

Error …

doc len:  156
span:  {'start': 751, 'end': 761, 'token_start': 156, 'token_end': 157, 'label': 'VALUE'}
token_start:  156
token_end:  157
Traceback (most recent call last):
  File "C:\Users\<>\AppData\Local\Continuum\anaconda3\envs\cit_ner_mod\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-143-d15be16eab9d>", line 28, in <module>
    offsets = get_entity_tuples(doc, page.get('spans',[]))
  File "<ipython-input-143-d15be16eab9d>", line 15, in get_entity_tuples
    print('doc[token_start]: ', str(doc[token_start]))
  File "doc.pyx", line 295, in spacy.tokens.doc.Doc.__getitem__
  File "token.pxd", line 21, in spacy.tokens.token.Token.cinit
IndexError: [E040] Attempt to access token at 156, max length 156.

I see data-to-spacy is available from Prodigy 1.9, but I'm on 1.8.5 I believe, and I'm unable to update at this point in time.
Would you be able to share a manual solution?

OK, so to get around the tokenization mismatch, I have used the tokens from the Prodigy output to put the doc back together, like so...

from spacy.tokens import Doc

tokens = page.get('tokens')  # the token dicts Prodigy stores on each example
words = []
for token in tokens:
    words.append(token.get('text'))

doc = Doc(nlp.vocab, words=words)

I need to do some more testing, but it looks like this may be my solution.
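As a quick sanity check (just something I'm trying, not from the docs), I'm comparing the rebuilt doc's text against the original text; since I'm only passing words, a space is assumed after every token, so the two may not match exactly:

# Compare the rebuilt doc against the original text from the Prodigy example
print(doc.text == page.get('text'))
print(repr(doc.text[:100]))
print(repr(page.get('text')[:100]))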

If there is a better way, I'd be keen to understand it.

Thank you

Anna

The manual solution would be to create Doc objects for each annotation and then run gold.docs_to_json.

And yes, whitespace is not included in the tokens, and each token indicates whether it's followed by whitespace or not (Token.whitespace_). Your approach looks good, but you might want to include a list of spaces as well when creating the doc: booleans indicating whether the token at that position is followed by a space. Otherwise, a space is assumed, so ["I", "'m", "happy", "!"] would end up as "I 'm happy !".
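For example, here's a rough sketch of that manual route, assuming each Prodigy example has "tokens" with a "ws" flag and "spans" with "token_start"/"token_end" (if your version doesn't set "ws", you can derive the spaces from the character offsets instead):

import spacy
from spacy.gold import docs_to_json
from spacy.tokens import Doc, Span

nlp = spacy.blank('en')

docs = []
for page in accepted_annot:
    tokens = page['tokens']
    words = [t['text'] for t in tokens]
    # 'ws' is a boolean per token: is this token followed by a space?
    spaces = [t.get('ws', True) for t in tokens]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    # Rebuild the entities from the token-based span offsets
    # (Prodigy's token_end is inclusive, Span's end is exclusive, hence the + 1)
    doc.ents = [
        Span(doc, span['token_start'], span['token_end'] + 1, label=span['label'])
        for span in page.get('spans', [])
    ]
    docs.append(doc)

json_data = docs_to_json(docs)

You can then dump json_data to a file and concatenate it with your other dataset as discussed above.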