combining two annotated datasets

Hi,

I have two annotated datasets from two different annotation tools:

  • one has been converted into your accepted JSON input format for training
  • the other comes from Prodigy annotations

I'm trying to convert the Prodigy output into the JSON input format, like so, but the character offsets are misaligned:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank('en')


def get_entity_tuples(spans: list):
    entity_tuples = []
    print(spans)
    for span in spans:
        entity_tuples.append((span.get('start'), span.get('end'), span.get('label')))
    return entity_tuples


json_input = []
for page_id, page in enumerate(accepted_annot):
    file_name = page.get('meta').get('filename', None)
    raw_string = page.get('text')
    doc = nlp(raw_string)
    offsets = get_entity_tuples(page.get('spans',[]))
    ner = biluo_tags_from_offsets(doc, offsets)
    tokens = []
    for token in doc:
        new_token = {"id": token.i, "orth": token.text, "ner": ner[token.i]}
        tokens.append(new_token)
    sentences = {"id": 0, "tokens": tokens}
    paragraph = {"raw": raw_string, "sentences": sentences}
    result = {"id": str(file_name), "paragraphs": [paragraph]}
    json_input.append(result)

E.g.

{'text': '11th', 'start': 1965, 'end': 1969, 'id': 433} # Prodigy output

doc[433]
'11th'  # correct

# But...
doc[433].idx
2015  # not 1965

Please can you advise where I'm going wrong, and whether there is a better way of combining the two sets of training data?

Thank you

Anna

Hi! Do you have overlaps between your two datasets, i.e. annotations on the same texts present in both? If not, the simplest solution might be to use data-to-spacy to convert your Prodigy annotations to JSON, and then concatenate the data so you have one large JSON dataset to train from.
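For instance, once both sets are in the training JSON format, merging them is just a matter of concatenating the two lists (the file names here are only placeholders for wherever your converted data lives):

import json

with open("dataset_a.json", encoding="utf8") as f1, open("dataset_b.json", encoding="utf8") as f2:
    # both files contain a list of documents in spaCy's training JSON format
    combined = json.load(f1) + json.load(f2)

with open("combined_training.json", "w", encoding="utf8") as f_out:
    json.dump(combined, f_out, indent=2)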

Hi @ines,

No overlaps in the data, so they can just be appended together.

I will look at data-to-spacy. Trying to do it manually keeps giving me various errors, and the offsets don't align.

I figured that the character offsets in the spans don't account for spaces. Instead, I've tried using token offsets, but these don't align with the doc either, as shown below...

It looks like the tokenization is different between what comes out of Prodigy and what I get when I run nlp() on the text to create the doc for biluo_tags_from_offsets:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank('en')


def get_entity_tuples(doc, spans: list):
    entity_tuples = []
    print(spans)
    for span in spans:
        token_start = span.get('token_start')
        token_end = span.get('token_end')
        print(doc)
        print('doc len: ', len(doc))
        print('span: ', span)
        print('token_start: ', str(token_start))
        print('token_end: ', str(token_end))
        print('doc[token_start]: ', str(doc[token_start]))
        print('doc[token_end]: ', str(doc[token_end]))
        token_start_char_start = doc[token_start].idx
        token_end_char_end = doc[token_end].idx + len(doc[token_end])
        entity_tuples.append((token_start_char_start, token_end_char_end, span.get('label')))
    return entity_tuples


json_input = []
for page_id, page in enumerate(accepted_annot):
    file_name = page.get('meta').get('filename', None)
    raw_string = page.get('text')
    doc = nlp(raw_string)
    offsets = get_entity_tuples(doc, page.get('spans',[]))
    ner = biluo_tags_from_offsets(doc, offsets)
    tokens = []
    for token in doc:
        new_token = {"id": token.i, "orth": token.text, "ner": ner[token.i]}
        tokens.append(new_token)
    sentences = {"id": 0, "tokens": tokens}
    paragraph = {"raw": raw_string, "sentences": sentences}
    result = {"id": str(file_name), "paragraphs": [paragraph]}
    json_input.append(result)

Error …

doc len:  156
span:  {'start': 751, 'end': 761, 'token_start': 156, 'token_end': 157, 'label': 'VALUE'}
token_start:  156
token_end:  157
Traceback (most recent call last):
  File "C:\Users\<>\AppData\Local\Continuum\anaconda3\envs\cit_ner_mod\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-143-d15be16eab9d>", line 28, in <module>
    offsets = get_entity_tuples(doc, page.get('spans',[]))
  File "<ipython-input-143-d15be16eab9d>", line 15, in get_entity_tuples
    print('doc[token_start]: ', str(doc[token_start]))
  File "doc.pyx", line 295, in spacy.tokens.doc.Doc.__getitem__
  File "token.pxd", line 21, in spacy.tokens.token.Token.cinit
IndexError: [E040] Attempt to access token at 156, max length 156.

I see data-to-spacy is available from Prodigy 1.9, but I'm on 1.8.5 I believe, and I'm unable to update at this point in time.
Would you be able to share a manual solution?

OK, so to get around the tokenization mismatch, I have used the tokens from the Prodigy output to put the doc back together, like so...

from spacy.tokens import Doc

tokens = page.get('tokens')  # the token dicts Prodigy stores on each example
words = []
for token in tokens:
    words.append(token.get('text'))

doc = Doc(nlp.vocab, words=words)

I need to do some more testing, but it looks like this may be my solution.
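As a quick sanity check (just something I'm trying, not from the docs), I'm comparing the rebuilt doc's text against the original text; since I'm only passing words, a space is assumed after every token, so the two may not match exactly:

# Compare the rebuilt doc against the original text from the Prodigy example
print(doc.text == page.get('text'))
print(repr(doc.text[:100]))
print(repr(page.get('text')[:100]))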

If there is a better way, I'd be keen to understand it.

Thank you

Anna

The manual solution would be to create Doc objects for each annotation and then run gold.docs_to_json.

And yes, whitespace is not included in the tokens, and each token indicates whether it's followed by whitespace or not (Token.whitespace_). Your approach looks good, but you might want to include a list of spaces as well when creating the doc: booleans indicating whether the token at that position is followed by a space. Otherwise, a space is assumed, so ["I", "'m", "happy", "!"] would end up as "I 'm happy !".
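For example, here's a rough sketch of that manual route, assuming each Prodigy example has "tokens" with a "ws" flag and "spans" with "token_start"/"token_end" (if your version doesn't set "ws", you can derive the spaces from the character offsets instead):

import spacy
from spacy.gold import docs_to_json
from spacy.tokens import Doc, Span

nlp = spacy.blank('en')

docs = []
for page in accepted_annot:
    tokens = page['tokens']
    words = [t['text'] for t in tokens]
    # 'ws' is a boolean per token: is this token followed by a space?
    spaces = [t.get('ws', True) for t in tokens]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    # Rebuild the entities from the token-based span offsets
    # (Prodigy's token_end is inclusive, Span's end is exclusive, hence the + 1)
    doc.ents = [
        Span(doc, span['token_start'], span['token_end'] + 1, label=span['label'])
        for span in page.get('spans', [])
    ]
    docs.append(doc)

json_data = docs_to_json(docs)

You can then dump json_data to a file and concatenate it with your other dataset as discussed above.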