Starting with XML-tagged Corpus

fros1y · November 15, 2018, 5:00pm

Hi all,

I have a significant volume of text where my entities of interest are set off in XML tags, inline with the text. For example:

<document>
...
<section>
This is the kind of <ner>text</ner>, which includes <different_ner>tagged</different_ner> content inside it.
</section>
....
</document>

These tags are usually, but not always, correct, and usually, but not always, comprehensive.

Is there a recipe out there that shows what that kind of importation process might look like with Prodigy?

Thanks,

ines · November 16, 2018, 11:34am

Hi! I think this mostly comes down to the data transformation: converting your XML to JSON while preserving the character offsets. I’m no expert on XML parsing in Python, but it sounds like it should be doable. Given your example above, you’ll want a result like this for each <section>:

{
    "text": "This is the kind of text, which includes tagged content inside it.",
    "spans": [
        {"start": 20, "end": 24, "label": "ner"},
        {"start": 41, "end": 47, "label": "different_ner"}
    ]
}
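For the conversion itself, here is a minimal sketch using Python's built-in xml.etree.ElementTree. It assumes the XML is well-formed, that every tag inside a <section> is an entity label, and that leading or trailing whitespace in the section should be dropped. The function name section_to_example is made up for illustration, and nested tags would come out as overlapping spans, which this sketch does not treat specially.

```python
import xml.etree.ElementTree as ET


def section_to_example(section):
    """Flatten one <section> element into {"text": ..., "spans": [...]}."""
    parts = []  # text fragments in document order
    spans = []  # character-offset spans, one per inline tag
    pos = 0     # current offset into the flattened text

    def add(text):
        nonlocal pos
        if text:
            parts.append(text)
            pos += len(text)

    def walk(elem, is_root=False):
        start = pos
        add(elem.text)           # text before the first child
        for child in elem:
            walk(child)
            add(child.tail)      # text between this child and the next
        if not is_root:
            spans.append({"start": start, "end": pos, "label": elem.tag})

    walk(section, is_root=True)
    text = "".join(parts)

    # Assumption: surrounding whitespace is not wanted, so strip it and
    # shift the offsets by the amount removed at the front.
    lead = len(text) - len(text.lstrip())
    for span in spans:
        span["start"] -= lead
        span["end"] -= lead
    spans.sort(key=lambda s: (s["start"], s["end"]))
    return {"text": text.strip(), "spans": spans}


xml_source = """<document>
<section>
This is the kind of <ner>text</ner>, which includes <different_ner>tagged</different_ner> content inside it.
</section>
</document>"""

root = ET.fromstring(xml_source)
examples = [section_to_example(sec) for sec in root.iter("section")]
```

Tracking the offset while walking the tree means the spans are computed against the same flattened string that ends up in "text", rather than being searched for afterwards.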

Once you have data in this format, you can save it as .json or .jsonl and run it through a recipe like ner.manual to correct the annotations. Prodigy respects pre-defined entities, so the existing spans will already be highlighted, and you can remove the mistakes and add new ones if necessary. At the end of the process, you’ll end up with a dataset in the database containing your corrected annotations in a straightforward JSON format (which is hopefully also easier to work with than XML going forward).
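Writing the .jsonl file is just one JSON object per line. A small sketch with the standard json module, where the file name sections.jsonl and the helper write_jsonl are placeholders:

```python
import json

# One record per <section>, in the format shown above.
examples = [
    {
        "text": "This is the kind of text, which includes tagged content inside it.",
        "spans": [
            {"start": 20, "end": 24, "label": "ner"},
            {"start": 41, "end": 47, "label": "different_ner"},
        ],
    }
]


def write_jsonl(path, records):
    """Write each record as a single line of JSON (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


write_jsonl("sections.jsonl", examples)
```

You would then point ner.manual at that file and pass your tag names as the labels, with something like `prodigy ner.manual your_dataset en_core_web_sm sections.jsonl --label ner,different_ner` (the dataset name and model here are just examples).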

(Btw, one thing to watch out for: ner.manual will pre-tokenize the text to make highlighting easier, because selection can snap to token boundaries. It also makes it easier to spot tokenization issues and to prevent the annotations from containing labeled token spans that will never occur in the model in “real life”. However, this also means that the existing labeled spans need to match the model’s tokenization. Depending on where your existing data is from and how it was labelled, there might be examples where tokenization and entities don’t match. In that case, Prodigy will raise an error and let you know, so you can adjust the example or provide your own "tokens" property with the intended tokenization.)
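To find mismatches before loading the data, you could check each span against the token boundaries yourself. The sketch below uses a deliberately naive regex tokenizer as a stand-in, so it will not match the model's real tokenization; in practice you would take the tokens from the tokenizer you actually use. The helper names are hypothetical, and the token dicts follow the "text", "start", "end", "id" layout that the "tokens" property expects as far as I know.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption, NOT the model's tokenizer):
    runs of word characters, or single punctuation characters."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


def misaligned_spans(example, tokens):
    """Return the spans whose start or end is not on a token boundary."""
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return [
        span
        for span in example["spans"]
        if span["start"] not in starts or span["end"] not in ends
    ]


text = "This is the kind of text, which includes tagged content inside it."
tokens = simple_tokenize(text)

good = {"text": text, "spans": [{"start": 20, "end": 24, "label": "ner"}]}
bad = {"text": text, "spans": [{"start": 20, "end": 23, "label": "ner"}]}  # cuts "text" short

good_problems = misaligned_spans(good, tokens)
bad_problems = misaligned_spans(bad, tokens)
```

Any example with a non-empty result is one to fix by hand, or to ship with its own "tokens" list so the tokenization matches the spans.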

millawell · June 28, 2019, 3:23pm

I just wrote a little package that could help with extracting basic XML into this kind of stand-off representation: https://github.com/millawell/standoffconverter/
