Starting with XML-tagged Corpus

Hi all,

I have a significant volume of text where my entities of interest are set off in XML tags, inline with the text. For example:

This is the kind of <ner>text</ner>, which includes <different_ner>tagged</different_ner> content inside it.

These tags are usually, but not always, correct, and usually, but not always, comprehensive.

Is there a recipe out there that shows what importing this kind of data into Prodigy might look like?


Hi! This mostly comes down to data transformation: converting your XML to JSON, stripping the tags and recording the character offsets of each tagged span in the resulting plain text. I’m no expert on XML parsing in Python, but it sounds like it should be doable? Given your example above, you’ll want a result like this for each <section>:

    {
        "text": "This is the kind of text, which includes tagged content inside it.",
        "spans": [
            {"start": 20, "end": 24, "label": "ner"},
            {"start": 41, "end": 47, "label": "different_ner"}
        ]
    }

Once you have data in this format, you can save it as .json or .jsonl and run it through a recipe like ner.manual to correct the annotations. Prodigy respects pre-defined entities, so the existing spans will already be highlighted, and you can remove the mistakes and add new ones if necessary. At the end of the process, you’ll end up with a dataset in the database containing your corrected annotations in a straightforward JSON format (which is hopefully also easier to work with than XML going forward).
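For what it's worth, the conversion described above could be sketched roughly like this with Python's standard-library `xml.etree.ElementTree`. This is only a minimal sketch, not an official Prodigy recipe: the helper names `xml_to_example` and `write_jsonl` are made up for illustration, and it assumes each fragment is well-formed XML (no stray `&` or `<` in the text) with flat, non-overlapping tags. Nested tags are simply flattened into the outer label.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_example(fragment):
    """Convert inline-tagged text into a {"text": ..., "spans": [...]} dict.

    Hypothetical helper: assumes well-formed XML with flat tags.
    """
    # Wrap in a dummy root so mixed text and tags parse as one element.
    root = ET.fromstring(f"<root>{fragment}</root>")
    parts = [root.text or ""]
    offset = len(parts[0])
    spans = []
    for child in root:
        # Text inside the tag (nested tags, if any, are flattened).
        inner = "".join(child.itertext())
        spans.append(
            {"start": offset, "end": offset + len(inner), "label": child.tag}
        )
        # Untagged text that follows the closing tag.
        tail = child.tail or ""
        parts.extend([inner, tail])
        offset += len(inner) + len(tail)
    return {"text": "".join(parts), "spans": spans}


def write_jsonl(fragments, path):
    """Write one JSON object per line, ready to load as a .jsonl source."""
    with open(path, "w", encoding="utf8") as f:
        for fragment in fragments:
            f.write(json.dumps(xml_to_example(fragment)) + "\n")


fragment = (
    "This is the kind of <ner>text</ner>, which includes "
    "<different_ner>tagged</different_ner> content inside it."
)
example = xml_to_example(fragment)
print(json.dumps(example, indent=2))
```

Wrapping each fragment in a dummy root element is just a convenience so that leading text before the first tag ends up in `root.text`; if your corpus already has a proper document structure, you'd iterate over the relevant elements instead.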

(Btw, one thing to watch out for: ner.manual will pre-tokenize the text to make highlighting easier, because selection can snap to token boundaries. It also makes it easier to spot tokenization issues and to prevent the annotations from containing labeled token spans that will never occur in the model in “real life”. However, this also means that the existing labeled spans need to match the model’s tokenization. Depending on where your existing data is from and how it was labelled, there might be examples where tokenization and entities don’t match. In that case, Prodigy will raise an error and let you know, so you can adjust the example or provide your own "tokens" property with the intended tokenization.)
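If you want to spot mismatches between spans and tokens before loading the data, a rough pre-check might look like the sketch below. Big caveat: the regex tokenizer here is only a stand-in and will not match the model's actual tokenization, so for a real check you'd use the same tokenizer as your model. The function names are hypothetical, and the keys in the token dicts ("text", "start", "end", "id") are my assumption about what a custom "tokens" property should contain, so double-check them against the Prodigy docs.

```python
import re


def simple_tokens(text):
    """Split text into word and punctuation tokens with character offsets.

    Stand-in tokenizer for illustration only, not the model's tokenizer.
    """
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


def misaligned_spans(example, tokens):
    """Return the spans whose start or end is not on a token boundary."""
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return [
        span
        for span in example["spans"]
        if span["start"] not in starts or span["end"] not in ends
    ]


example = {
    "text": "This is the kind of text, which includes tagged content inside it.",
    "spans": [
        {"start": 20, "end": 24, "label": "ner"},
        {"start": 41, "end": 47, "label": "different_ner"},
    ],
}
tokens = simple_tokens(example["text"])
problems = misaligned_spans(example, tokens)
if problems:
    print("Spans that don't line up with tokens:", problems)
else:
    # Attach the tokenization so the example carries its own "tokens".
    example["tokens"] = tokens
```

Examples flagged this way are the ones you'd either fix by hand or give an explicit "tokens" list that matches the intended boundaries.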

I just wrote a little package that could help with extracting basic XML into this kind of stand-off representation.