Using ner.manual on HTML Input

Hi everyone,

I am tackling my very first project using Prodigy. Eventually I want to use Prodigy to train a completely new entity type. My inputs are HTML files, and I would like to pass entire HTML files to Prodigy and let it select spans to present to the annotator.

For the moment, to get started, I want to iterate through small HTML snippets. The command I run is

<input.jsonl prodigy ner.manual example-dataset1 en_core_web_md -l ORG,PERSON,PRODUCT

My input.jsonl looks as follows:

{"text": "my text"}
{"text":  "more text"}

That works fine. However, if I replace the keys “text” with “html”, obtaining

{"html": "my text"}
{"html":  "more text"}

I get the following error:

ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.

Could anyone please point out what I am doing wrong here?

Hi! I think the problem here is that "html" input only works with the "html" annotation interface. If you’re annotating data with ner.manual, you’ll be selecting and labelling tokens – and that really only works on raw text. That’s why the recipe will only use the "text" key that’s present in your data.

If you pass in "<strong>hello</strong>", there’s no clear solution for how this should be handled. How should it be tokenized, and what are you really labelling here? The underlying markup or just the text, and what should the character offsets point to? And how should other markup be handled, e.g. images or complex, nested tags?

Similarly, if you’re planning on training a model later on, that model will also get to see the raw text, including the markup – so if you are working with raw HTML (like, web dumps or something), you usually want to see the original raw text that the model will be learning from. Otherwise, the model might be seeing data/markup that you didn’t see during annotation, which is always problematic.

This is btw also why the ner_manual interface will show you whitespace characters as subtle icons (instead of just swallowing or rendering them as they are). For example, it’s super important to see whether the spans you’re highlighting include tabs or newlines – otherwise, this can have pretty bad effects on your model. If you’re annotating with a model in the loop, you also want to clearly see what exactly the model is highlighting and predicting there.

I’m not 100% sure what you’re trying to label in your HTML markup – but one thing you could do is tokenize the text, remove the HTML markup tokens but keep the original token indices on all other tokens, so you can always map them back to the tokens in your original data. This lets you annotate the raw text in a nice and readable way – and when you’re done, you can extract the tokens of the highlighted spans and map them back to their positions in the source document.
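To make that idea a bit more concrete, here is a minimal sketch in plain Python. It assumes a very naive regex tokenizer (tags vs. whitespace-separated chunks) as a stand-in for a real tokenizer, and the "orig_start" / "orig_end" keys are hypothetical names I made up for the source offsets, not something Prodigy defines. The "start" and "end" values refer to the cleaned text, so the task stays consistent with what the annotator sees.

```python
import re

# Matches either an HTML tag or a run of characters that is neither
# whitespace nor "<". This is a naive stand-in for a real tokenizer.
TOKEN_RE = re.compile(r"<[^>]+>|[^\s<]+")


def html_to_task(html):
    """Build a task with clean text and tokens that remember their source offsets."""
    tokens = []
    parts = []
    offset = 0  # running character offset in the cleaned text
    for match in TOKEN_RE.finditer(html):
        piece = match.group()
        if piece.startswith("<"):
            continue  # drop markup tokens from the task
        tokens.append({
            "text": piece,
            "start": offset,
            "end": offset + len(piece),
            "id": len(tokens),
            # Hypothetical custom keys: character offsets in the original HTML
            "orig_start": match.start(),
            "orig_end": match.end(),
        })
        parts.append(piece)
        offset += len(piece) + 1  # +1 for the single space joining the tokens
    return {"text": " ".join(parts), "tokens": tokens}


html = "<p>Hello <strong>Apple</strong></p>"
task = html_to_task(html)
```

Here task["text"] is the readable text without tags, and every token can still be traced back to its position in the HTML via the extra offsets.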

If you just feed in raw text, Prodigy / spaCy will take care of the tokenization for you – but you can also feed in data in the following format with pre-defined "tokens":

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ]
}

When you annotate a span, Prodigy will then save the following to your dataset:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}
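Once you have records like this, getting from a span back to its tokens is a small lookup. The following is only a sketch: the helper name spans_to_tokens is made up, and it assumes "token_end" is inclusive, which is how it looks in the record above (a one-token span has "token_start" and "token_end" both set to 1).

```python
def spans_to_tokens(example):
    """Collect the label, text and covered token ids for each annotated span."""
    results = []
    for span in example.get("spans", []):
        # Assumption: token_end is inclusive, as in the record shown above
        covered = example["tokens"][span["token_start"]:span["token_end"] + 1]
        results.append({
            "label": span["label"],
            "text": example["text"][span["start"]:span["end"]],
            "token_ids": [token["id"] for token in covered],
        })
    return results


annotated = {
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1},
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ],
}
resolved = spans_to_tokens(annotated)
```

If your tokens carry extra properties of your own, you can read them off the covered tokens in the same loop.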

Thank you very much for your answer. It is very helpful to me. However, I have some follow-up questions.

How would I use the HTML interface? I understood your arguments and agree, but I might want to play around with it and see whether that might be helpful for me.

Let's say I create an enormous JSONL file containing extracts from all my HTML files. How could I add custom information, e.g. the original HTML file name, to it? Adopting your example, I want to start with

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "extra_information": {
        "file_hash": "ebb6f378660162c0630182996808d4a9",
        "file_id": "0001409970-17-000318"
    }
}

then annotate spans and export the annotations as

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ],
    "extra_information": {
        "file_hash": "ebb6f378660162c0630182996808d4a9",
        "file_id": "0001409970-17-000318"
    }
}

I hope that this example clarifies the intent. Any other way to preserve the source file of text snippets is equally welcome.

Thank you very much.

You can always write a custom recipe, but the easiest way would be to run the built-in mark recipe, which will simply stream in whatever comes in and render it with a given interface. For example:

prodigy mark your_dataset your_data.jsonl --view-id html

You could also experiment with different ways of breaking down the annotation into smaller binary decisions. For example, maybe you're able to extract candidates for the highlighted spans programmatically, e.g. via matcher rules or regular expressions. Even if there are many false positives, you'll be able to click through them very quickly and you'd probably still be faster than if you selected them manually. It'll also give you more consistent annotations, and you'll be able to spot potential problems or difficulties in the data that might also be tricky for a statistical model to learn later on.
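As a rough illustration of the candidate idea, here is a sketch using a regular expression. The pattern is entirely hypothetical (capitalised words followed by a company suffix) and you would replace it with whatever fits your data. Each match becomes its own task with one pre-highlighted span, so the annotator only has to accept or reject it.

```python
import re

# Hypothetical pattern: one or more capitalised words followed by a company suffix
CANDIDATE_RE = re.compile(r"(?:[A-Z][\w&]*\s)+(?:Inc|Corp|Ltd|LLC)\b\.?")


def candidate_tasks(text, label="ORG"):
    """Yield one task per regex match, each with a single pre-highlighted span."""
    for match in CANDIDATE_RE.finditer(text):
        yield {
            "text": text,
            "spans": [{"start": match.start(), "end": match.end(), "label": label}],
        }


sentence = "We met Acme Corp and Globex Inc. yesterday."
tasks = list(candidate_tasks(sentence))
```

Written out as JSONL, tasks like these could then be clicked through one by one. False positives are fine here, since rejecting them is quick.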

Btw, if you want to try out more dynamic interfaces, there's also experimental support for custom JavaScript – see this thread for discussion and examples. I'd still recommend using it sparingly, though – it's very tempting to overcomplicate the task, but we've found that you collect much better data if you're able to break the task down into a series of simple decisions, rather than fewer complex ones.

Your example should work out-of-the-box! Prodigy will preserve any additional properties in the data and simply pass them through. You can add your own custom property on the root object, or even on the individual tokens etc. The data can be anything, as long as it's JSON-serializable.
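For building the input file, something along these lines might be enough. It is only a sketch: make_record and to_jsonl are made-up helper names, and using MD5 for "file_hash" is an assumption based on the length of the hash in your example.

```python
import hashlib
import json


def make_record(text, source_bytes, file_id):
    """Attach source file details to a task as a custom root-level property."""
    return {
        "text": text,
        "extra_information": {
            # Assumption: MD5 of the raw file contents, as a 32-character hex string
            "file_hash": hashlib.md5(source_bytes).hexdigest(),
            "file_id": file_id,
        },
    }


def to_jsonl(records):
    """Serialize the records as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


record = make_record("Hello Apple", b"<p>Hello Apple</p>", "0001409970-17-000318")
jsonl = to_jsonl([record])
```

Since the whole record has to survive json.dumps anyway, this also doubles as a quick check that your custom data is JSON-serializable.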