Hi! I think the problem here is that "html" input only works with the "html" annotation interface. If you’re annotating data with ner.manual, you’ll be selecting and labelling tokens – and that really only works on raw text. That’s why the recipe will only use the "text" key that’s present in your data.
If you pass in "<strong>hello</strong>", there’s no clear solution for how this should be handled. How should it be tokenized, and what are you really labelling here? The underlying markup or just the text, and what should the character offsets point to? And how should other markup be handled, e.g. images or complex, nested tags?
Similarly, if you’re planning on training a model later on, that model will also get to see the raw text, including the markup – so if you are working with raw HTML (like, web dumps or something), you almost always want to see the original raw text that the model will be learning from. Otherwise, the model might be seeing data/markup that you didn’t see during annotation, which is always problematic.
This is btw also why the ner_manual interface will show you whitespace characters as subtle icons (instead of just swallowing them or rendering them as they are). For example, it’s super important to see whether the spans you’re highlighting include tabs or newlines – otherwise, this can have pretty bad effects on your model. If you’re annotating with a model in the loop, you also want to clearly see what exactly the model is highlighting and predicting there.
I’m not 100% sure what you’re trying to label in your HTML markup – but one thing you could do is tokenize the text, remove the HTML markup tokens but keep the original token indices on all other tokens, so you can always map them back to the tokens in your original data. This lets you annotate the raw text in a nice and readable way – and when you’re done, you can extract the tokens of the highlighted spans and map them back to their positions in the source document.
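As a very rough sketch of that idea: tokenize the raw HTML, drop the markup tokens, but remember each remaining token’s position in the original sequence. The function name `strip_markup` and the `orig_id` key are just illustrative – this assumes simple, flat, well-formed markup, and a real pipeline would use a proper HTML parser plus spaCy’s tokenizer instead of a regex.

```python
import re

def strip_markup(html):
    """Tokenize flat HTML, drop markup tokens, keep original token ids.

    A rough sketch that assumes simple, flat, well-formed markup. A real
    pipeline would use a proper HTML parser plus spaCy's tokenizer.
    """
    raw_tokens = re.findall(r"<[^>]+>|\w+|[^\w\s<]", html)
    tokens, words, offset = [], [], 0
    for orig_id, tok in enumerate(raw_tokens):
        if tok.startswith("<"):
            continue  # drop the markup token, but orig_id keeps counting
        start = offset + 1 if words else 0  # cleaned tokens joined by spaces
        tokens.append({
            "text": tok,
            "start": start,
            "end": start + len(tok),
            "id": len(tokens),
            "orig_id": orig_id,  # position in the raw (markup-included) sequence
        })
        words.append(tok)
        offset = start + len(tok)
    return {"text": " ".join(words), "tokens": tokens}

task = strip_markup("<strong>hello</strong> world")
# task["text"] is "hello world"; the kept tokens still carry orig_id 1 and 3,
# so annotated spans can be mapped back to the source document later.
```

Extra keys like `orig_id` on the tokens are simply passed through, so they come back out with your annotations.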
If you just feed in raw text, Prodigy / spaCy will take care of the tokenization for you – but you can also feed in data in the following format with pre-defined "tokens":
{
  "text": "Hello Apple",
  "tokens": [
    {"text": "Hello", "start": 0, "end": 5, "id": 0},
    {"text": "Apple", "start": 6, "end": 11, "id": 1}
  ]
}
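To produce that format yourself, something like this works – note this sketch only splits on whitespace for simplicity, whereas spaCy’s tokenizer also handles punctuation, so you’d normally build the "tokens" list from a spaCy Doc instead:

```python
def make_task(text):
    # Hypothetical helper: whitespace tokenization only, as a sketch.
    # spaCy's tokenizer also splits off punctuation, so for real data
    # you'd iterate over the tokens of nlp(text) instead.
    tokens, offset = [], 0
    for i, word in enumerate(text.split()):
        start = text.index(word, offset)  # character offset of this token
        tokens.append({"text": word, "start": start, "end": start + len(word), "id": i})
        offset = start + len(word)
    return {"text": text, "tokens": tokens}

make_task("Hello Apple")
# returns the exact structure shown above
```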
When you annotate a span, Prodigy will then save the following to your dataset:
{
  "text": "Hello Apple",
  "tokens": [
    {"text": "Hello", "start": 0, "end": 5, "id": 0},
    {"text": "Apple", "start": 6, "end": 11, "id": 1}
  ],
  "spans": [
    {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
  ]
}
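Since each span carries "token_start" and "token_end" (inclusive token ids), getting the annotated tokens back out of a saved task is just an index lookup. The helper name `span_tokens` here is hypothetical:

```python
def span_tokens(task):
    # Hypothetical helper: look up the annotated tokens of a saved task
    # via the token_start/token_end indices on each span (both inclusive).
    by_id = {t["id"]: t for t in task["tokens"]}
    return [
        [by_id[i]["text"] for i in range(s["token_start"], s["token_end"] + 1)]
        for s in task.get("spans", [])
    ]

task = {
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1},
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ],
}
span_tokens(task)  # [["Apple"]]
```

If you kept something like an original-index key on your tokens, this is also the place where you’d map the highlighted spans back to their positions in the source document.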