In "textcat" recipes, is it possible to format the to-be-annotated texts?

Hi,

in textcat recipes, can one use markdown (e.g., '*' to make a text bold) or other formatting (like HTML tags) to format the text that is shown to users, or is it plain text only? Alternatively, in textcat, could one use the annotation styles from, for instance, NER, to highlight certain parts of the text (but only for visual purposes, not allowing users to actually modify these annotations)?

Thank you in advance :slight_smile:

Cheers,
Felix

Hi! If your data contains a "spans" property, those will be highlighted in the text, just like in the NER interfaces. They're only for display purposes, though – during training, Prodigy will only use the "text" and the "label".
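
As a rough sketch of the spans option, a task with one pre-highlighted phrase could be built like this. It assumes each span is a dict with character offsets ("start", "end") and a "label"; the helper name `make_highlight_task` and the "HIGHLIGHT" label are made up for illustration and are not part of Prodigy:

```python
def make_highlight_task(text, phrase, label, span_label="HIGHLIGHT"):
    """Build a task dict that highlights the first occurrence of `phrase`.

    `make_highlight_task` and `span_label` are hypothetical names used
    only for this sketch.
    """
    task = {"text": text, "label": label}
    start = text.find(phrase)
    if start != -1:
        # Character offsets into `text`; the end offset is exclusive.
        task["spans"] = [
            {"start": start, "end": start + len(phrase), "label": span_label}
        ]
    return task


task = make_highlight_task("This is some text.", "text", "SOME_LABEL")
```

If the phrase isn't found, the task is returned without a "spans" key, so nothing is highlighted.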

Alternatively, you could also add an "html" key to your tasks and use that for the formatted version. Just make sure to also include the plain text version as "text", so you don't lose the raw data to train from. A task in your input source could then look something like this:

{
    "label": "SOME_LABEL",
    "text": "This is some text.",
    "html": "This is some <strong>text</strong>."
}
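
If you'd rather generate the "html" value than write it by hand, something along these lines might work. It's only a sketch: `add_html` is a hypothetical helper, and it assumes you want the first occurrence of a phrase wrapped in <strong> while the raw "text" stays untouched:

```python
import json
from html import escape


def add_html(task, phrase):
    """Return a copy of `task` with an "html" key that bolds `phrase`.

    `add_html` is a hypothetical helper for this sketch. The plain
    "text" value is left as-is so it can still be used for training.
    """
    text = task["text"]
    start = text.find(phrase)
    if start == -1:
        # Phrase not found: leave the task unchanged.
        return dict(task)
    end = start + len(phrase)
    # Escape each piece so characters like "<" or "&" in the raw text
    # don't end up being interpreted as markup.
    marked = (
        escape(text[:start])
        + "<strong>" + escape(text[start:end]) + "</strong>"
        + escape(text[end:])
    )
    return {**task, "html": marked}


task = add_html({"label": "SOME_LABEL", "text": "This is some text."}, "text")
line = json.dumps(task)  # one line of a JSONL input file
```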

That's awesome - thanks!

Hi again, unfortunately, the html field does not seem to be used by Prodigy. One annotation task in my input dataset looks like this:
{"text": "Nabil Abu Rdainah, a spokesman for Palestinian President Mahmoud Abbas, said that enacting both laws would force the Palestinians to appeal to international bodies.", "dbid": 8811895, "netype": "PERSON", "targetphrase": "Mahmoud Abbas", "html": "Nabil Abu Rdainah, a spokesman for Palestinian President <strong>Mahmoud Abbas</strong>, said that enacting both laws would force the Palestinians to appeal to international bodies."}

As you can see, it contains the fields text, html, and some others, which I will need again after using Prodigy (dbid, netype, targetphrase). However, the web app does not print Mahmoud Abbas in bold.

What am I missing?

Ahh, I'm glad this came up: the problem here seems to be that the textcat.teach recipe always adds an empty "spans" key to the incoming stream, which tricks the interface into thinking the data should be rendered with spans and not as HTML. So given the original task, the interface would correctly interpret it as HTML – but not with the spans added by the recipe. I'll fix this for the next release, but in the meantime, you could add a workaround like this to the end of the recipe:

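def remove_empty_spans_from_stream(stream):
    for eg in stream:
        # Only drop the empty "spans" list that the recipe added;
        # tasks with real spans (or no "spans" key) pass through unchanged.
        if "spans" in eg and not eg["spans"]:
            del eg["spans"]
        yield eg

stream = remove_empty_spans_from_stream(stream)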

Thanks a lot! :o)

Just released v1.8.4, which should fix this under the hood!

I already got the mail - excellent! Thank you!