In "textcat" recipes, is it possible to format the to-be-annotated texts?

fhamborg · October 3, 2019, 9:45pm

Hi,

in textcat recipes, can one use Markdown (e.g., '*' to make text bold) or other formatting (such as HTML tags) to format the text that is shown to users, or is it plain text only? Alternatively, in textcat, could one use the annotation styles from, for instance, NER, to highlight certain parts of the text (but only for visual purposes, not allowing users to actually modify these annotations)?

Thank you in advance

Cheers,
Felix

ines · October 3, 2019, 10:03pm

Hi! If your data contains a "spans" property, those will be highlighted in the text, just like in the NER interfaces. They're only for display purposes, though – during training, Prodigy will only use the "text" and the "label".
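As a minimal sketch of the "spans" approach: assuming the NER-style span format with character offsets ("start", "end") and a "label", a small helper could mark the first occurrence of a phrase in each task. The function name add_highlight_span and the "HIGHLIGHT" label are made up for illustration and are not part of Prodigy.

```python
def add_highlight_span(task, phrase, label):
    """Return a copy of the task with a span marking the first occurrence of phrase."""
    start = task["text"].find(phrase)
    spans = list(task.get("spans", []))
    if start != -1:
        # Offsets are character positions into task["text"]; "end" is exclusive.
        spans.append({"start": start, "end": start + len(phrase), "label": label})
    return {**task, "spans": spans}


task = {"text": "This is some text.", "label": "SOME_LABEL"}
highlighted = add_highlight_span(task, "text", "HIGHLIGHT")
print(highlighted["spans"])
```

The original task is left unmodified, so the raw "text" and "label" stay available for training.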

Alternatively, you could also add an "html" key to your tasks and use that for the formatted version. Just make sure to also include the plain text version as "text", so you don't lose the raw data to train from. A task in your input source could then look something like this:

{
    "label": "SOME_LABEL",
    "text": "This is some text.",
    "html": "This is some <strong>text</strong>."
}
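One way to produce such tasks, sketched here under the assumption that you start from plain text plus a phrase to emphasize: escape the text first, so that characters like < or & in the raw data are not interpreted as markup, then wrap the phrase in <strong>. The helper name add_html is hypothetical.

```python
import html
import json


def add_html(task, phrase):
    """Return a copy of the task with an "html" key that bolds the first occurrence of phrase."""
    escaped_text = html.escape(task["text"])
    escaped_phrase = html.escape(phrase)
    # Replace only the first occurrence; the plain "text" key is kept untouched.
    marked = escaped_text.replace(
        escaped_phrase, f"<strong>{escaped_phrase}</strong>", 1
    )
    return {**task, "html": marked}


task = add_html({"label": "SOME_LABEL", "text": "This is some text."}, "text")
print(json.dumps(task))
```

Each resulting dict can be written as one line of a JSONL input file.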

fhamborg · October 4, 2019, 7:19am

That's awesome - thanks!

fhamborg · October 7, 2019, 12:04pm

Hi again, unfortunately, the html field does not seem to be used by prodigy. One annotation task in my input dataset looks like this:
{"text": "Nabil Abu Rdainah, a spokesman for Palestinian President Mahmoud Abbas, said that enacting both laws would force the Palestinians to appeal to international bodies.", "dbid": 8811895, "netype": "PERSON", "targetphrase": "Mahmoud Abbas", "html": "Nabil Abu Rdainah, a spokesman for Palestinian President <strong>Mahmoud Abbas</strong>, said that enacting both laws would force the Palestinians to appeal to international bodies."}

As you can see, it contains the fields text, html, and some others, which I will need again after using prodigy (dbid, netype, targetphrase). However, the web app does not print Mahmoud Abbas in bold.

What am I missing?

ines · October 7, 2019, 12:22pm

Ahh, I'm glad this came up: the problem here seems to be that the textcat.teach recipe always adds an empty "spans" key to the incoming stream, which tricks the interface into thinking the data should be rendered with spans and not as HTML. So given the original task, the interface would correctly interpret it as HTML – but not with the spans added by the recipe. I'll fix this for the next release, but in the meantime, you could add a workaround like this to the end of the recipe:

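def remove_empty_spans_from_stream(stream):
    # Drop the empty "spans" list added by the recipe so the "html" key is rendered
    for eg in stream:
        if "spans" in eg and not eg["spans"]:
            del eg["spans"]
        yield eg

stream = remove_empty_spans_from_stream(stream)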

fhamborg · October 7, 2019, 12:42pm

Thanks a lot! :o)

ines · October 7, 2019, 3:45pm

Just released v1.8.4, which should fix this under the hood!

fhamborg · October 7, 2019, 3:58pm

I already got the mail - excellent! Thank you!
