Multi-line named entity

Hi,

I'm trying to annotate blocks of an email as signature, disclaimer or reply blocks, but when I try to select a block, it looks like Prodigy doesn't allow annotating a span that contains a line break. Is there a way around this?

Hi!

Yes, that's by design – for actual named entity tasks, you typically don't want newlines in the spans, because that's pretty much always a mistake. So by default, newline tokens are unselectable. You can turn this behaviour off by setting "allow_newline_highlight": true in your prodigy.json or recipe config. You can read more about newline tokens here: https://prodi.gy/docs/api-interfaces#ner_manual-newlines
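
For reference, here's a minimal sketch of switching that setting on from Python. The helper name and the path you pass in are placeholders for this example and not part of Prodigy; only the "allow_newline_highlight" key comes from the setting mentioned above.

```python
import json
from pathlib import Path


def enable_newline_highlight(config_path):
    """Set "allow_newline_highlight": true in a prodigy.json-style file."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["allow_newline_highlight"] = True
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config
```

If you'd rather not touch the global file, the same key can go into the "config" dict that a custom recipe returns.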

From what you describe, highlighting whole paragraphs also seems a bit inefficient, so there's probably a better way to solve this. What's your end goal, and what are you looking to train / do with the data later on?

Actual NER model implementations typically work best for short phrases with distinct start and end points (like noun phrases), so framing your problem as an NER task likely won't work very well, and a text classifier would do much better at the problem. (See this section on NER vs. textcat for some background on this.) In that case, highlighting the exact full spans is also less important. So you could go sentence by sentence and label whether the sentence is of one of the categories you're interested in.

Hi Ines,
thank you for your quick reply.

I am trying to replicate the work of Carvalho et al. in http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_submited.pdf, which is basically about learning to extract signature and reply lines from email.

It doesn't seem to fit either an NER task or a text classification task.
In this particular project I just want to use Prodigy to label the dataset.

This would be an example:

<other>  From: wcohen@cs.cmu.edu  
<other>  To: Vitor Carvalho <vitor@cs.cmu.edu>  
<other>  Subject: Re: Did you try to compile javadoc recently? 
<other>  Date: 25 Mar 2004 12:05:51 -0500  
<other> 
<other>  Try cvs update -dP, this removes files & directories that have been deleted from cvs. 
<other>  - W 
<other> 
<reply>  On Wed, 2004-03-24 at 19:58, Vitor Carvalho wrote: 
<reply>  > I just checked-out the baseline m3 code and 
<reply>  > "Ant dist" is working fine, but "ant javadoc" is not.   
<reply>  > Thanks 
<reply>  > Vitor 
<other> 
<sig>    ------------------------------------------------------------------   
<sig>    William W. Cohen                        “Would you drive a mime 
<sig>    wcohen@cs.cmu.edu                       nuts if you played an  
<sig>    http://www.wcohen.com                           audio tape at full  
<sig>    Associate Research Professor                        blast?” ----  
<sig>    CALD, Carnegie-Mellon University                        S. Wright

Ah, cool! In that case, I definitely think by labelling spans you'd be making the task unnecessarily hard to annotate. Instead, you could go over the text line by line and use the choice interface to select the label that applies to the line? Similar to the textcat.manual recipe with multiple labels.
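
As a rough sketch of what those tasks could look like: each line becomes one task with a list of options, which is the shape the choice interface works with. The label set is taken from your example, and the `make_choice_tasks` helper is made up for illustration.

```python
LABELS = ["other", "reply", "sig"]  # assumed label set, based on the example email


def make_choice_tasks(lines, labels=LABELS):
    """Yield one choice task per non-empty line of an email."""
    for i, line in enumerate(lines):
        if not line.strip():
            # Skipping blank lines is a choice made in this sketch; they could
            # just as well be given a default label afterwards.
            continue
        options = [{"id": label, "text": label} for label in labels]
        yield {"text": line, "options": options, "meta": {"line": i}}


email = "Thanks\nVitor\n\n--\nWilliam W. Cohen"
tasks = list(make_choice_tasks(email.split("\n")))
```

In a custom recipe, you'd return a stream like this together with "view_id": "choice", and the IDs of the selected options come back in each task's "accept" list.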

If you want more context (e.g. previous and next line), you could write a custom recipe and set your stream up like this:

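import html

def get_stream():
    # `lines` is assumed to be the list of lines of one email, loaded elsewhere.
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i != 0 else ""
        next_line = lines[i + 1] if i != len(lines) - 1 else ""
        # Escape the text so that e.g. <vitor@cs.cmu.edu> isn't parsed as an HTML tag.
        prev_line, line, next_line = (html.escape(s) for s in (prev_line, line, next_line))
        yield {"html": f"{prev_line}<br /><strong>{line}</strong><br />{next_line}", "id": i}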

This will show the lines as HTML, with the line you're currently annotating displayed in bold. The rest of the recipe could look pretty much exactly like this example.
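
Once you've annotated, getting back to the `<label>  line` format from your example could look roughly like this. It's only a sketch under a few assumptions: single-choice answers, "other" as the fallback label, and tasks that also carry the raw line under a "line" key (a hypothetical field you'd add to the stream yourself, since the tasks above only include "html" and "id").

```python
def to_labelled_lines(examples, default="other"):
    """Turn annotated choice tasks into "<label>  line" strings."""
    labelled = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # leave out rejected or ignored tasks
        selected = eg.get("accept", [])
        # Fall back to the default label if no option was selected.
        label = selected[0] if selected else default
        labelled.append(f"<{label}>  {eg['line']}")
    return labelled


examples = [
    {"line": "> Thanks", "accept": ["reply"], "answer": "accept"},
    {"line": "William W. Cohen", "accept": ["sig"], "answer": "accept"},
    {"line": "- W", "accept": [], "answer": "accept"},
    {"line": "???", "accept": ["sig"], "answer": "ignore"},
]
result = to_labelled_lines(examples)
```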