Multi-line named entity

Hi,

I'm trying to annotate blocks of an email as signature, disclaimer, or reply blocks, but when I try to select the blocks, it looks like Prodigy does not allow annotating a block that contains a line break. Is there a way around this?

Hi!

Yes, that's by design – for actual named entity tasks, you typically don't want newlines in the spans, because that's pretty much always a mistake. So by default, newline tokens are unselectable. You can turn this behaviour off by setting "allow_newline_highlight": true in your prodigy.json or recipe config. You can read more about newline tokens here: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP
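For reference, a minimal prodigy.json override enabling the setting mentioned above might look like this (assuming no other overrides are needed):

```json
{
  "allow_newline_highlight": true
}
```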

From what you describe, highlighting whole paragraphs also seems a bit inefficient, so there's probably a better way to solve this. What's your end goal, and what are you looking to train / do with the data later on?

Actual NER model implementations typically work best for short phrases with distinct start and end points (like noun phrases), so framing your problem as an NER task likely won't work very well, and a text classifier would do much better at the problem. (See this section on NER vs. textcat for some background on this.) In that case, highlighting the exact full spans is also less important. So you could go sentence by sentence and label whether the sentence is of one of the categories you're interested in.
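To make the sentence-by-sentence idea concrete, here's a minimal sketch in plain Python (no Prodigy required) that turns a text into one classification task per line, with a choice-style options list. The label names here are just placeholders for whatever categories you're interested in:

```python
def sentence_tasks(text, labels=("signature", "disclaimer", "reply")):
    """Yield one Prodigy-style task dict per non-empty line of the text."""
    options = [{"id": label, "text": label} for label in labels]
    for i, line in enumerate(text.split("\n")):
        line = line.strip()
        if line:  # skip blank lines between blocks
            yield {"text": line, "options": options, "meta": {"row": i}}

tasks = list(sentence_tasks("Best regards,\n\nJohn Doe"))
```

Each task dict can then be saved as a line of JSONL and fed into an annotation recipe.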

Hi Ines,
thank you for your quick reply.

I am trying to replicate the work of Carvalho et al. in http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_submited.pdf , basically learning to extract signature and reply lines from email.

It doesn't seem to fit either an NER task or a text classification task.
In this particular project I just want to use prodi.gy to label the dataset.

this would be an example:

<other>  From: wcohen@cs.cmu.edu  
<other>  To: Vitor Carvalho <vitor@cs.cmu.edu>  
<other>  Subject: Re: Did you try to compile javadoc recently? 
<other>  Date: 25 Mar 2004 12:05:51 -0500  
<other> 
<other>  Try cvs update -dP, this removes files & directories that have been deleted from cvs. 
<other>  - W 
<other> 
<reply>  On Wed, 2004-03-24 at 19:58, Vitor Carvalho wrote: 
<reply>  > I just checked-out the baseline m3 code and 
<reply>  > "Ant dist" is working fine, but "ant javadoc" is not.   
<reply>  > Thanks 
<reply>  > Vitor 
<other> 
<sig>    ------------------------------------------------------------------   
<sig>    William W. Cohen                        “Would you drive a mime 
<sig>    wcohen@cs.cmu.edu                       nuts if you played an  
<sig>    http://www.wcohen.com                           audio tape at full  
<sig>    Associate Research Professor                        blast?” ----  
<sig>    CALD, Carnegie-Mellon University                        S. Wright

Ah, cool! In that case, I definitely think that by labelling spans you'd be making the task unnecessarily hard to annotate. Instead, you could go over the text line by line and use the choice interface to select the label that applies to each line, similar to the textcat.manual recipe with multiple labels.

If you want more context (e.g. previous and next line), you could write a custom recipe and set your stream up like this:

def get_stream():
    # "lines" is assumed to hold the lines of the email being annotated
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i != 0 else ""
        next_line = lines[i + 1] if i != len(lines) - 1 else ""
        # Show the current line in bold, with one line of context on each side
        yield {"html": f"{prev_line}<br /><strong>{line}</strong><br />{next_line}", "id": i}

This will show the lines as HTML, with the line you're currently annotating displayed in bold. The rest of the recipe could look pretty much exactly like this example.

(Setting that option in the JSON config does let the user select it, so it solves the original problem. Thanks!)

With "span" classification (unlike NER), a span might be a lot longer than a typical entity and contain several lines; an email signature block, for example, spans multiple lines.

Question: what's the "best practice" for span length?
Should I try to capture the entire signature block (which may be 5-6 lines with 3-8 words on each)?

Or would it be better to capture only the first line of a block?

Hey @vish ,

I suppose that showing lines within some local context should be enough for the annotator to make the decision. The good thing about Prodigy is that you can program the way examples are shown to the user leveraging patterns like the one you've mentioned.

Not sure I understand your question about span categorization. In Prodigy you can annotate overlapping spans with spans.manual, and this data can be used to train spaCy's SpanCategorizer.

Editing the question I asked, to clarify:

Hey @vish

The span categorizer can handle longer spans than NER, but spans of several lines, including entire paragraphs, can hurt model performance due to the number of span suggestions generated by the n-gram suggester.
Here's a relevant spaCy post on the topic:

You might need to experiment with the configuration of the suggester function as explained in this thread on memory errors with long spans: spancat out of memory - #2 by ines
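As a back-of-the-envelope illustration of why long spans are costly (plain Python, not spaCy's actual suggester code): an n-gram suggester that proposes all spans up to a maximum length generates a candidate set that grows with both document length and that maximum, so paragraph-sized spans inflate it considerably.

```python
def ngram_suggestion_count(n_tokens, max_span_len):
    """Count all candidate spans of length 1..max_span_len in a doc of n_tokens tokens."""
    return sum(max(0, n_tokens - k + 1) for k in range(1, max_span_len + 1))

short = ngram_suggestion_count(200, 3)    # entity-like spans, a few tokens long
long_ = ngram_suggestion_count(200, 40)   # paragraph-sized spans
```

For a 200-token document, allowing spans up to 40 tokens yields over ten times as many candidates as capping them at 3 tokens, which is roughly where the memory pressure in the linked thread comes from.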

If your spans cover entire sentences (or sequences of them), reframing the task as text classification over sentences will often be easier to learn.