multiline spancat

Using spancat I am annotating longer sentences. The data was extracted as a text stream from a PDF and has lots of newline characters.


Spancat will not annotate across the newline.

Q1: Is it possible to annotate as one span crossing the newline?

This issue also comes up in NER, where an entity may be broken by a newline. Q2: Using NER, is it possible to annotate an entity across a newline?

Q3: Is it even a good idea, or might it throw the model? (The problem is that newlines will really vary depending on the size of the text box in the PDF, so they break sentences (spans) and sometimes even words (NER entities).)

Q4: Is there a better approach? My model is *working* in terms of recognising spans, but I am getting out weirdly truncated sentences!

Hi @alphie!

Apologies for the delay in reply!
We have a doc section on handling the new line characters:

Why does Prodigy add ↵ characters for newlines?

A newline only produces a line break and is otherwise invisible. So if your input data contains many newlines, they can be easy to miss during annotation. To your model, however, \n is just a unicode character. If you're labelling entities and accidentally include newlines in some of them and not others, you'll end up with inconsistent data and potentially very confusing and bad results. To prevent this, Prodigy will show you a ↵ for each newline in the data (in addition to rendering the newline itself). The tab character \t will be replaced by ⇥, by the way, for the same reason.

As of v1.9, tokens containing only newlines (or only newlines and whitespace) are unselectable by default, so you can’t include them in spans you highlight. To disable this behavior, you can set "allow_newline_highlight": true in your prodigy.json.
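For example, a minimal prodigy.json override for this would be (assuming you have no other overrides to merge in):

```json
{
  "allow_newline_highlight": true
}
```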

So yes, re Q1/Q2 it is possible to configure Prodigy to include newlines in spans.
Re Q3/Q4 - as mentioned in the quote from the docs, new lines are just tokens to the model like any other characters so including them in an inconsistent way will of course have a negative impact. The best solution would probably be data cleaning/preprocessing step to normalize these newlines e.g. substituting sequences of multiple new lines to just one new line, and try to remove the spurious new lines that do not appear at the end of the sentence. (Of course, the same preprocessing will have to be applied at inference time)

Thanks for the reply. It could be quite a task to work out which newlines to remove in the input data. I'm wondering about the model learning sentences, e.g. using sent.correct. If that is a potential method, can you explain a workflow to me? E.g. do I do spancat first and then sent.correct on the same model? I'm a bit hazy on how to get the whole pipeline thing going, so a pointer to the docs on that would also be great.

Hi @alphie,

If you're sure normalizing the new lines in preprocessing is not an option, yes, you could try constraining spancat suggester to sentence boundaries.
I would check first if sentence boundaries are indeed the problem here. It might be that the newline characters result in incorrect spancat candidates or inconsistent annotations, or maybe there are not enough examples of spans with newlines inside.

The first thing I recommend you do is run spacy debug data on your span-annotated dataset. This should raise a warning if the annotated spans are of low quality from the modelling perspective.

Second, I'd try to see if the annotations you create with Prodigy are compatible with the spancat suggester you use for training the spancat component. The spancat component performs two steps: 1) span candidate generation and 2) span candidate classification (you can find out more about how spancat works by reading this blog). If your annotations are incompatible with the suggestions, the model won't learn anything useful. If that's the case you'll need to modify the suggester function.
For a quick check whether your spans are compatible, you can pass the suggester function you used in training to spans.manual with the --suggester parameter and try to annotate the spans. If spans are incompatible, a pop-up with a warning will be raised.
Since the Prodigy CLI won't allow passing arguments to the registered function, if you want to use the default spaCy suggester spacy.ngram_suggester.v1, you'll need to wrap it in a function to be able to pass the required arguments, e.g.:

from functools import partial

from spacy import registry
from spacy.pipeline.spancat import Suggester, ngram_suggester


@registry.misc("spacy.my_ngram_suggester.v1")
def build_my_ngram_suggester() -> Suggester:
    """Suggest all spans of the given lengths. Spans are returned as a ragged
    array of integers. The array has two columns, indicating the start and end
    position."""
    sizes = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    return partial(ngram_suggester, sizes=sizes)

Then you should be able to pass it to Prodigy recipe like so:

python -m prodigy spans.manual test_spans blank:en news_headlines.jsonl --label A,B --suggester "spacy.my_ngram_suggester.v1" -F my_suggester.py

If you would like to validate your existing annotations programmatically, you could write a script that processes your annotations and compares them with the suggestions. You could reuse most of the logic of the validate_with_suggester function available in the spans.py source code (you can access it from your Prodigy package installation folder, then recipes/spans.py - run prodigy stats to find out the exact path).
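A stripped-down version of that check, assuming you've already extracted your annotated spans as (start, end) token-index pairs (the helper names below are hypothetical, not Prodigy API):

```python
def ngram_candidates(n_tokens, sizes):
    """Candidates an n-gram suggester produces for a doc of n_tokens tokens."""
    return [
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    ]


def incompatible_spans(annotated_spans, suggested_spans):
    """Return annotated spans that no suggester candidate matches.

    spancat only classifies the candidates produced in step 1, so any
    annotated span missing from the candidate set can never be predicted.
    """
    suggestions = set(map(tuple, suggested_spans))
    return [span for span in map(tuple, annotated_spans) if span not in suggestions]
```

For example, with sizes up to 10, an annotated span of 12 tokens would be reported as incompatible; that's exactly the situation long multi-line spans can get you into.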

If you come to the conclusion that you need to constrain the spancat suggester to sentence boundaries, you would need a custom suggester function that uses the output of the sentencizer or the senter component (if you're going to train one, as you were planning).
spacy-experimental has some examples of suggester function variations, including a sentence suggester that constrains n-gram suggestions to sentence boundaries.
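The core idea of such a suggester can be sketched in plain Python. Here sentence boundaries are passed in as (start, end) token-index pairs (e.g. taken from doc.sents); a real spancat suggester would additionally wrap the result in a thinc Ragged array, which this sketch leaves out:

```python
def sentence_ngram_candidates(sent_boundaries, sizes):
    """N-gram span candidates that never cross a sentence boundary.

    sent_boundaries: iterable of (sent_start, sent_end) token-index pairs.
    sizes: n-gram lengths to suggest, as in the built-in n-gram suggester.
    """
    candidates = []
    for sent_start, sent_end in sent_boundaries:
        for size in sizes:
            # Only generate n-grams that fit entirely inside this sentence.
            for start in range(sent_start, sent_end - size + 1):
                candidates.append((start, start + size))
    return candidates
```

Because every candidate is generated inside a single sentence window, no suggestion can span a sentence break, which is the property you want once the sentencizer/senter handles the newlines correctly.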

So the suggested workflow would be as follows:

  1. implement a custom suggester function that makes sure the output is within sentence boundaries (take this suggester as an example)
  2. try annotating with spans.manual using the built-in rule-based sentencizer and your custom suggester function to see what the suggestions look like. Perhaps the newlines do not trip up the sentencizer, and with the new suggester the suggestions no longer include the newlines. It's recommended to pass the custom suggester function to the prodigy spans.manual recipe with the --suggester parameter, as explained above, to make sure your annotations are compatible with the model's candidates at inference time.
  3. If it turns out that a trained SentenceRecognizer is needed, you would work with sent.correct to train the senter component. This workflow is completely independent of spancat for training and evaluation.
  4. Once you have a satisfactory senter component, you repeat step 2 with the spaCy pipeline from step 3 and your custom suggester function from step 1.