Hi @alphie,
If you're sure that normalizing the new lines in preprocessing is not an option, then yes, you could try constraining the spancat suggester to sentence boundaries.
I would check first if sentence boundaries are indeed the problem here. It might be that the new-line characters result in incorrect spancat candidates or inconsistent annotations, or maybe there are not enough examples of spans with new lines inside.
The first thing I recommend you do is to run spacy debug data on your span-annotated dataset. This should raise a warning if the annotated spans are not of good quality from the modelling perspective.
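For example (the config and corpus paths here are just placeholders; adjust them to your setup):

```
python -m spacy debug data ./config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
```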
Second, I'd check whether the annotations you create with Prodigy are compatible with the spancat suggester you use for training the spancat component. The spancat component performs two steps: 1) span candidate generation and 2) span candidate classification (you can find out more about how spancat works by reading this blog). If your annotations are incompatible with the suggestions, the model won't learn anything useful. If that's the case, you'll need to modify the suggester function.
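To get a feel for what the candidate-generation step produces, you can call the built-in n-gram suggester on a document directly. A minimal sketch (the example text and sizes are just placeholders):

```python
import spacy
from spacy.pipeline.spancat import build_ngram_suggester

nlp = spacy.blank("en")
doc = nlp("Markets rally\nafter surprise announcement")

# Build the default n-gram suggester with the span lengths you care about
suggester = build_ngram_suggester(sizes=[1, 2, 3])

# The suggester returns a thinc Ragged array: one row per candidate span,
# with token start and (exclusive) token end as the two columns
candidates = suggester([doc])
print(candidates.dataXd)   # candidate (start, end) token offsets
print(candidates.lengths)  # number of candidates per doc
```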
For a quick check whether your spans are compatible, you can pass the suggester function you used in training to spans.manual with the --suggester parameter and try to annotate the spans. If a span is incompatible, a pop-up with a warning will be shown.
Since the Prodigy CLI won't allow passing arguments to the registered function, if you want to use the default spaCy suggester spacy.ngram_suggester.v1, you'll need to wrap it in a function to be able to pass the required arguments, e.g.:
```python
from functools import partial

from spacy import registry
from spacy.pipeline.spancat import Suggester, ngram_suggester


@registry.misc("spacy.my_ngram_suggester.v1")
def build_my_ngram_suggester() -> Suggester:
    """Suggest all spans of the given lengths. Spans are returned as a ragged
    array of integers. The array has two columns, indicating the start and end
    position."""
    sizes = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    return partial(ngram_suggester, sizes=sizes)
```
Then you should be able to pass it to the Prodigy recipe like so:
python -m prodigy spans.manual test_spans blank:en news_headlines.jsonl --label A,B --suggester "spacy.my_ngram_suggester.v1" -F my_suggester.py
If you would like to validate your existing annotations programmatically, you could write a script that processes your annotations and compares them with the suggestions. You could reuse most of the logic of the validate_with_suggester function available in the spans.py source code (you can access it from your Prodigy package installation folder, under recipes/spans.py - run prodigy stats to find out the exact path).
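As a rough illustration (this is not the actual validate_with_suggester logic), a script along these lines could flag annotated spans the suggester would never propose. It assumes annotations exported with prodigy db-out to a file called annotations.jsonl, with spans carrying token_start/token_end as produced by spans.manual:

```python
import spacy
import srsly
from spacy.pipeline.spancat import build_ngram_suggester

nlp = spacy.blank("en")  # use the same tokenizer you annotated with
suggester = build_ngram_suggester(sizes=[2, 3, 4, 5, 6, 7, 8, 9, 10])

for eg in srsly.read_jsonl("annotations.jsonl"):
    doc = nlp(eg["text"])
    candidates = suggester([doc]).dataXd.tolist()
    candidate_set = {(int(start), int(end)) for start, end in candidates}
    for span in eg.get("spans", []):
        # Prodigy's token_end is inclusive, the suggester's end offset is exclusive
        key = (span["token_start"], span["token_end"] + 1)
        if key not in candidate_set:
            print(f"Not covered by the suggester: {span} in {eg['text'][:60]!r}")
```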
If you come to the conclusion that you need to constrain the spancat suggester to sentence boundaries, you will need a custom suggester function that uses the output of the sentencizer or senter (if you're going to train that component as you were planning). spacy-experimental has some examples of suggester function variations, including a sentence-suggester that constrains n-gram suggestions to sentence boundaries.
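For illustration, a minimal sketch of such a suggester could look like this (this is not the spacy-experimental implementation; it assumes the incoming docs already have sentence boundaries set, and the registered name my_sentence_ngram_suggester.v1 is just a placeholder):

```python
from functools import partial
from typing import List, Optional

from spacy import registry
from spacy.pipeline.spancat import Suggester
from spacy.tokens import Doc
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


def sentence_ngram_suggester(
    docs: List[Doc], sizes: List[int], *, ops: Optional[Ops] = None
) -> Ragged:
    """Suggest n-grams of the given sizes, never crossing a sentence boundary."""
    if ops is None:
        ops = get_current_ops()
    spans = []
    lengths = []
    for doc in docs:
        count = 0
        # doc.sents requires sentence boundaries (set by sentencizer or senter)
        for sent in doc.sents:
            for size in sizes:
                for start in range(sent.start, sent.end - size + 1):
                    spans.append((start, start + size))
                    count += 1
        lengths.append(count)
    return Ragged(
        ops.asarray(spans, dtype="i").reshape((-1, 2)),
        ops.asarray(lengths, dtype="i"),
    )


@registry.misc("my_sentence_ngram_suggester.v1")
def build_sentence_ngram_suggester() -> Suggester:
    sizes = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    return partial(sentence_ngram_suggester, sizes=sizes)
```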
So the suggested workflow would be as follows:
1) Implement a custom suggester function that makes sure the output is within sentence boundaries (take this suggester, or the sketch above, as an example).
2) Try annotating with spans.manual, the built-in rule-based sentencizer, and your custom suggester function to see what the suggestions look like. Perhaps the new lines do not trip up the sentencizer and, with the new suggester, the suggestions include the new lines. It's recommended to pass the custom suggester function to the prodigy spans.manual recipe with the --suggester parameter as explained above, to make sure your annotations are compatible with the model's candidates at inference time.
3) If it turns out that a trained SentenceRecognizer is needed, you would work with sents.correct to train the senter component. This workflow would be completely independent of spancat for training and evaluation.
4) Once you have a satisfactory senter component, repeat step 2) with the spaCy pipeline from step 3) and your custom suggester function from step 1).
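Concretely, that last step could look something like the command below; the pipeline path ./senter_model, the dataset and file names, and my_sentence_ngram_suggester.v1 in sentence_suggester.py are all placeholders carried over from the sketch above:

```
python -m prodigy spans.manual news_spans ./senter_model news_headlines.jsonl --label A,B --suggester "my_sentence_ngram_suggester.v1" -F sentence_suggester.py
```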