"data-to-spacy" does not sentencize text based on custom sentencizer


Following Ines' code example here: https://support.prodi.gy/t/adding-custom-attribute-to-doc-having-ner-use-attribute/356/6, I tried to create a custom sentencizer to deal with bullet points "•".

So I added a "custom_sentencizer" to the pipeline in meta.json and modified the model package's __init__.py as follows:

# coding: utf8
from __future__ import unicode_literals

from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
from spacy.language import Language

__version__ = get_model_meta(Path(__file__).parent)['version']

def load(**overrides):
    Language.factories['custom_sentencizer'] = lambda nlp, **cfg: CustomSentencizer(nlp, **cfg)
    return load_model_from_init_py(__file__, **overrides)

class CustomSentencizer(object):
    name = 'custom_sentencizer'

    def __init__(self, nlp, **cfg):
        self.model = nlp

    def __call__(self, doc):
        for i, token in enumerate(doc[:-3]):
            if token.text == '•':
                doc[i].is_sent_start = True
                doc[i+1].is_sent_start = False
        return doc

Then I packaged it into en_custom and pip installed the tar.gz file.

However, after I annotated this test.jsonl file

{"text":"• This is sentence 1 • This is sentence 2"}


prodigy ner.manual test en_custom ./test.jsonl 

and converted to spaCy's JSON format via

prodigy data-to-spacy test_spacy.json --ner test

the text was not sentencized in the output JSON file i.e. there was still only 1 sentence under "sentences".
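
One way to confirm this is to count the entries under "sentences" for each paragraph in the converted file. The following is only a minimal sketch: it assumes the output follows spaCy v2's JSON training format (a list of documents, each with a "paragraphs" list whose items hold the "raw" text and a "sentences" list), and the helper names count_sentences and count_sentences_in_file are made up for illustration.

```python
import json


def count_sentences(docs):
    """Return a (raw_text, n_sentences) pair for every paragraph.

    Assumes spaCy v2's JSON training format: a list of documents,
    each with a "paragraphs" list containing "raw" and "sentences".
    """
    counts = []
    for doc in docs:
        for paragraph in doc.get("paragraphs", []):
            sentences = paragraph.get("sentences", [])
            counts.append((paragraph.get("raw"), len(sentences)))
    return counts


def count_sentences_in_file(path):
    """Load a converted JSON file and count the sentences per paragraph."""
    with open(path, encoding="utf8") as f:
        return count_sentences(json.load(f))


# Hand-written stand-in for the converted data (token lists left empty
# for brevity); it mirrors the single-sentence result described above.
example = [
    {
        "id": 0,
        "paragraphs": [
            {
                "raw": "• This is sentence 1 • This is sentence 2",
                "sentences": [{"tokens": [], "brackets": []}],
            }
        ],
    }
]

for raw, n_sentences in count_sentences(example):
    print(n_sentences, repr(raw))
```

With the real file, count_sentences_in_file("test_spacy.json") would return the same kind of list, which makes it easy to see whether the bullet points produced separate sentences.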

May I know what I am doing wrong?

P.S. My en_custom model could sentencize the "text" above properly when I tried it on Jupyter Notebook.

Hi! It looks like you've done everything correctly in terms of setting up and packaging your sentencizer :slightly_smiling_face:

It looks like you've hit an interesting edge case here in data-to-spacy: the recipe currently uses a blank model with the default sentencizer to process the examples (mainly tokenization and sentence segmentation). ner.manual doesn't segment any sentences and just shows you whatever you stream in, so you'll want to use your custom sentencizer when you convert the data for spaCy.

The easiest workaround for now would be to find the location of your Prodigy installation (you can run prodigy stats to get the path) and then open prodigy/recipes/train.py and find the data_to_spacy recipe function. You can then modify the calls to spacy.blank(lang) and nlp.add_pipe (first few lines) and either hard-code your model, or change it to spacy.load and remove the default sentencizer if you want to be able to pass in a model name instead of just a language code via --lang.
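
The recipe source isn't shown in this thread, so the following is only a sketch of what the modified setup could look like, not the actual Prodigy code. It assumes the spaCy v2 API (spacy.blank, nlp.create_pipe, nlp.add_pipe, spacy.load); the function name make_nlp and its base_model parameter are hypothetical.

```python
def make_nlp(lang="en", base_model=None):
    """Build the pipeline used to tokenize and segment the examples.

    Hypothetical helper, assuming spaCy v2. If base_model is given
    (e.g. "en_custom"), the packaged model is loaded with its own
    pipeline and the default sentencizer is NOT added on top.
    """
    # Imported lazily so the sketch can be imported without spaCy installed.
    import spacy

    if base_model is not None:
        # The packaged model already brings the custom sentencizer.
        return spacy.load(base_model)

    # Fallback: a blank model with the default rule-based sentencizer,
    # which is the behavior described above.
    nlp = spacy.blank(lang)
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    return nlp
```

Calling make_nlp(base_model="en_custom") would then segment the examples with the packaged model's own components, while make_nlp(lang="en") keeps the default behavior.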

I'll try to think of a good way to solve this in a general-purpose way :thinking: I think just allowing a custom base model for tokenization and sentence segmentation instead of just a base language should work fine.

Just released Prodigy v1.10, which adds a --base-model argument to the data-to-spacy recipe. For now, you'd still have to name your component "sentencizer" so that it's not disabled when the Doc objects are created (or perform your segmentation in the tokenizer). I'd love to resolve this as well, but this would require some deeper refactoring.