Converting text to JSON for Prodigy


I'm trying to convert a text to JSON with my own script so that I can annotate dependencies. The text is in Ancient Greek and uses different punctuation marks, so I want to split the sentences using spaCy's sentencizer in this simple way:

sentencizer = Sentencizer(punct_chars=[".", ";", "·"])

The period and the semicolon (the question mark in Greek) work, but the "·" (raised dot) does not: the sentencizer ignores it. Any idea why this happens, and is there a solution?

Check that your tokenizer splits "·" into a separate token, since the sentencizer looks for tokens that match the punctuation, not just characters in the string. (The name punct_chars is a bit misleading. Underneath it's more like punct_token_texts. You could also split on tokens like asdf if you wanted.)
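One way to run that check is sketched below. It assumes a blank multi-language pipeline ("xx") as a stand-in for whatever pipeline you actually use, and the helper name split_sentences is made up for this example. The punctuation is written as escapes so that the code points are unambiguous.

```python
import spacy
from spacy.pipeline import Sentencizer

# Assumption: a blank multi-language pipeline stands in for your own.
nlp = spacy.blank("xx")
sentencizer = Sentencizer(punct_chars=[".", ";", "\u00b7"])  # U+00B7 MIDDLE DOT


def split_sentences(text):
    """Tokenize text, apply the sentencizer, and return the sentence strings."""
    doc = sentencizer(nlp(text))
    return [sent.text for sent in doc.sents]


# Punctuation attached to the words, as in running text.
text = "πρῶτον\u00b7 δεύτερον; τρίτον."

# If the dot does not show up as its own token here, the sentencizer
# cannot match it, whatever punct_chars contains.
for token in nlp(text):
    print(repr(token.text), [f"U+{ord(ch):04X}" for ch in token.text])

print(split_sentences(text))
```

Printing the code points next to each token also shows whether the character in the text is really the same one that was passed to punct_chars.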

If your tokenizer doesn't split "·" into a separate token, you probably want to adjust the suffixes, see:
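As a rough sketch of what adjusting the suffixes could look like: the snippet below appends one extra suffix pattern to the defaults and rebuilds the tokenizer's suffix search. The blank "xx" pipeline and the choice of U+0387 (GREEK ANO TELEIA) as the extra character are assumptions for illustration only. Whether a given character is already covered by the default rules depends on the language class and the spaCy version.

```python
import spacy
from spacy.util import compile_suffix_regex

# Assumption: a blank multi-language pipeline; substitute your own language class.
nlp = spacy.blank("xx")

# Hypothetical extra suffix: GREEK ANO TELEIA, written as an escape so that
# an editor cannot silently swap it for a lookalike character.
extra_suffixes = ["\u0387"]

# Extend the default suffix patterns and install the new suffix search.
suffixes = list(nlp.Defaults.suffixes) + extra_suffixes
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("λόγος\u0387 καί")
print([token.text for token in doc])
```

Once the character is split off as its own token, it can be listed in punct_chars like any other sentence-final punctuation.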

The middle dot was actually being tokenized. The problem turned out to be the text editor I was using, which replaced the middle dot with a similar-looking dot. This seems to be a common problem with some Greek Unicode characters, which some text editors replace with alternatives. I wonder if I should add those punctuation marks to the language subclass, or just leave it as it is to force future users to use the right Unicode characters.
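For anyone hitting the same problem, lookalike characters can be spotted with Python's standard unicodedata module. This is only a sketch: the helper name punctuation_report is made up here, and the thread does not say which lookalike the editor substituted, so U+0387 below is just an example.

```python
import unicodedata


def punctuation_report(text):
    """List each distinct punctuation or symbol character with its code point and name."""
    report = []
    seen = set()
    for ch in text:
        # Unicode categories starting with "P" are punctuation, "S" are symbols.
        if unicodedata.category(ch)[0] in ("P", "S") and ch not in seen:
            seen.add(ch)
            report.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return report


# Two dots that look alike on screen but are different code points.
sample = "λόγος\u0387 λόγος\u00b7"
for char, code_point, name in punctuation_report(sample):
    print(code_point, name)

# NFC normalization maps some canonically equivalent characters onto one form,
# for example GREEK ANO TELEIA onto MIDDLE DOT. It does not help with
# lookalikes that are not canonically equivalent, and it also recomposes
# accented letters, so check the effect on your text before relying on it.
print(unicodedata.normalize("NFC", "\u0387") == "\u00b7")
```

Running the report over a corpus before annotation shows which variants are present, which helps when deciding whether to add them to the language subclass or to normalize the input instead.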