converting text to json for prodigy

jcbmyrstn · September 16, 2020, 5:05pm

Hi,

I'm trying to convert a text to JSON with my own script so that I can annotate dependencies. The text is in Ancient Greek and uses different punctuation marks, so I want to split the sentences using spaCy's Sentencizer in this simple way:

sentencizer = Sentencizer(punct_chars=[".", ";", "·"])

The period and the semicolon (the question mark in Greek) work, but the "·" (raised dot) does not: it is ignored by the sentencizer. Any idea why this is so, and any solutions?

adriane · September 17, 2020, 7:51am

Check that your tokenizer splits "·" into a separate token, since the sentencizer looks for tokens that match the punctuation, not just characters in the string. (The name punct_chars is a bit misleading. Underneath it's more like punct_token_texts. You could also split on tokens like asdf if you wanted.)

If your tokenizer doesn't split "·" into a separate token, you probably want to adjust the suffixes, see: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
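
A rough sketch of that check, in case it helps. It assumes spaCy v3 (where the sentencizer is added by its string name) and a blank Greek pipeline via `spacy.blank("el")`; the sample sentence and the `MIDDLE_DOT` name are made up for illustration. Adding the suffix is harmless if the default rules already split the dot off.

```python
import spacy
from spacy.util import compile_suffix_regex

# Spell the code point out so an editor cannot silently swap the character.
MIDDLE_DOT = "\u00b7"

# Assumption: spaCy v3 and a blank Greek pipeline; use your own language class here.
nlp = spacy.blank("el")

# Add the middle dot to the suffix rules so it is split off the preceding word.
suffixes = list(nlp.Defaults.suffixes) + [MIDDLE_DOT]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# The sentencizer compares whole token texts against punct_chars.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", ";", MIDDLE_DOT]})

# Hypothetical sample text, not taken from the original post.
doc = nlp(f"λόγος πρῶτος{MIDDLE_DOT} λόγος δεύτερος.")

tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]
print(tokens)     # the middle dot should show up as its own token
print(sentences)  # and the text should be split after it
```

If the dot does not appear as a separate entry in `tokens`, the sentencizer has nothing to match against, whatever is listed in `punct_chars`.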

jcbmyrstn · September 22, 2020, 4:55pm

Thanks.
The middle dot was actually being tokenized. It turned out to be a problem with the text editor I was using, which replaced the middle dot with a similar-looking dot. This seems to be a common problem with some Greek Unicode characters, which some text editors replace with alternatives. I wonder if I should add those punctuation marks to the language subclass or just leave it as it is, to force future users to use the right Unicode characters.
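
For anyone who runs into the same thing, here is a small standard-library sketch for finding out which dot is really in the text. The helper name `describe_punctuation` and the sample string are hypothetical, and the three code points are only examples of look-alikes, not necessarily the one my editor inserted.

```python
import unicodedata

def describe_punctuation(text):
    """List each distinct non-ASCII punctuation or symbol character with its code point and name."""
    seen = []
    for ch in text:
        # Unicode categories starting with P (punctuation) or S (symbol)
        if ord(ch) > 127 and unicodedata.category(ch)[0] in "PS" and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]

# Hypothetical sample containing three visually similar dots.
sample = "λόγος\u00b7 λόγος\u0387 λόγος\u2219"

for row in describe_punctuation(sample):
    print(row)

# NFC normalization maps GREEK ANO TELEIA (U+0387) to MIDDLE DOT (U+00B7),
# but leaves unrelated look-alikes such as BULLET OPERATOR (U+2219) untouched.
normalized = unicodedata.normalize("NFC", sample)
```

Normalizing the input to NFC before tokenizing covers the ano teleia case; other look-alikes would still have to be added to `punct_chars` or replaced explicitly.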
