ValueError when trying to export Cyrillic annotations from a reviewed dataset

ivo.s · October 4, 2023, 9:42am

I am trying to export NER annotations in Cyrillic, that were annotated by several annotators and then reviewed and saved in a separate dataset, but I get the same unicode related error. I don’t have that issue if I try to export any single annotator’s work, so it seems that something went wrong while doing the review.

Tried several things:

data-to-spacy
prodigy data-to-spacy ./corpus --ner ner_news_final --eval-split 0.2 --base-model bg_model --lang bg --verbose
db-out
prodigy db-out ner_news_final > ./ner_news_final.jsonl
even printing returns the same error
prodigy print-dataset ner_news_final | less -r

This is the error for all attempts:

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
Traceback (most recent call last):
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/__main__.py", line 63, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 872, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/train.py", line 481, in data_to_spacy
    train_docs, dev_docs, pipes = merge_data(
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 139, in merge_data
    corpus = create_merged_corpus(**readers, eval_split=eval_split)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 857, in create_merged_corpus
    data[reader_name] = reader(
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 978, in read_ner_annotations
    examples, eval_examples = get_train_eval_examples(
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 927, in get_train_eval_examples
    examples = load_examples(DB, datasets)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 156, in load_examples
    examples = db.get_dataset_examples(set_id) or []
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/components/db.py", line 589, in get_dataset_examples
    examples = list(self.iter_dataset_examples(name, session=session))
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/components/db.py", line 612, in iter_dataset_examples
    yield eg.load()
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/components/db.py", line 189, in load
    return cast(Dict[str, Any], srsly.json_loads(content))
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/srsly/_json_api.py", line 39, in json_loads
    return ujson.loads(data)
ValueError: Unterminated unicode escape sequence when decoding 'string'

I also tried connecting to my DB and by iterating through each row I can narrow it down to around the 32nd example in the dataset, when I get the same error returned:

from prodigy.components.db import connect

db = connect(
    db_id="mysql",
    db_settings={
        "user": "#########",
        "password": "#########",
        "host": "#########",
        "port": "#########",
        "database": "#########",
        "ssl": {"ssl": {"ssl-ca": "certificate.crt.pem"}},
    },
)

dataset = db.iter_dataset_examples("ner_news_final")

I’m on Mac M1, tried both arm64 and x86 environments. Prodigy is version 1.13.1.

Is there something I can do on my end to work around this issue?
If not is there any way I can drop the first 32 values from my dataset and save it in a new one and export from there? Can’t figure out how to filter the datasets manually and I was hoping to avoid going through the whole review process again (~2000 examples)

Thanks,
Ivo

koaning · October 5, 2023, 2:41pm

Hi there.

I'll gladly help you out here. It might be a bug on our end, but let's see if a custom Python snippet might do the trick.

You mentioned that you tried iterating yourself, do mean via a script like below or something else?

from prodigy.components.db import connect() 

db = connect()
examples = db.get_dataset_examples("ner_news_final")

Am I correct in saying this script runs with an error? If so, could you share some Cyrillic text so that I may try and reproduce the error locally?

ivo.s · October 6, 2023, 7:49am

Thanks for the reply, @koaning!
Yes, I first tried with db.get_dataset_examples("ner_news_final") and got the same ValueError, like in the original post. So instead I was able to print the first 31 values using this :

dataset = db.iter_dataset_examples("ner_news_final")
for element in dataset:
    print(element)

Sharing data seems tricky, as I can't figure out which of the text examples is causing that error (the order in the UI seems to be different that the order that I get when iterating from the DB, so with a random subset of the data you might not be able to reproduce it.

I can share my full dataset (~10.2MB txt file), just let me know what would be the best way to do that?

In the meantime here is a small sample:

Основната цел - да бъдат "изпрани" споменатите от Петков В размяната на реплики между прокуратурата и премиера политологът и преподавател Евгений Дайнов вижда две явни и една скрита картини. Той смята, че ситуацията е "чисто политическа" и че тя е основана на "забавянето на законодателството, което да освободи обществото от произвола на прокуратурата". Според Дайнов именно вследствие на това забавяне Кирил Петков е решил да предприеме "някакви действия" и затова е съобщил имената на основните замесени в корупционни схеми в България. Включително той направи това и по време на посещението си в Брюксел наскоро, когато за пореден път спомена поименно депутата от ДПС Делян Пеевски. " Те се знаят от всички, нали не си правим други илюзии" , казва Дайнов с уточнението, че в действията на прокуратурата пък се забелязва "очевидната цел да изпере тези 20 души". "Не бива да забравяме, че немалка част от тези хора са с доказани в годините отношения или с Бойко Борисов, или със самата прокуратура", допълва политологът. Той обаче е на мнение, че сметката, която Гешев си е правел с призоваването на премиера, за да го накара да даде "списък с лица", е излязла крива.

По думите на Дайнов причината за това са културологичните разлики между Петков и Гешев, който "по-скоро е подобен на Борисов". От действията и на главния прокурор, и на министър-председателя се вижда, че първият опитва да насажда "чисто политически страх". Но именно заради това, че Гешев и управляващите са "две вселени" този път номерът не минава, смята още преподавателят. Евгений Дайнов вижда зад всичко това и "скрита картина". По думите му тя показва, че хора от типа на Борисов, Пеевски и подобните на тях, които Петков изброява, "колкото и пари да поглъщат, дължат много повече на доста неприятни типове". Затова, когато те не са във властта, "изпадат в истерия". По отношение на въпроса дали публичните твърдения на премиера за това кой стои в основата на корупцията в България са премерени или не, политологът не вижда нищо нередно: "Назовава нещата, които всички ние с години си говорим и знаем", казва Дайнов.

Нека за миг да си представим, че сме на мястото на новите управляващи. Малко след вземането на властта имаме да потушаваме пожар на няколко фронта едновременно - първо с високите цени на тока и газа, второ, заради липсата на приет бюджет и работещи схеми за компенсации. В същото време сменяме ръководствата на Комисията за енергийно и водно регулиране, както и на "Булгаргаз", където пък започва и разследване, но на миналото ръководство. И за капак кметове от ГЕРБ задружно се оплакват от високите сметки за ток и спират лампите. Зад всичко това реално стоят политически въпроси. В епизода ще чуете още: Как новото правителство реагира на кампанията на ГЕРБ и спирането на тока в редица общини и защо това може да е капан?

Какви ходове предприе досега четворната коалиция и какви ходове ѝ остават? Доколко цената на газа зависи от решенията на "Булгаргаз", след като вътрешният министър Бойко Рашков обяви, че започва разследване срещу държавната компания? Какви са вариантите пред новия шеф на "Булгаргаз" Людмил Йоцов и следователно опасенията? Какво представлява "връщането на посредниците в бизнеса", които купуват газа от Русия и го препродават на вътрешния пазар? Какво стана с компенсациите за високите сметки на бизнеса и общините?

koaning · October 6, 2023, 11:46am

Do you have the original examples.jsonl file around? It seems like example #32 is a culprit.

If you want to share the full file you can send it to vincent@explosion.ai. Might be the easiest for me to just try and explore the full dataset. I'll delete said file from my end once this thread is solved.

ivo.s · October 6, 2023, 12:31pm

I've send the original data in an email. The only caveat is that the order in the original file doesn't seem to be the same as the order that I'm getting when iterating from the DB. Maybe after the initial annotation or during the review, it somehow got mixed up?

koaning · October 9, 2023, 11:14am

I have tried the examples in your email but could not replicate the issue.

I first loaded all the data in.

python -m prodigy db-in issue-db-in issue-6826.txt

And then was able to print it all out by running:

python -m prodigy db-out issue-db-in

Do you have the ner.manual call that you've used? Part of me is wondering if the issue is in the label name instead of the text? Alternatively you could also try to edit the local internal code to print something that might help us debug.

Notice this error:

  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/srsly/_json_api.py", line 39, in json_loads
    return ujson.loads(data)
ValueError: Unterminated unicode escape sequence when decoding 'string'

Could you try running the command again but by adding print(data) on line 39 in the /Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/srsly/_json_api.py file? Theoretically that should print the whole thing before it is turned into JSON. Hopefully that'll help us reproduce.

ivo.s · October 9, 2023, 2:05pm

Thanks for taking the time to check that,@koaning!

I didn't actually annotate it using ner.manual, but instead I used the WikiANN silver annotation workflow as suggested by Lj Miranda in here. Using the predictions from WikiANN model, two annotators corrected the silver annotations – I used the ner.correct recipe for that, this the exact call :

python -m prodigy ner.correct ner_news_updated ./data/silver_annotations_model/ ./data/new_train.txt --update --unsegmented

I did what you suggested and I was able to print everything until the error and I narrowed it down to the example that returns the error. This is the last lines of the output before the error :

{"text":"\\u0441\\u043b\\u0435\\u0434\\u044f\\u0442","start":522,"end":528,"id":101,"ws":true},{"text":"\\u0441\\u0442\\u0430\\u0442\\u0438\\u0441\\u0442\\u0438\\u043a\\u0438\\u0442\\u0435","start":529,"end":541,"id":102,"ws":true},{"text":"\\u0437\\u0430","start":542,"end":544,"id":103,"ws":true},{"text":"\\u0441\\u043c\\u044a\\u0'

And this is the actual example:

"Анализът е препубликуван от "Тоест". "Шоуто трябва да продължи" - със заглавието на финалната песен от последния албум на Queen преди смъртта на Фреди Меркюри може да се обобщи същността на шоубизнеса. Още неизлезли от пандемията от COVID-19, виждаме началото на война, която ни засяга пряко. Ала колкото по-мрачни са перспективите, толкова по-силна е потребността за бягство от реалността под формата на забавление. Европа на 20-те и 30-те години на ХХ век знае най-добре. Затова не е учудващо, че вместо всекидневно да следят статистиките за смъртността от коронавируса или да се питат откъде са се появили танковете в Донбас, мнозина у нас се занимават с един ерген и потенциалните му избранички. Впрочем няма пречка човек да обсъжда с еднаква по интензитет страст и това, и сериозните обществени проблеми. Новото риалити шоу по bTV "Ергенът" предизвика реакции дори у хора, които не са го гледали. Концепцията му е следната: 22 жени се борят за сърцето на един мъж. Онези, които не получат от ергена роза, отпадат докато не остане избраницата на главния герой. Получилата "специалната" роза е фаворитката на ергена за конкретната серия. Борбата се провежда на екзотично-романтичен фон - в луксозен частен комплекс в Турция."

At the end it seems like the end of the string is incomplete –\\u0, but I have no idea what is causing that, as the word in question looks ok in Cyrillic (коронавируса).

Since I now know which example is causing the error – is this something that I could somehow drop from the reviewed dataset easily?

Many thanks for your help

koaning · October 9, 2023, 4:01pm

The Prodigy API assumes append-only operations but you should be able to use your favorite SQL tool to connect to the database and delete the example manually. That seems like the easiest approach.

Let me know if that does not work or if you need further aid!

ivo.s · October 12, 2023, 9:36am

Quick update: So I manually removed ~100 examples from my dataset that for some reason where saved partially (examples below). And while doing that I remembered that I may have caused that myself, because when I started doing the review, I couldn’t save my annotations and got this - ValueError: Can't add JSON example to MySQL database: blob is longer than 65535 characters. This can lead to MySQL truncating your data. If possible, segment your examples into smaller chunks or limit what you're including in the data. You can also change the field type to mediumblob and set the environment variable PRODIGY_MYSQL_MAX_LEN=16777215

And so I did export PRODIGY_MYSQL_MAX_LEN=16777215 which enabled me to continue annotating. Could that have prevented some longer examples of being saved? And are my examples simply too long? Seems like Cyrillic texts might take more space, because of the encoding?

Example of partially saved example:

(8365, 138592518, 1960406803, b'{"text":"\\u042e\\u0440\\u0438\\u0441\\u0442\\u044a\\u0442 \\u0414\\u0438\\u043c\\u0438\\u0442\\u044a\\u0440 \\u0414\\u0435\\u043b\\u0447\\u0435\\u0 ... (70591 characters truncated) ... tamp":1694606770,"_annotator_id":"ner_news_updated-ivo","_session_id":"ner_news_updated-ivo","sessions":["ner_news_updated-ivo"],"default":false}],"v')

Correctly saved:

(8366, 489287580, -1326836268, b'{"text":"\\"\\u0418\\u0441\\u043a\\u0430\\u043c\\u0435 \\u043e\\u0442 \\u043f\\u0440\\u0435\\u043c\\u0438\\u0435\\u0440\\u0430 \\u0434\\u0430 \\u044 ... (44490 characters truncated) ... notator_id":"ner_news_updated-ivo","_session_id":"ner_news_updated-ivo","sessions":["ner_news_updated-ivo"],"default":false}],"view_id":"ner_manual"}')

Thanks a lot for the help once more @koaning, I’ve managed to export my data and train my model!

Topic		Replies	Views
UnicodeEncodeError during training ner , spacy , solved	6	2091	November 13, 2018
train ner dataset -> ValueError: too many values to unpack ner , done	6	2620	January 10, 2020
How can I correct my annotations using the NER.manual recipe?	5	251	May 22, 2023
UnicodeDecodeError while training japanese model usage , spacy	3	1663	February 6, 2019
UnicodeDecodeError:	12	384	November 1, 2022

ValueError when trying to export Cyrillic annotations from a reviewed dataset

Related topics