Unexpected character in found when decoding object value

Hi,

Prodigy 1.7 works fine with SQLite, but when I try it with Postgres, I get this error:

Traceback (most recent call last):
File "/produit/anaconda/anaconda353/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/produit/anaconda/anaconda353/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/produit/anaconda/anaconda353/lib/python3.7/site-packages/prodigy/main.py", line 331, in
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 224, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "cython_src/prodigy/util.pyx", line 49, in prodigy.util.get_config
File "cython_src/prodigy/util.pyx", line 469, in prodigy.util.read_json
File "cython_src/prodigy/util.pyx", line 470, in prodigy.util.read_json
ValueError: Unexpected character in found when decoding object value

What is wrong?

Hi! What command were you running that caused this error?

Based on the message and traceback, it looks like it's trying to load content from a JSON file, but the file contents aren't valid JSON. Are you loading in patterns or input data from JSON and if so, did you double check that it's valid?

I'd be very surprised if this was related to Postgres – the database is really only used to save annotations and is completely separate from any processes that load data etc.
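
If it helps, here's a minimal sketch for checking a JSONL file line by line (the filename is just a placeholder):

import json

# Minimal sketch: report the first line that isn't valid JSON in a JSONL file.
# "data.jsonl" is a placeholder filename.
with open("data.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except ValueError as e:
            print(f"Line {i} is not valid JSON: {e}")
            break
    else:
        print("All lines parsed as JSON.")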

I launched this one:
prodigy ner.teach mes_donnees en_core_web_sm phrases.jsonl

and this is my JSONL content (with no linefeed at the end of the file):

{"text": "C'est lundi"}
{"text": "C'est mardi"}
{"text": "C'est une belle journée"}
{"text": "C'est chouette la neige"}
{"text": "C'est pas vrai"}
{"text": "Il fait beau"}
{"text": "C'est noel"}
{"text": "C'est l'été"}
{"text": "Il fait beau en été et froid en hiver"}
{"text": "C'est lundi"}
{"text": "C'est lundi"}
{"text": "C'est lundi"}
{"text": "C'est lundi"}
{"text": "C'est lundi"}
{"text": "C'est lundi"}

What does your prodigy.json look like? I just checked, and I'm pretty sure the error occurs when Prodigy is loading the configuration from JSON. When you added the database details, maybe you accidentally introduced invalid JSON in your config file?
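
If it's hard to spot by eye, a quick sketch like this will point at the line and column where the JSON parser trips, e.g. over a misplaced comma (adjust the path to wherever your prodigy.json lives):

import json

# Minimal sketch: locate the first syntax error in a JSON config file.
try:
    with open("prodigy.json", encoding="utf8") as f:
        json.load(f)
    print("prodigy.json is valid JSON")
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")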

Thank you for your reply. I had a misplaced comma in the prodigy.json for the Postgres connection.

Best regards.

I had a similar issue but the cause was different. We hit the limit of the blob size in MySQL.

The command I ran:

prodigy train ner hc_18052020_GOLD data/ecommerce/2020_04/tmp_model_nc_150520_v2  --eval-split 0.3 --n-iter 50 --output /data/ecommerce/2020_04/hc_18052020_gold

The error I got:

✔ Loaded model
'/home/ubuntu/prodigy/data/ecommerce/2020_04/tmp_model_nonclaims_150520_v2'
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/recipes/train.py", line 103, in train
    data, labels = merge_data(nlp, **merge_cfg)
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/recipes/train.py", line 359, in merge_data
    ner_examples = load_examples(DB, ner_datasets)
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/recipes/train.py", line 528, in load_examples
    examples = db.get_dataset(set_id) or []
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/components/db.py", line 337, in get_dataset
    return [eg.load() for eg in examples]
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/components/db.py", line 337, in <listcomp>
    return [eg.load() for eg in examples]
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/components/db.py", line 99, in load
    return srsly.json_loads(content)
  File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/srsly/_json_api.py", line 38, in json_loads
    return ujson.loads(data)
ValueError: Unexpected character in found when decoding object value

I was unable to load a dataset for modeling that had previously been saved by Prodigy to the MySQL database. On checking the individual annotations, I found that the size of the text plus annotations exceeded 65,535 bytes (the maximum size of a BLOB column in MySQL), so MySQL truncated the JSON, resulting in a bad JSON record.

This was not obvious to me when saving the annotation for the text, but the issue appeared when I tried to load the saved annotation to build a model.

This is how my JSON looked after the annotation.

{"text":"pure nv balancing conditioner: infused with argan oil, keratin, collagen, natural vitamins, and lavender for smoother, more manageable hair- sulfate & sodium chloride free (33.8 oz bottle). deep conditions-pure nv balancing conditioner is formulated to improve the appearance and feel of your hair by moisturizing your dry damaged locks building body, improving luster and ..", "spans":[...], ..., "tokens": [..., {"text":"hair"

The JSON ended abruptly. When I looked at the size of the JSON, it was 65535 characters.

Is there a workaround for this? Should I break up the input documents into smaller pieces? Or move to a different db?

@ramhari7 Ah, this actually came up the other day in this thread using MySQL as well, and I was surprised that the MySQL database and/or peewee didn't raise an error here when the example was saved and truncated.

If you can use shorter text, that's always a plus. As I mentioned in the other thread, especially if you're doing NER, there's not really a benefit in annotating whole large documents because you're typically training models with much narrower context windows anyways. So working with shorter examples lets you collect more datapoints and makes the resulting annotations easier to work with in general.

(I still want to get to the bottom of the MySQL situation, though. If it doesn't do it by default, maybe we can add some logic that at least outputs a warning if a database length limit is hit.)
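
In the meantime, if you want to check which stored examples were already truncated, here's a rough sketch that queries the MySQL example table directly (connection details are placeholders, and pymysql is just one possible driver):

import json
import pymysql

# Rough sketch: find rows that hit the 65535-byte BLOB limit and no longer
# parse as JSON. Table/column names follow Prodigy's default schema.
conn = pymysql.connect(host="localhost", user="prodigy", password="xxx", db="prodigy")
with conn.cursor() as cur:
    cur.execute("SELECT id, content FROM example WHERE LENGTH(content) >= 65535")
    for row_id, content in cur.fetchall():
        try:
            json.loads(content)
        except ValueError:
            print(f"Example {row_id} looks truncated (invalid JSON)")
conn.close()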

A warning would really be helpful.

I came across this error/limitation today too. There were no warnings or errors when saving the annotations, but I cannot export the data with db-out or the Python DB API. I got suspicious when debugging and seeing that the length of the string was 65,535...

I wanted to add my reasoning as an annotator (based on limited experience): using both ner.manual and ner.correct, it was helpful to have full articles to disambiguate between (for instance) a CEO and an ORG. Frequently in our corpus, the CEO is referred to by surname once they have been introduced, and there are many companies that have surnames as company names. So I found myself wanting to go back to check, and resorted to the --unsegmented argument to ner.correct. Is there a better way to get that context?

Also, I have quite a few annotated articles that might be voided now -- is there a workaround for this?

In case this helps someone else, the database change (NOTE: I'm using MySQL) to accept longer snippets is:
ALTER TABLE example MODIFY content mediumblob;

To clean up the garbled blobs (I'm hard coding accept here):

DELETE l from link l inner join example e on l.example_id=e.id where e.content not like '%"accept"}' and LENGTH(e.content) = 65535;
DELETE from example where content not like '%"accept"}' and LENGTH(content) = 65535;

Thanks for the update, that's good to know! For v1.10, we'll add a check to the database handler before an example is added to the database, and it will raise an error if the database used is MySQL and the blob is too long.

For Prodigy v2, we'll be migrating the database to SQLAlchemy, which will also include various improvements to the way the data is stored. But this will require a database migration, so it's really only something we can do for a major version.

If you don't have to go too far back, you could annotate smaller chunks (like, 2 paragraphs at a time) and then go back and undo if you want to check a previous example. However, if the disambiguation is very difficult based on the local context, it's still possible that it'll be harder for the model to learn and you'll see worse results later on.

Maybe you just want to implement your own segmentation strategy that's somewhere between sentences and the full texts? Like, a couple of paragraphs or any other logical unit. And then you can feed the pre-segmented examples into Prodigy.
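
For example, something along these lines (a minimal sketch that groups paragraphs split on blank lines into small chunks and writes them as JSONL for Prodigy; file names and chunk size are placeholders):

import json

MAX_PARAS = 2  # placeholder: how many paragraphs per example

# Minimal sketch: chunk an article into groups of paragraphs and write JSONL.
with open("article.txt", encoding="utf8") as f_in, open("chunks.jsonl", "w", encoding="utf8") as f_out:
    paragraphs = [p.strip() for p in f_in.read().split("\n\n") if p.strip()]
    for i in range(0, len(paragraphs), MAX_PARAS):
        chunk = "\n\n".join(paragraphs[i:i + MAX_PARAS])
        f_out.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")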

Thank you for the reply, @ines!

Yes, I missed pointing out I use a MySQL database, I'll update the post to clarify!

I understand that if I have to go back far to disambiguate, then the model, which doesn't use that much context, will struggle -- this makes perfect sense!

However, for us, the article is a natural boundary, and it also helps us understand how much data was used for training/testing. Apart from the error in the database, are there other reasons to split these up? Would it make any difference for the training downstream? I think the training data is shuffled for each epoch, so maybe this would make a difference, especially if the articles are really long and there aren't that many of them.

Aside from that, I would be curious if you have any recommendations on how to solve the disambiguation between PERSON and ORG for the model. For instance:

''xxxxx", said Siemens USA CEO Barbara Humpton. [Barbara Humpton correctly annotated as a PERSON]
...
Humpton has made a bid for X. [Humpton incorrectly annotated as an ORG].

My current thinking is to post-process the output and programmatically look for clashes, using those as a basis for removing the incorrect ORG label. Maybe this could even be supported by probabilities from entity linking, especially to avoid removing instances where Humpton really is an ORG! Is this type of post-processing the correct path to go?
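
Roughly the kind of clash check I have in mind (just a sketch; the data here is made up, and a real version would read the spans from the saved annotations):

# Sketch: if a surname was annotated as PERSON anywhere in the article,
# flag later ORG spans with the same surface text. Example data is made up.
annotated_spans = [
    {"text": "Barbara Humpton", "label": "PERSON"},
    {"text": "Humpton", "label": "ORG"},
]

person_surnames = {s["text"].split()[-1] for s in annotated_spans if s["label"] == "PERSON"}
for span in annotated_spans:
    if span["label"] == "ORG" and span["text"] in person_surnames:
        print(f"Possible clash: {span['text']!r} tagged ORG but also seen as a PERSON surname")
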
Thanks!

It might, depending on the settings. spaCy has support for treating longer sequences that span multiple sentences as a single instance. This is nice for cases where the sentence boundaries aren't so reliable, or where the document doesn't necessarily divide into just sentences. This works well for instances that are about a paragraph or so, but the model might struggle if the documents are really long.

Just released Prodigy v1.10, which will now show an error if the content length exceeds 65535 characters for MySQL databases. This ensures that the database/peewee doesn't just silently store truncated examples. In v2, the constraint won't be a problem anymore, as we'll restructure the database and migrate to SQLAlchemy.

Thank you for laying out the potential risks of longer documents. I've come across incorrect segmentations quite a bit in our corpus: we have a lot of tickers in the form Exchange:Ticker, often in parentheses, so that was another reason I started using --unsegmented in ner.correct, leading to long documents. However, I think this might just be fooling myself, as the same tokenizer is going to be used at the inference stage.

When you say that spaCy has support for treating multiple sentences as a single instance, is that an option somewhere? I have not been able to find it -- still somewhat new to spacy/prodigy.

It's just a question of how you process your text before you pass it into spaCy. You can have one sentence per Doc object, or a Doc object could include multiple sentences. You might find the textacy library helpful for preprocessing.
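
For instance, a minimal sketch (assuming en_core_web_sm is installed, and using a naive split just for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: this model is installed

text = "Siemens appointed a new CEO. Humpton has made a bid for X."

# One Doc containing multiple sentences:
doc = nlp(text)
print([sent.text for sent in doc.sents])

# One Doc per pre-split unit (naive split, just for illustration):
for part in text.split(". "):
    print(nlp(part))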

Hello @ines, I might be wrong, but based on the error message I was expecting that after I perform the ALTER, saving the annotations would work, since the limit in the DB is now 16777215.

ValueError: Can't add JSON example to MySQL database: blob is longer than 65535 characters. This can lead to MySQL truncating your data. If possible, segment your examples into smaller chunks or limit what you're including in the data. You can also change the field type to mediumblob:
ALTER TABLE example MODIFY content mediumblob;

Ohh, damn, we added that ALTER TABLE suggestion before we decided to actually make it raise an error instead of a warning :woman_facepalming: And at the moment, it's not actually checking for the real DB limit. I'll add an option for configuring the limit via an environment variable for the next version – I think this gives people the most flexibility.

As a quick workaround in the meantime: find your Prodigy installation (e.g. by running prodigy stats), open prodigy/components/db.py and remove the block that raises the error!

Just realised I forgot to update this thread: v1.10.2 now lets you set the PRODIGY_MYSQL_MAX_LEN environment variable to define the maximum allowed length, e.g. after manually configuring your tables. If the text is longer than that, Prodigy will raise an error.
