db-in - Not a valid input file

Hi,

I am saving my data using srsly.write_jsonl, but I'm still getting: Not a valid input file. I think it's because, for some reason, not all of my text loads with double quotes.

For example, I use json.dumps to get it into this format:

['{"text": "Slow service and food is really growing expensive for what you get. ", "meta": {"review_id": "170369"}}',
 '{"text": "Ill really not wanna eat at rocomamas middelburg the service was poor ", "meta": {"review_id": "170871"}}']

But then when it loads back, some texts load with single quotes and some with double:

[{'text': "I am really not happy by the level of service I got from Panarotis Maponya mall on the 25 Oct 2017 I waited over an hour  , they took my order in incorrectly, I am really angry that even in 2017 people don't understand the standard of service they need to provide, Lindiwe tried to help and escalated to the manager ....I really appreciate what the manager tried to do but I really need to vent   because he made me wait, really the level of service from Panarottis is pathetic, from 20:20pm till 10:00Pm don't have what I ordered, I wanted my money back reversed into my account plus the tip I gave Shakes , I am not writing this because I want a free pizza but it about the principle and integrity, why did this guy make me pay for an order which was never placed and when I approached him ,he looked at me like he didn't know I ordered, some guy by the name of Mpilo Dube was really helpful with great service and he really did go an extra mile and I would like to compliment him for that. The manager left me alone until Mpilo intervene. So I m really not happy I love panarottis and the food, but I guess today they were not at their best. ",
  'meta': {'review_id': '170368'}},
 {'text': 'Slow service and food is really growing expensive for what you get. ',
  'meta': {'review_id': '170369'}}]

I'm really not sure why, or whether this is why Prodigy is giving an error as well. Any advice would be appreciated.

I've managed to use json.dumps() on the text only, which results in the double quotes being included and escaped in the text when it displays in Prodigy. This feels like a bit of a workaround... Not sure if it's the only way to save and display it.

What's the data you're passing to srsly.write_jsonl? It should be a list of dictionaries, each representing one line. Each item should be a dictionary, not a string – otherwise, the result will look like the one you posted above.

import srsly

# each item is a dict – srsly.write_jsonl writes one JSON line per item
data = [{"text": "hello"}, {"text": "world"}]
srsly.write_jsonl("./data.jsonl", data)

You shouldn't have to call json.dumps on the text as well – just json.dumps on each line (which srsly.write_jsonl is a shortcut for).
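
Under the hood, that's roughly equivalent to this (a sketch, not srsly's exact implementation):

import json

data = [{"text": "hello"}, {"text": "world"}]

# one json.dumps call per example, one example per line
with open("./data.jsonl", "w", encoding="utf8") as f:
    for eg in data:
        f.write(json.dumps(eg) + "\n")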

The result will look like this:

{"text": "hello"}
{"text": "world"}

That's just Python – it'll use single quotes by default, unless the ' character appears in the text, in which case it'll use double quotes. How Python represents the quotes doesn't matter – the data just needs to be a valid dictionary.
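
You can see this in a quick Python session:

>>> {"text": "hello"}
{'text': 'hello'}
>>> {"text": "don't worry"}  # the apostrophe makes Python use double quotes
{'text': "don't worry"}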

Thanks, @ines...

So I think I'm saving the file correctly, but Prodigy is doing something really strange now. I load some new data to label, and after a few samples I get this message:

Warning: filtered 55% of entries because they were duplicates. Only 147 items
were shown out of 325. You may want to deduplicate your dataset ahead of time to
get a better understanding of your dataset size.

Not even 147 items are shown before I get No tasks available. When I exit, resave and reload the .jsonl file with the remaining items that weren't shown, Prodigy shows 30-100 more items and then the same thing happens all over again.

I am not using json.dumps anymore. This is the format for each text loaded:

{'text': 'Worse service...most arrogant service from Chad, the Manager ',
  'meta': {'review_id': '122005'}}

All review_ids are unique.

The duplication check reports how many identical examples (which are filtered out by default) were found in the data, to give you a heads-up that something might be going wrong. So it looks like about half of the texts in the JSONL file are duplicates.
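
If you want to verify that ahead of time, here's a quick sketch that counts repeated texts (assuming your file is at ./data.jsonl – identical "text" values are the most likely cause of duplicate examples):

import srsly
from collections import Counter

examples = list(srsly.read_jsonl("./data.jsonl"))
counts = Counter(eg["text"] for eg in examples)
duplicates = sum(n - 1 for n in counts.values() if n > 1)
print(f"{len(examples)} examples, {duplicates} duplicate texts")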

Which recipe are you using, and do you already have examples in the dataset? Questions that you've already answered will be filtered out by default, so you're not annotating the same example twice. And if you're using an active learning-powered recipe like textcat.teach, it's expected that examples are skipped: the recipe will try to find you the most relevant examples in a stream of batches and it will skip examples with very high/low scores in favour of uncertain predictions.
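
If you're not sure whether the dataset already contains answers, a quick sketch using Prodigy's database API can tell you (my_dataset is a placeholder for your dataset name):

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("my_dataset")  # all annotations saved so far
print(len(examples))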

Thanks, @ines.

There definitely aren't many duplicates in the reviews I am uploading. Occasionally very short reviews repeat, like "Bad service", for example, but there are nowhere near as many of those as the warning implies.

Also, it doesn't show me as many items as it claims before No tasks available comes up. The warning said 147 items were shown, but only ~35 actually were. So that leads me to believe as well that something else may be wrong...