You can try using the pandas library, as it has a to_json method that can write the data out in JSONL format (just make sure you pass lines=True together with orient="records"). You can check this StackOverflow answer for more information.
JSON and JSONL (newline-delimited JSON) are both fine input formats and I think there's only one tiny problem in your input file, otherwise it looks good. If you look closely at the structure, it currently looks like this:
{"text": [{"text": "blah"}, ...]}
But you want it to be just this:
[{"text": "blah"}, ...]
So if you're saving it out in Python, doing something like data["text"] should give you just the list. And then you can save that to a .json file.
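A minimal sketch of that unwrapping step, assuming the nested data is already loaded as a Python dict. The unwrap_text helper and the sample data are made up for illustration and are not taken from your file:

```python
import json

def unwrap_text(data):
    """Return the inner list of {"text": ...} records from {"text": [...]}."""
    records = data["text"]
    if not isinstance(records, list):
        raise ValueError('expected data["text"] to be a list of records')
    return records

# Hypothetical sample mirroring the nested structure shown above
nested = {"text": [{"text": "blah"}, {"text": "more blah"}]}
records = unwrap_text(nested)

# Serialize just the list; this string is what would go into the .json file
json_str = json.dumps(records)
print(json_str)
```

The only point here is that the top-level value written to disk should be the list itself, not a dict wrapping it.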
Can you share the raw JSON it generated (instead of the visual preview)? Maybe pandas ended up actually exporting it as a dict with keys 0 instead of a list, or something like that?
The second version is definitely not correct, because you want a list of dictionaries with the key "text", e.g. [{"text": "..."}].
For the code, you only need to have a text column in your pandas DataFrame that contains exactly your text. Assuming you have a CSV file where each line is a text:
# test.csv
Welcome to Prodigy!
I love playing baseball
My brother went to the library
You can then convert them using this script:
import pandas as pd
# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)
# Convert to JSONL
jsonl_str = df.to_json(orient="records", lines=True)
# Inspect JSONL string
print(jsonl_str)
# Create a file path to the intended json file
jsonFilePath = 'trial3.json'
# create new json file and write data on it
with open(jsonFilePath, 'w') as jsonFile:
    # write the JSONL string produced by to_json above
    jsonFile.write(jsonl_str)
Now when I open the JSON file in a web browser, I get the error:
SyntaxError: JSON.parse: unexpected non-whitespace character after JSON data at line 2 column 1 of the JSON data
Oh, have you tried the pandas DataFrame suggestion above? We might have replied at the same time. Saving with the json module can sometimes be tricky. At least with the DataFrame approach, it should be handled already.
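For what it's worth, the browser error above is what I'd expect when a JSONL file is parsed as one single JSON document: the parser finishes the first object and then runs into a second one on line 2. Here is a small sketch of the difference using only Python's json module. The sample strings and the parse_jsonl helper are just for illustration:

```python
import json

# Two records in JSONL form: one JSON object per line
jsonl_str = '{"text": "Welcome to Prodigy!"}\n{"text": "I love playing baseball"}\n'

def parse_jsonl(s):
    """Parse newline-delimited JSON by decoding each non-empty line separately."""
    return [json.loads(line) for line in s.splitlines() if line.strip()]

# Parsing the whole string as a single JSON document fails,
# because there is extra data after the first object
try:
    json.loads(jsonl_str)
    parsed_as_single_document = True
except json.JSONDecodeError:
    parsed_as_single_document = False

# Parsing line by line works
records = parse_jsonl(jsonl_str)
print(parsed_as_single_document, records)
```

So the file content itself may be fine as JSONL even though a viewer that expects plain JSON rejects it.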
@ljvmiranda921 this worked when I had the df as a dictionary. When I tried it again after following your suggestion, I got: str object has no attribute 'to_json'
In this approach, we save the file straight to JSONL. We don't need to go through its "string" representation. The output should already be a file in your Windows folder.
The reason we got the error str object has no attribute 'to_json' is that to_json is a DataFrame method, and a plain string has no direct way of converting itself into a JSONL file. We would need to jump through a few hoops (like importing the json package, etc.).
However, if we call df.to_json on the DataFrame and supply the file path as one of the parameters, we skip the hassle and get our file right away.
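To make the difference concrete, here is a small sketch assuming pandas is installed. The in-memory DataFrame and the temporary output path are stand-ins for illustration, not your actual data or folder:

```python
import os
import tempfile

import pandas as pd

# Hypothetical in-memory data standing in for the CSV contents
df = pd.DataFrame({"text": ["Welcome to Prodigy!", "I love playing baseball"]})

# Without a path, to_json returns a plain Python string
jsonl_str = df.to_json(orient="records", lines=True)

# With a path, to_json writes the file itself and returns None
out_path = os.path.join(tempfile.mkdtemp(), "trial.jsonl")
result = df.to_json(out_path, orient="records", lines=True)

# Read the file back to confirm it was written
with open(out_path, encoding="utf-8") as f:
    written = f.read()

print(type(jsonl_str).__name__, result)
```

Calling to_json on jsonl_str would fail, since it is a str and not a DataFrame, which matches the error you saw.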
Ok sure! Let us know how it went. After cleaning your CSV, you can try this again:
(Assuming your CSV looks like this)
# test.csv
Welcome to Prodigy!
I love playing baseball
My brother went to the library
You can try this:
import pandas as pd
# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)
df.to_json(
    r'C:\Users\b1075161\Documents\Prodigy\prodogy_files\trial.jsonl',
    orient="records",
    lines=True
)
@ljvmiranda921 I think the problem was that I kept saving as json instead of jsonl. One of the problems, so to say.
Now I have another issue.
When I load the jsonl file into Prodigy, it says:
Error while validating stream: no first example
This likely means that your stream is empty.This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.
I am getting this error after running this:
!python -m prodigy ner.teach test en_core_web_trf trial8.jsonl
My CSV is similar to the one you gave. Could it be a problem with using a pretrained spaCy model?
My main goal is to train new named entities on an empty model.
Hi @Zim1-finest ! It's not about the pretrained spaCy models.
The problem is about the column names. Try this step.
Notice the rename step there. We're renaming column 0 with text, so that once it's saved into JSONL, the text column shows up:
import pandas as pd
# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)
df = df.rename(columns={0:"text"})
df.to_json(
    r'C:\Users\b1075161\Documents\Prodigy\prodogy_files\trial.jsonl',
    orient="records",
    lines=True
)
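If you want to double-check the result before loading it, a quick sanity check like the one below can help. This is only a sketch of my own and not Prodigy's actual validation logic. The missing_text_lines helper and the sample strings are hypothetical:

```python
import json

def missing_text_lines(jsonl_str):
    """Return 1-based line numbers of records that lack a string "text" field."""
    bad = []
    for lineno, line in enumerate(jsonl_str.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            bad.append(lineno)
    return bad

# What the export looks like without the rename (column 0 becomes key "0")
before_rename = '{"0":"Welcome to Prodigy!"}\n{"0":"I love playing baseball"}\n'
# What it looks like after renaming column 0 to "text"
after_rename = '{"text":"Welcome to Prodigy!"}\n{"text":"I love playing baseball"}\n'

print(missing_text_lines(before_rename))
print(missing_text_lines(after_rename))
```

An empty list for your exported file means every line has a "text" key.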
HI! I HAVE MY JOB.JSONL FILE AS SHOWN BELOW, AND I THINK IT IS NOT IN THE CORRECT FORM FOR JOB.JSONL:
{"0":"Designation","1":"CompanyName","2":"CompanyLocation","3":"JobSummary","4":"PostedDate","5":"Salary"}
{"0":"newSr. Backend Developer","1":"RED TECHNOLOGIES","2":"Charlotte, NC 28203 (Dilworth area)","3":"The Sr. Backend Developer is responsible for work in all stages of the development life cycle - reviewing business requirements, design, construction, testing,\u00e2\u0080\u00a6","4":"PostedToday","5":"None"}