Getting corpora into Prodigy

sdspieg · February 8, 2019, 12:09am

[Total noob here, as will become abundantly clear.] We are a small research team with a set of text corpora, each containing between a few hundred and tens of thousands of (mostly) academic books and articles on different research topics. We have them all stored in different libraries in Zotero, a bibliographical management tool with quite an active community as well.
Zotero, which sits on top of a sqlite db, has all bibliographical metadata (including a few text-based fields like the abstracts of the articles) AND - in many cases - also the full text of these books/articles in the full-text cache that Zotero creates when it indexes attached pdfs. That text is then stored, together with the pdf itself (or sometimes html, if that is the format of the article), in multiple ‘storage’ folders, each of which has a UID that is referenced in the sqlite db.
We now (sort of) know how to export those documents with their bibliographical data AND their full text. But we’d like to get instructions on the best and most efficient way to get all of these into Prodigy to start the training… Oh, and we work (mostly) in Windows.

ines · February 8, 2019, 1:20am

Hi! The good news is, Prodigy doesn't require you to import raw data before you can get started. When you run a recipe script on the command line, you can pass it the path to a file containing the texts you want to label. For example, if you want to label PERSON by hand, the command could look like this:

prodigy ner.manual person_dataset en_core_web_sm /path/to/data.jsonl --label PERSON

Prodigy can load several different file types (plain text, CSV, JSON, JSONL), but for your use case with large corpora and lots of metadata, I'd definitely recommend JSONL (newline-delimited JSON). Even outside of Prodigy and our stack, I think it's just a really convenient format. You get the flexibility of JSON and you can still stream your data in line by line.

At a minimum, each record in the JSONL needs to have a "text" key. But you can also include your own custom properties to store metadata with the examples. Those will be passed through and saved with the collected annotations, so you'll always have a reference to the original corpus. You can also find more details on the input formats in your PRODIGY_README.html.
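As a rough sketch of what such records could look like: the snippet below builds two examples with a "text" key plus some custom metadata and serializes each one to a single line. The field names `title`, `year` and `zotero_key` and all the values are made up for illustration; only the "text" key comes from the description above.

```python
import json

# Hypothetical records: "text" is the key described above,
# everything else is custom metadata invented for this example.
records = [
    {"text": "First abstract or full text.", "title": "Paper A", "year": 2017, "zotero_key": "ABCD1234"},
    {"text": "Second abstract or full text.", "title": "Paper B", "year": 2018, "zotero_key": "EFGH5678"},
]

def to_jsonl(records):
    """Serialize each record to one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)

jsonl = to_jsonl(records)
print(jsonl)
```

Because json.dumps escapes any newlines inside the text, each record stays on exactly one line, however long the document is.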

sdspieg · February 8, 2019, 1:50am

Thanks for that prompt reply. I was unaware of that format. Do you also have any recommendations for how to get our csv or json files into that new format? The PRODIGY_README.html explains the format and gives some examples (and yes, the csv files are a huge nightmare). But it does not recommend any tools. And when I looked into it further, I found three different versions:

Line-delimited JSON (LDJSON),
newline-delimited JSON (NDJSON), and
JSON lines (JSONL)

So how would you recommend we get our information into that jsonl-format?

Which of these do you recommend?

ines · February 8, 2019, 11:49am

Ultimately, JSONL is a file where each line contains a JSON object. So like this:

{"text": "Some text"}
{"text": "Some text"}

How you create that is up to you – but it should be pretty straightforward in most languages. For example, in Python, what’s needed is essentially this:

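import json

# lines is a list of dicts, e.g. [{"text": "Some text"}]
data = [json.dumps(line) for line in lines]
with open(some_file, 'w', encoding='utf8') as f:  # 'w' so the file is opened for writing
    f.write('\n'.join(data))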
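For CSV exports, the same idea could be sketched with the standard library's csv module. This is only an illustration under assumptions: the helper name `csv_to_jsonl` and the column names `title` and `abstract` are hypothetical, so swap in whatever columns your export actually has.

```python
import csv
import io
import json

def csv_to_jsonl(csv_file, jsonl_file, text_column="abstract"):
    """Write one JSON object per CSV row, copying text_column to "text"."""
    count = 0
    for row in csv.DictReader(csv_file):
        text = (row.pop(text_column, "") or "").strip()
        if not text:
            continue  # skip rows without any text to annotate
        record = {**row, "text": text}  # remaining columns become metadata
        jsonl_file.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count

# Small in-memory demo. Real files would be opened with
# open(path, newline="", encoding="utf-8") for reading the CSV
# and open(path, "w", encoding="utf-8") for writing the JSONL.
demo_csv = io.StringIO("title,abstract\nPaper A,Some text\nPaper B,\n")
demo_out = io.StringIO()
written = csv_to_jsonl(demo_csv, demo_out)
```

The rows are processed one at a time, so this should also work for large exports without loading everything into memory first.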

Prodigy also comes with utility functions for this that you can use if you like:

from prodigy.util import read_jsonl, write_jsonl

write_jsonl('/path/to/file.jsonl', lines)
lines = read_jsonl('/path/to/file')

There’s also the jsonlines library that has a bunch of JSONL-specific features:

https://jsonlines.readthedocs.io/en/latest/

sdspieg · February 8, 2019, 11:51am

Wow! You guys ARE great. We’re on it! Thanks, Ines.
