Getting corpora into Prodigy

usage
solved

(Stephan De Spiegeleire) #1

[Total noob here, as will become abundantly clear :) ]. We are a small research team with a set of text corpora, each containing between a few hundred and tens of thousands of (mostly) academic books and articles on different research topics. We have them all stored in different libraries in Zotero, a bibliographic management tool with quite an active community of its own.
Zotero, which sits on top of a SQLite db, has all the bibliographic metadata (including a few text-based fields like the abstract of the articles) AND, in many cases, also the full text of these books/articles in the full-text cache that Zotero creates when it indexes attached PDFs. That text is then stored, together with the PDF itself (or sometimes HTML, if that is the format of the article), in multiple ‘storage’ folders, each of which has a UID that is referenced in the SQLite db.
We now (sort of) know how to export those documents with their bibliographic data AND their full text. But we’d like instructions on the best and most efficient way to get all of these into Prodigy to start the training… Oh, and we work (mostly) in Windows.


(Ines Montani) #2

Hi! The good news is, Prodigy doesn’t require you to import raw data before you can get started. When you run a recipe script on the command line, you can pass it the path to a file containing the texts you want to label. For example, if you want to label PERSON by hand, the command could look like this:

prodigy ner.manual person_dataset en_core_web_sm /path/to/data.jsonl --label PERSON

Prodigy can load several different file types (plain text, CSV, JSON, JSONL), but for your use case with large corpora and lots of metadata, I’d definitely recommend JSONL (newline-delimited JSON). Even outside of Prodigy and our stack, I think it’s just a really convenient format. You get the flexibility of JSON and you can still stream your data in line by line.
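
To illustrate the "stream your data in line by line" point, here is a minimal sketch of a lazy JSONL reader in plain Python. The function name `stream_jsonl` is made up for this example and is not part of Prodigy. It only shows why the format suits large corpora: each line is parsed on its own, so the whole file never has to sit in memory.

```python
import json


def stream_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    # utf8 is given explicitly because the default encoding on Windows
    # is often not UTF-8.
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Because this is a generator, something like `for record in stream_jsonl(path): ...` starts processing after the first line has been read.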

At a minimum, each record in the JSONL needs to have a "text" key. But you can also include your own custom properties to store metadata with the examples. Those will be passed through and saved with the collected annotations, so you’ll always have a reference to the original corpus. You can also find more details on the input formats in your PRODIGY_README.html.
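
As a rough sketch of what such a record could look like when built in Python: the helper `make_record` and the metadata fields `title` and `zotero_key` are hypothetical names chosen for illustration. Only the "text" key comes from the description above.

```python
import json


def make_record(text, **metadata):
    """Build one record: the required "text" key plus custom properties."""
    record = {"text": text}
    record.update(metadata)
    return record


record = make_record(
    "Full text or abstract of one document.",
    title="An example article",  # hypothetical metadata field
    zotero_key="ABCD1234",       # hypothetical identifier
)

# One record becomes exactly one line of the JSONL file.
line = json.dumps(record, ensure_ascii=False)
```

Since `json.dumps` escapes any newlines inside the text, a multi-paragraph document still ends up on a single line of the file.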


(Stephan De Spiegeleire) #3

Thanks for that prompt reply. I was unaware of that format. Do you also have any recommendations for how to get our CSV or JSON files into that new format? The PRODIGY_README.html explains the format (and yes, the CSV files are a huge nightmare) and gives some examples, but it does not recommend any tools. And when I look further, I see three different variants:

  • Line-delimited JSON (LDJSON),
  • newline-delimited JSON (NDJSON), and
  • JSON lines (JSONL)

So how would you recommend we get our information into that jsonl-format?

Which of these do you recommend?


(Ines Montani) #4

Those three names all refer to essentially the same thing. Ultimately, JSONL is a file where each line contains a JSON object. So like this:

{"text": "Some text"}
{"text": "Some text"}

How you create that is up to you – but it should be pretty straightforward in most languages. For example, in Python, what’s needed is essentially this:

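import json

# Serialize each record (a dict) to a single line of JSON
data = [json.dumps(line) for line in lines]
with open(some_file, 'w', encoding='utf8') as f:  # 'w' opens for writing
    f.write('\n'.join(data))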
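
For the CSV case specifically, a conversion could look roughly like the sketch below. It uses only the standard library. The function name `csv_to_jsonl` and the default column name `abstract` are assumptions for illustration, so adjust them to whatever your export actually contains.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path, text_column="abstract"):
    """Write one JSON line per CSV row; return the number of records written."""
    count = 0
    with open(csv_path, newline="", encoding="utf8") as src, \
            open(jsonl_path, "w", encoding="utf8") as dst:
        for row in csv.DictReader(src):
            # Move the chosen column into "text"; keep the rest as metadata.
            text = (row.pop(text_column, "") or "").strip()
            if not text:
                continue  # skip rows without usable text
            record = {**row, "text": text}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

If the CSV cells hold whole books, the csv module's default field size limit may be too small. It can be raised with `csv.field_size_limit(...)` before reading.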

Prodigy also comes with utility functions for this that you can use if you like:

from prodigy.util import read_jsonl, write_jsonl

write_jsonl('/path/to/file.jsonl', lines)
lines = read_jsonl('/path/to/file.jsonl')

There’s also the jsonlines library that has a bunch of JSONL-specific features:

https://jsonlines.readthedocs.io/en/latest/


(Stephan De Spiegeleire) #5

Wow! You guys ARE great. We’re on it! :) Thanks, Ines