Create a dataset out of many .txt files (Best Practice)

Hi! The main reason is that JSON is just more flexible: it lets you attach meta information, represent nested data, different types of values (strings, integers, lists) and so on. A .txt file is just plain text, so once your data is more than raw text, it becomes hard to represent without writing your own conversion logic. JSON, on the other hand, is super standard, and you can load it in pretty much any common programming language.

For NLP specifically, .txt files that are read in line by line also make it more difficult to represent newlines. If your examples contain newlines and you don't want to split on them, you need to come up with a different convention for reading them in (maybe two newlines as a separator? but then you can't have examples that contain two newlines). In JSON, that's straightforward: `"text": "hello\nworld\n\nthis is text"`.

About the JSON vs. JSONL (newline-delimited JSON) distinction: one problem that JSON has is that you typically need to parse the whole file when you load it. That's inconvenient for large corpora because it means you need to load everything into memory and can't stream your data line-by-line. JSONL is basically just one JSON object per line, so it has the flexibility of JSON, and can be streamed in.
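To make that concrete, here's a minimal sketch of streaming a JSONL file with plain Python (srsly also has helpers for this, but the standard library is enough). The filename `corpus.jsonl` and the `"text"` key are just assumptions about how your data is laid out:

```python
import json

def read_jsonl(path):
    # Each line is a complete JSON object, so we can yield examples one
    # at a time instead of loading the whole corpus into memory.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical file with one record per line, e.g.
# {"text": "hello\nworld", "meta": {"source": "a.txt"}}
for example in read_jsonl("corpus.jsonl"):
    print(example["text"])
```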

It's definitely a good idea to do this properly from the beginning and set up your data with a dedicated test/dev set for evaluation :100: If you do this before annotation, you can just do it as a preprocessing step in your own script, using whatever logic fits your data.
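For example, a one-off split script could look something like this (a rough sketch reusing the `read_jsonl` helper from above; the 80/20 ratio, the seed and the file names are arbitrary choices, not something Prodigy requires):

```python
import json
import random

examples = list(read_jsonl("corpus.jsonl"))
random.seed(0)
random.shuffle(examples)

# Hold back 20% as a dedicated evaluation set
split = int(len(examples) * 0.8)
train, dev = examples[:split], examples[split:]

for path, data in [("train.jsonl", train), ("dev.jsonl", dev)]:
    with open(path, "w", encoding="utf-8") as f:
        for eg in data:
            f.write(json.dumps(eg) + "\n")
```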

(If you train with Prodigy's wrapper around spaCy later on, it does have a feature to automatically hold back some data for evaluation – but this is just intended for quick experiments if you don't yet have an evaluation set. We always recommend using a dedicated evaluation set if possible!)

Okay, so in the first step, you just want to accept all entities highlighted by your patterns and add them to your data, without reviewing them in the UI? In that case, you don't even need Prodigy yet and it'll probably be more efficient to do this directly with spaCy's Matcher or PhraseMatcher (which is also what Prodigy uses under the hood): Rule-based matching · spaCy Usage Documentation
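A rough sketch of what that could look like with the `PhraseMatcher` (the blank English pipeline, the `GPE` label and the example terms are just placeholders for your own setup):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
# One pattern doc per term you want to match exactly
matcher.add("GPE", [nlp.make_doc(term) for term in ["Berlin", "New York"]])

doc = nlp("She moved from Berlin to New York last year.")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, nlp.vocab.strings[match_id])
```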

Just make sure to filter your pattern matches so they don't overlap (because overlapping spans would be invalid for NER). spaCy has a utility for that, `filter_spans`: Top-level Functions · spaCy API Documentation
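For instance, `filter_spans` keeps the longest span when two candidates overlap. A small standalone example to show the behaviour:

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("New York City is in New York State.")

# Two overlapping candidates – only the longest one survives,
# which makes the result valid for NER annotations.
spans = [Span(doc, 0, 3, label="GPE"), Span(doc, 0, 2, label="GPE")]
print([(s.text, s.label_) for s in filter_spans(spans)])
# [('New York City', 'GPE')]
```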

You can then save the extracted entities in the following format, with a "start" (span.start_char), "end" (span.end_char) and "label" (span.label_) value: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP. This lets you load the data back into Prodigy later to correct it, have your domain experts add more labels, etc.
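Putting it together, each task could be written out as one JSONL line like this (a sketch that assumes the `doc` and the filtered `spans` from the matcher example above; `annotations.jsonl` is just a placeholder filename):

```python
import json

task = {
    "text": doc.text,
    "spans": [
        {"start": span.start_char, "end": span.end_char, "label": span.label_}
        for span in spans
    ],
}

with open("annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(task) + "\n")
```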
