Create a dataset out of many .txt documents (best practice)

I'm working on an NER task, similar to the one described here.

It's my first Prodigy project, so I have some "new user" questions.

My data:
I have about 1K-2K text documents (.txt format).
I read that it's better to use JSONL, but didn't fully get why.
I started by converting all documents into a single big JSONL file and extracted some metadata into the "meta" attribute.
I'll probably want to split the dataset into train/val/test later, and still keep track of each example's source file.
Should I just write my own script for that, or are there some train_test_split utils (like in the sklearn API)?
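For reference, this is roughly how I did the conversion (a simplified sketch; the `source` key inside "meta" is just my own naming):

```python
import json
from pathlib import Path

def txt_dir_to_jsonl(input_dir: str, output_file: str) -> int:
    """Convert a directory of .txt files into one JSONL file,
    keeping the source filename in the "meta" attribute."""
    paths = sorted(Path(input_dir).glob("*.txt"))
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            record = {
                "text": path.read_text(encoding="utf-8"),
                "meta": {"source": path.name},
            }
            # one JSON object per line = JSONL
            out.write(json.dumps(record) + "\n")
    return len(paths)
```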

Automatic labeling:
For now, I have 3 classes, and about 100 unique keywords which should always be labeled by their class.
So I planned to use patterns to automatic label all my documents.
I know I can use "ner.manual" with "--patterns" and just press accept again and again as fast as I can, but I'd rather use a shortcut, like implementing some "--accept-all" option. Any hints on how to do that would be wonderful :slight_smile:
(The labeling requires domain experts, so at this stage I just want to accept everything.)


Hi! The main reason is that JSON is just more flexible and lets you attach meta information, represent nested data, different types of values (strings, integers, lists) and so on. .txt is just plain text – so once your data is more than just plain text, it becomes harder to represent it as plain text (without needing your own conversion logic). JSON on the other hand is super standard, and you can load it in pretty much any common programming language.

For NLP specifically, .txt files that are read in line-by-line also make it more difficult to represent newlines. So if your examples contain newlines and you don't want to split them, you need to come up with a different way to read them in (maybe two newlines? but then you can't have examples with two newlines). In JSON, that's more straightforward: "text": "hello\nworld\n\nthis is text".

About the JSON vs. JSONL (newline-delimited JSON) distinction: one problem that JSON has is that you typically need to parse the whole file when you load it. That's inconvenient for large corpora because it means you need to load everything into memory and can't stream your data line-by-line. JSONL is basically just one JSON object per line, so it has the flexibility of JSON, and can be streamed in.
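To illustrate the streaming point: with JSONL you can process one record at a time without ever holding the whole corpus in memory. A minimal sketch using only the standard library:

```python
import json

def stream_jsonl(path):
    """Yield one parsed record per line - the file is never
    loaded into memory all at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Texts containing newlines are no problem, because \n is
# escaped inside the JSON string on each line:
# for record in stream_jsonl("corpus.jsonl"):
#     print(record["text"], record.get("meta"))
```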

It's definitely a good idea to do this properly from the beginning and set up your data with a dedicated test/dev set for evaluation :100: If you do this before annotation, you can just do it as a preprocessing step using your own logic in your own script.
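If you just need a simple shuffled split, a few lines of Python do the job. A sketch (the 80/10/10 ratios, file names and fixed seed are just examples):

```python
import json
import random

def split_jsonl(path, train_path, dev_path, test_path,
                ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle records and write train/dev/test JSONL files.
    Each record keeps its "meta" field, so the source file
    stays traceable after the split."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)  # fixed seed = reproducible split
    n = len(records)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    splits = {
        train_path: records[:n_train],
        dev_path: records[n_train:n_train + n_dev],
        test_path: records[n_train + n_dev:],
    }
    for out_path, subset in splits.items():
        with open(out_path, "w", encoding="utf-8") as out:
            for record in subset:
                out.write(json.dumps(record) + "\n")
```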

(If you train with Prodigy's wrapper around spaCy later on, it does have a feature to automatically hold back some data for evaluation – but this is just intended for quick experiments if you don't yet have an evaluation set. We always recommend using a dedicated evaluation set if possible!)

Okay, so in the first step, you just want to accept all entities highlighted by your patterns and add them to your data, without reviewing them in the UI? In that case, you don't even need Prodigy yet and it'll probably be more efficient to do this directly with spaCy's Matcher or PhraseMatcher (which is also what Prodigy uses under the hood):

Just make sure to filter your pattern matches so they don't overlap (that would be invalid for NER). spaCy has a utility for that: `spacy.util.filter_spans`.

You can then save the extracted entities in the following format, with "start" (span.start_char), "end" (span.end_char) and "label" (span.label_) values. This lets you load the data back into Prodigy later to correct it, add more labels with your domain experts, etc.
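Putting those pieces together, a sketch could look like this (the label names and keywords are made up for illustration):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# one entry per class, each with its list of keywords (made-up examples)
keywords = {
    "DRUG": ["aspirin", "ibuprofen"],
    "DISEASE": ["migraine"],
}
for label, terms in keywords.items():
    matcher.add(label, [nlp.make_doc(term) for term in terms])

def annotate(text):
    """Return one example in Prodigy's span format, with all
    pattern matches accepted as entity spans."""
    doc = nlp.make_doc(text)
    spans = [Span(doc, start, end, label=match_id)
             for match_id, start, end in matcher(doc)]
    spans = filter_spans(spans)  # drop overlaps - invalid for NER
    return {
        "text": doc.text,
        "spans": [
            {"start": span.start_char, "end": span.end_char, "label": span.label_}
            for span in spans
        ],
    }
```

You can write one `annotate(...)` dict per document to a JSONL file and load it back into Prodigy later.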


Thank you for that detailed answer! All clear now, and I implemented the suggestions easily.

  • JSONL: totally makes sense now. I love it! I used the Matcher and it was pretty easy to use!
  • I started by just using the default prodigy train ... --eval-split 0.2, but I will take your advice and see if I can use --eval-id instead :upside_down_face:

Haha, it's so awesome to get answers like these, so I'll try another one, which might be less for this channel and more about my specific project.
I'm a Master's student, took an NLP class, and I'm working on my final project.
I chose an interesting practical task (I hope to publish the dataset later for free usage), and for the labeling part I'm sure Prodigy is the best product right now.
But an academic work should include some "NLP research": training and evaluating different models and reporting all the findings in a paper. I planned to use the prodigy train interface, but I'm not sure yet whether that was a good idea.

So my question is: would you advise using Prodigy for the training and evaluation parts?

For now, I only ran (super easily!) those models; the printed results were informative, but I didn't find details of the model architectures, and I'm not sure it's going to be that easy to integrate with non-spaCy models. I also read yesterday that bold "spaCy is not research software" warning, so I started wondering if I made a non-ideal / wrong decision.
Also, my project is only small-to-medium, and I'll finish the (first version of the) project this month, so I'm not planning to do anything super complicated.

Lastly, thanks again for the awesome packages, video-tutorials, docs and answers (Beside school, I'm a developer and It's definitely one of the best user-experience I had with such software products)!

Ah cool! Dataset work is often underappreciated so it's cool to see you're taking this on – definitely keep us updated on the progress :blush:

If your goal is to publish a paper on your work, you probably want to train in a more "standard" way, for example, using spaCy directly. Prodigy's train command is mostly a wrapper around spaCy and its goal is to make it easy to run quick experiments from your datasets without leaving Prodigy. But once you're serious about training, you usually want to train with the ML/NLP library directly. That also makes it much easier to explain what's going on, and will allow others to easily reproduce your results.

Creating your data with Prodigy sounds like a great plan, though – there have been several papers that published datasets created with Prodigy, and we've started collecting some of them on the forum: Topics tagged paper

The model's config includes all details on the components that were used and their model implementations. You can read more about the different model architectures here. There will also be a paper on spaCy v3 soon that you can cite.
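For example, you can inspect the resolved config of any pipeline programmatically. A sketch with a blank pipeline (a trained model's config.cfg will contain the full training details as well):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")

# The config records which registered architecture each component uses,
# under the "@architectures" key of the component's model block.
ner_cfg = nlp.config["components"]["ner"]
print(ner_cfg["model"]["@architectures"])
```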

What we mean by that is: spaCy is primarily designed for production use, to get things done and to ship applied NLP pipelines. So if your goal is to research different NLP algorithms, it might not be the best choice of library. However, it doesn't mean that researchers don't/can't/shouldn't use spaCy :sweat_smile: If you're working on a practical problem and you want to make the result easy for people to run and use, publishing it as a spaCy component, pipeline or project template is probably a great idea.


Thanks for the great answers, as usual :slight_smile:

If (or when, I hope haha) my project grows into a public paper, I would love to cite Prodigy and spaCy, and publish my project template including everything needed to reproduce it easily!
BTW, I have also added some "by-word" evaluation instead of the default "by-entity" -> it was useful for my pipeline, which involves pattern-based labeling, and it might serve other people as well.
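To show what I mean by "by-word": score overlap per position instead of requiring exact entity boundaries. A simplified sketch of my own helper (not part of spaCy or Prodigy; it compares character positions, which approximates word-level scoring):

```python
def token_level_scores(gold_spans, pred_spans):
    """Compare annotations position-by-position rather than per exact
    entity span. Spans are (start, end, label) character-offset tuples;
    each span is expanded to its covered (position, label) pairs."""
    def to_positions(spans):
        return {(i, label) for start, end, label in spans
                for i in range(start, end)}
    gold = to_positions(gold_spans)
    pred = to_positions(pred_spans)
    tp = len(gold & pred)  # positions labeled correctly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Unlike strict entity-level F1, a partially overlapping prediction still gets partial credit here, which felt fairer for pattern-based labels with slightly-off boundaries.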

For a quick start haha, indeed it was easiest for me to follow the QuickStart instructions.
I chose the default suggestions from the widget:

  1. CPU (NER only) composed of "tok2vec","ner"
  2. GPU (NER only) composed of "transformer","ner"

They both yield pretty impressive results for a quick start! (Honestly, they score better than any model evaluated for our NER domain in the literature... but those papers' datasets are private, so it's hard to tell whether the models are strong or our dataset is easier.)

My next steps are:

  • Understanding those default models better. As you suggested, I started by reading my models' config.cfg and reading on for further description. But I wonder:
    - Is there some paper that explains those specific default models in depth? I describe them in my academic document, so it would be great to have a reference.
    - In particular, has someone already reviewed their strengths and weaknesses compared to other models?
    - If published online, when presenting scores of those models "as is", is the spaCy citation enough, or is there another, more specific citation to add?
  • Trying out new models
    In the literature for my NER domain, I read about architectures of specific rule-based systems (which I don't find interesting right now), CRF with embeddings, CRF without embeddings, LSTMs, and usage of both word2vec and transformers. None of the researchers published their code, so I wanted to hear your suggestions:
    - How do you think it would be easiest to implement more architectures (besides the two quickstart ones)?
    - Should I stick with the neat config.cfg? I did not find any examples online except the quickstart ones, and it's not intuitive for me to just edit it directly.
    - Would it be easier and faster to switch to another lib for that, and wrap back into spaCy only when I want to publish for other people?