I am facing an issue with Prodigy. I am training on a dataset containing 871725 sentences, each containing 20-30 words on average.
I am using Python 3.6.5 (Anaconda) on a Mac. Prodigy takes more than an hour to start learning about one custom entity type when I run the ner.teach command.
Hi! Could you elaborate on this? What exactly is taking an hour? Collecting the annotations using ner.teach? Or training a model on the dataset using ner.batch-train? Or just starting the Prodigy server?
Hi, it takes an hour when I use ner.teach to collect the annotations. When I run the command, it takes about an hour before I finally see the following on my prompt:
✨ Starting the web server at http://localhost:8080 ...
Ohhh, okay – thanks for the clarification. I think I know what the problem is.
What format does your dataset have and are the sentences all one string? When you load in the data from a text file, Prodigy will try to stream it in line by line and split each text into individual sentences. This is usually quite fast, because it can be processed as a stream – but if the first (and only) line consists of 800k sentences, Prodigy has to read it all in, process it all with spaCy and split it into sentences before you can get started. This needs a lot of memory, which your machine likely doesn’t have.
So if that’s what’s going on, try to provide your texts in a format that can be read in line-by-line, e.g. .txt or JSONL (newline-delimited JSON) with one sentence or paragraph per line.
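For example, here's a minimal preprocessing sketch that splits one big text blob into a JSONL file with one sentence per line. It uses a naive regex split for illustration; spaCy's sentencizer would be more robust, and the file names are just placeholders:

```python
import json
import re

def to_jsonl(raw_text, out_path):
    """Split a blob of text into sentences and write one JSON
    record per line, ready to stream into Prodigy."""
    # Naive split on sentence-ending punctuation; swap in spaCy's
    # sentencizer for anything beyond a quick experiment.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    with open(out_path, "w", encoding="utf-8") as f:
        for sent in sentences:
            if sent:
                f.write(json.dumps({"text": sent}) + "\n")

# Example: two sentences in, two JSONL records out.
to_jsonl("First sentence. Second sentence!", "corpus.jsonl")
```

Because each record sits on its own line, Prodigy can read the file lazily instead of loading everything into memory up front.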
Could you share the exact command you ran? And if you set the environment variable PRODIGY_LOGGING=basic when you run ner.teach, is there anything in the log that looks suspicious? Does it get stuck, does it keep looping over something?
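For reference, a typical invocation with logging enabled might look like this (the dataset, model and file names here are placeholders, not your actual setup):

```
PRODIGY_LOGGING=basic prodigy ner.teach my_dataset en_core_web_sm corpus.jsonl --label outboard --patterns patterns.jsonl
```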
One potential problem I see here: You’re only annotating one label, outboard, which is a new label that the existing model knows nothing about yet. So in the beginning, all suggestions you see will be based on the patterns. If Prodigy can only find very few matches in your data, it might take a while to generate the first batch of 10 suggestions.
So as a solution, you could either use a more specific set of texts that will produce more matches, or adjust the patterns to make sure you can start off with enough examples. If you already have some annotations for your outboard label, you could also pre-train the model so that it at least predicts something for that category, which you can then accept or reject to get over the cold start.
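If you're unsure how to broaden the patterns: a patterns file is just JSONL where each line has a label and a spaCy Matcher pattern (a list of token specs). Here's a sketch that writes one from a list of seed phrases; the seed terms are made-up examples, not suggestions for your actual data:

```python
import json

# Hypothetical seed terms; replace with phrases from your own texts.
seed_terms = ["outboard", "outboard motor", "trolling motor"]

with open("patterns.jsonl", "w", encoding="utf-8") as f:
    for term in seed_terms:
        # One Matcher pattern per line: one token spec per word,
        # matched case-insensitively via the "lower" attribute.
        pattern = [{"lower": tok} for tok in term.split()]
        f.write(json.dumps({"label": "outboard", "pattern": pattern}) + "\n")
```

The more of these patterns actually occur in your texts, the sooner ner.teach can fill its first batch of suggestions.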