After labeling 150 examples, the app freezes up. I have tried waiting for 30 minutes at a time before using ctrl-c to stop the app. The oddest part is that it always occurs at the 150th example. See screenshot below. See the second screenshot for CLI during the freeze. See screenshot 3 for CLI with PRODIGY_LOGGING=basic for debugging.
Additionally, now that I've labeled 150 examples three times, I would expect to have 450 total labels, but as you can see in the screenshot, only 390 survived the freezes.
Is there a solution or workaround for this problem?
I am using Windows 10 with Python 3.7, Prodigy v1.8.3, and spaCy 2.1.8.
32 GB RAM; 8-core i7 processor.
My data.jsonl file contains 100,000 sentences.
Thanks for the detailed report! This is interesting and might indicate that there are some examples in the data that potentially trip up the model, specifically the beam search used to retrieve all possible analyses for the given text. (This is also the reason ner.teach typically tries to segment text into sentences to make sure a single huge text doesn't cause problems.)
I think what the 150 means here is that the problem occurs in the 16th batch (assuming you're using the default batch size of 10). ner.teach will skip examples and pick the most relevant to ask you about – so even if you run the recipe twice, you may still end up in the same segment of the data around the 16th batch.
What's the data like and how long are the individual texts? Is it regular natural language, or quite specific with unusual formatting? It'd be interesting to find the potentially offending example. One thing you could do is find the last example you annotated in the raw data and then look at the next texts and see if there's anything in there that looks suspicious. For instance, a super long text that may be hard to segment.
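One way to hunt for suspicious examples is to rank the raw records by length. The snippet below is only a rough sketch: it assumes each line of the JSONL file is an object with a "text" key, and the helper name `longest_texts` is made up for illustration.

```python
import json


def longest_texts(lines, top_n=5):
    """Return (line_number, word_count) pairs for the top_n longest texts."""
    lengths = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: the text lives under the "text" key
        text = record.get("text", "")
        lengths.append((line_number, len(text.split())))
    # Longest first
    return sorted(lengths, key=lambda item: item[1], reverse=True)[:top_n]


# Tiny in-memory sample standing in for the real file
sample = [
    '{"text": "A short sentence."}',
    '{"text": "1% Fu Bar\\n5% Fu-in Fu Bar\\n###,###\\n###,###"}',
]
for line_number, word_count in longest_texts(sample):
    print(line_number, word_count)
```

To run it on the real data, pass an open file handle instead of the sample list, e.g. `with open("data.jsonl", encoding="utf8") as f: print(longest_texts(f))`.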
Prodigy typically keeps the 10 latest examples on the client and then sends the answers back in batches. So depending on where you are in the batch cycle, it's possible to lose the latest annotations. However, here's a tip for next time: If you don't close the browser, the examples are still on the client. So you can always restart the recipe and server on the command line and then hit "Save" manually once it's back up.
Hello Ines. Thank you for the quick response. The intuition you provided was very helpful, and your guess appears to be spot on. The sentences in my JSONL file are already parsed using nltk sent_tokenize, but because there are a lot of tables in these documents, when the text is extracted from these documents, the tabular format is replaced with "\t" and "\n" and the output of the tabular data does not look anything like complete sentences. So, despite already being sentence tokenized, some of the "sentences" are up to 2,000 words long and difficult to segment any further than what is already done by nltk. I could remove the tabular data from the text, but many of the dates that I am trying to identify using NER are found within the tabular data. So, I'm hesitant to take it out.
Here's an example of the format of the tabular data "sentences" in the JSONL file. I replaced the words with "fu bar" and the numbers with "####" due to data sensitivity. Also, this example is cut short. The actual "sentence" goes on like this for 1,446 words:
"#### ~> ####\n1% Fu Bar\n5% Fu-in Fu Bar\n#### Fu to #### Fu Bar Fu (BAR)\nFu Bar\nFu Bar (1)\nFu-Bar (2)\nFu\nFu Bar\nFu Bar Fu Bar Fu (Fu Bar)\nFUBAR #FU##BAR\n###,###\n###,###\n###,###\n##,###\n###,###"
I am very grateful to you for your last suggestion. In addition to saving otherwise lost labels, your suggestion to restart the recipe and hit "Save" once it's back up seems to have helped me get past the troublesome sentences. I am still having some problems with freezing, but now I have a workaround to get past those sentences and continue labeling. After 7 or 8 sessions I was getting to the point where it was freezing up after the first batch and every time I restarted the recipe it would show me the same 10 examples to relabel. Your suggestion has fixed this most frustrating part.
Any other suggestions or workarounds to reduce freezing while still using the tabular data "sentences" would be very much appreciated. For example, do you think that splitting all long sentences in my training data and real-world data at a specific number of characters would be a good way to avoid the freezing and still achieve comparable levels of accuracy? Also, I noticed that the documentation shows the use of the "en_core_web_sm" model. Is it likely that switching from the "en_core_web_md" to the "en_core_web_sm" model (or another model) would reduce freezing?
Thanks – glad I was able to help resolve the mystery!
Some more background on what's likely happening here: When you run ner.teach, Prodigy will essentially ask spaCy to produce all possible analyses of each text and their scores and then use those to pick the suggestions to ask you about. For regular sentences, this is no problem, because there are only so many analyses. However, for 1k+ tokens consisting of mostly random characters and numbers, the model will likely either predict nonsense or one of the several number-related entity types. So there are way too many possible analyses. (Also note that this is really only a problem in the beam search scenario where you're interested in all analyses – when you're just processing the text, the model should have no problem with texts like this.)
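To give a feel for how quickly the space of analyses grows with text length, here is a toy count of the ways to mark non-overlapping labeled spans over a text. This is not what spaCy or Prodigy computes (a beam search only keeps a limited number of candidates), and the function name and the label count used below are purely illustrative assumptions.

```python
def count_analyses(n_tokens, n_labels):
    """Count the ways to mark non-overlapping labeled spans over n_tokens tokens.

    Toy model only: every token is either outside an entity or part of
    exactly one span carrying one of n_labels labels.
    """
    previous = 1       # analyses for the empty prefix
    running_total = 1  # sum of counts for all shorter prefixes
    for _ in range(n_tokens):
        # Either the last token is outside any entity (previous), or it ends
        # a span of some length with some label (n_labels * running_total).
        current = previous + n_labels * running_total
        running_total += current
        previous = current
    return previous


# Compare a sentence-sized text with a table-sized one (4 labels is arbitrary)
for n_tokens in (20, 1000):
    digits = len(str(count_analyses(n_tokens, n_labels=4)))
    print(f"{n_tokens} tokens: a number with {digits} digits")
```

Even in this simplified model the count grows exponentially with the number of tokens, which is why a 1k+ token "sentence" behaves so differently from a regular one.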
Switching from en_core_web_md to en_core_web_sm is likely not going to make a difference. The larger models may be slightly more accurate overall, but that's accuracy on actual text. All models will likely produce all kinds of analyses for your garbled number examples, causing the recipe to hang.
Adding a simple rule that just filters out texts over a certain length could be an easy fix, yes. Maybe you could even come up with some other logic that lets you filter out the tabular data more specifically, for example the ratio of numbers to letters in the text, the number of newlines, etc.
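A filter along those lines could look roughly like the sketch below. The function names and all thresholds are placeholders I made up for illustration, not Prodigy settings, and the examples are assumed to be dicts with a "text" key. You'd want to tune the numbers against your own data.

```python
def looks_like_table(text, max_chars=1000, max_digit_ratio=0.3, max_newlines=10):
    """Heuristically flag texts that are probably extracted tables."""
    if len(text) > max_chars:
        return True  # very long "sentence"
    if text.count("\n") > max_newlines:
        return True  # many line breaks suggest table rows
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    total = digits + letters
    # A high share of digits among alphanumeric characters suggests numeric cells
    return total > 0 and digits / total > max_digit_ratio


def filter_stream(stream, **thresholds):
    """Yield only the examples that do not look like tabular data."""
    for example in stream:
        if not looks_like_table(example["text"], **thresholds):
            yield example


examples = [
    {"text": "The contract was signed on 12 March 2019."},
    {"text": "123,456\n789,012\n345,678\n901,234"},
    {"text": "Fu Bar\n" * 20},
]
kept = list(filter_stream(examples))
print([example["text"] for example in kept])
```

Because `filter_stream` is a generator, it could wrap the incoming examples before they reach the recipe. Note that the digit-ratio check won't fire on the redacted sample above, since the numbers there were replaced with "#".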
If you're dealing with a lot of noise in your data, you could also consider training a text classifier to detect the noise first and use it to filter the incoming texts (similar to the approach I describe in the FAQ video here).