Effect of split_sents_threshold on NER training accuracy

Hello team!

I couldn’t find any relevant information on this issue.
I’m working on entity detection in very noisy, long text; the entities I’m extracting (ingredients) are often split in two by newline characters. I’ve been annotating for the NER task mostly with the make-gold recipe so far. When training (batch-train) on the same dataset, I’ve had very different results by changing only the pre-processing option “split_sents_threshold”.

With the default “split_sents_threshold”: top accuracy = 0.55
With “split_sents_threshold” = 200: top accuracy = 0.83
With “split_sents_threshold” = 2000: top accuracy = 0.44

Questions:

  • Why does varying split_sents_threshold produce such different results?
  • How can I easily find its optimal value?
  • Do I need to pre-process the text in spaCy with a special option to match the training option?
  • Do you have recommendations for dealing with this kind of text input?

Extract sample text:

Threonine 1,569
Tryptophan
Tyrosine
688
1,260
372
395
18
Supplement Facts
Ingredients:Protein Blend (Ultra-low
Serving Size: 1 Scoop (249)
Temp Whey Protein Isolate Whey Protein
Servings Per Container: 30
Concentrate), Fructooligosaccharides
Xanthan Gum, Natural Vanilla, Steve
Amount Per % Daily
Serving Value
Calories
86
This product is not manufactured in a plantiem
Calories from Fat 0
Contains Whey Protein Isolate (milka
Total Fat
Whey Protein Concentrate (milk
0.00%
Saturated Fat 09 0.00%
Ceautamed Worldwide, LLC
Trans Fat
Og 0.00%
Boca Raton, FL 33487 866-409-4252

Thanks!

Your text is pretty different from the paragraph text the models were designed for, so I think you probably need to put a bit more effort into the pre-processing. I suspect rule-based work will go a long way for you, especially based on word lists. Most word types will either be an ingredient or not an ingredient regardless of context. If you can figure that out at the word-type level, you’ll be making the task a lot easier than it will be at the token level, because the model won’t be able to make that assumption, and can easily get misled.
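For example (a rough sketch, not something tested on your data), a word-list pass could be as simple as a PhraseMatcher over a curated ingredient vocabulary. The terms below are just placeholders, and this assumes the spaCy v3 matcher API:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Placeholder word list -- in practice this would be a curated ingredient vocabulary
INGREDIENT_TERMS = [
    "whey protein isolate",
    "whey protein concentrate",
    "xanthan gum",
    "fructooligosaccharides",
]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("INGREDIENT", [nlp(term) for term in INGREDIENT_TERMS])

doc = nlp("Temp Whey Protein Isolate Whey Protein Concentrate), Fructooligosaccharides")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```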

I don’t think you should rely on the built-in split_sentences logic at all. Your data isn’t sentences, so the heuristics we use to segment paragraphs into sentences aren’t going to work well.
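If the label lines are your real units, you can segment on newlines yourself before anything touches the model. A minimal sketch:

```python
def lines_as_examples(text):
    # Treat each non-empty label line as its own example,
    # instead of relying on sentence-boundary heuristics.
    for line in text.split("\n"):
        line = line.strip()
        if line:
            yield {"text": line}

raw = "Threonine 1,569\nTryptophan\nTyrosine\n688"
examples = list(lines_as_examples(raw))
# [{'text': 'Threonine 1,569'}, {'text': 'Tryptophan'}, ...]
```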

Here’s a quick suggestion that may or may not work. It looks like your data mostly has newlines separating logical lines on the label, but some newlines are not significant, and you’d rather replace them with spaces. You could train a pre-processing model that asks, for each newline, whether it’s a good segmenter or a bad one. This model should be pretty quick to train. Once you have it, you can run it over all your data in a batch process to clean up the segmentation. Next, you might want to do a quick verification step where you look at whole labels and mark whether they have any segmentation errors. You can then go back over those examples and fix them up.
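To give a rough idea of what that newline classifier could look like, here’s a sketch using scikit-learn with a few hand-rolled context features. The features and the two training pairs are made-up placeholders; you’d want real annotated newlines and probably richer features:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def newline_features(before, after):
    # Simple features describing the text on either side of a newline
    return {
        "before_ends_paren": before.rstrip().endswith(")"),
        "before_ends_comma": before.rstrip().endswith(","),
        "after_starts_upper": after.lstrip()[:1].isupper(),
        "after_starts_digit": after.lstrip()[:1].isdigit(),
        "before_len": len(before.strip()),
    }

# Placeholder data: (text before newline, text after, is the newline a good segmenter?)
train = [
    ("Servings Per Container: 30", "Concentrate), Fructooligosaccharides", True),
    ("Ingredients:Protein Blend (Ultra-low", "Temp Whey Protein Isolate", False),
]

vec = DictVectorizer()
X = vec.fit_transform([newline_features(b, a) for b, a, _ in train])
y = [keep for _, _, keep in train]
clf = LogisticRegression().fit(X, y)

# Predict whether a new newline should be kept or replaced with a space
feats = newline_features("Xanthan Gum, Natural Vanilla,", "Stevia Extract")
print(clf.predict(vec.transform([feats])))
```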

Once you’ve got the formatting problem solved, it should be much easier to write rules to extract the information. You can still play the same trick again: learn a minimal category that lets you parse the data into a more meaningful structure. For instance, you could try assigning a type to each line: is it an ingredient line, a section separator like “Amount Per % Daily”, or a miscellaneous statement like “This product is not manufactured in a plant”.
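A rule-based line typer along these lines could be enough to start with; the categories and patterns here are illustrative guesses, not a tested scheme:

```python
import re

SECTION_PAT = re.compile(r"Supplement Facts|Amount Per|% Daily|Serving Size|Servings Per", re.I)
NUTRITION_PAT = re.compile(r"\d|%")

def line_type(line):
    # Assign each label line a coarse type to recover structure
    if SECTION_PAT.search(line):
        return "SECTION"
    if NUTRITION_PAT.search(line):
        return "NUTRITION"
    return "INGREDIENT_OR_MISC"

for line in ["Supplement Facts", "Calories from Fat 0", "Xanthan Gum, Natural Vanilla"]:
    print(line_type(line), "->", line)
```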

Thanks Matt for the ideas, much appreciated. I’ll explore the suggestions.
To provide more details: a simple rule-based approach on cleaner data has shown lower performance than NER, because some ingredients in the text come from the product’s marketing copy or a suggested recipe, while what I want to extract is the ingredient list of the label in question, so context and surrounding tokens matter a lot. Newline characters and uppercase text matter as well, because some lines contain mixed data (ingredient plus nutritional fact, e.g. the line “CORN STARCH 250mg Protein 100% ADJ” should yield only “CORN STARCH” as an ingredient).
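For the mixed-data lines, I can catch the easy cases with a heuristic like the sketch below (take the leading run of all-caps tokens and stop at the first quantity-like token), but it doesn’t capture the context-dependence described above, which is why I’m leaning on NER:

```python
import re

# Quantity-like tokens such as "250mg", "100%", "18"
QUANTITY = re.compile(r"^\d+(\.\d+)?(mg|g|%|kcal)?$", re.I)

def leading_ingredient(line):
    tokens = []
    for tok in line.split():
        if QUANTITY.match(tok) or not tok.isupper():
            break
        tokens.append(tok)
    return " ".join(tokens)

print(leading_ingredient("CORN STARCH 250mg Protein 100% ADJ"))  # CORN STARCH
```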