NER from user-generated content (spelling mistakes etc.)


Have you already worked with a dataset of user-generated content, i.e. data that contains a lot of spelling mistakes? If so, does it make sense to use a seed list containing not only correctly spelled entities but also misspelled ones?

How long should such a seed list be? In your videos, the seed lists had just a handful of terms, but is that enough when the named entities are quite diverse?

What I’m worrying about as well are entities that comprise more than one word, often two or three or even more words (e.g. noun phrase consisting of a head noun plus a prepositional phrase). Would you rather leave those entities out completely and not annotate them at all since they are so long and difficult to learn and predict? Or should I annotate them and hope that the model somehow learns them, even if they are so diverse? Should I include them in the seed list?

Thanks in advance :slight_smile:

What I forgot to ask: When you want to train a model for two named entity categories, should you make two separate seed lists? Or how do you make sure that Prodigy understands that some terms are seeds for entity A and the others are seeds for entity B?

Hi, thanks for your detailed questions and sorry for the delay on getting back to you! (I was in Edinburgh for EuroPython and only got back this weekend.) Answers below:

In general, yes – if you’re dealing with user-generated content, it’ll definitely be more efficient to cover as many spelling variations as possible (especially the most common ones). This is also the reason we use word vectors and terms.teach to extend our seed lists with other similar terms, including common misspellings. I actually found this part really helpful and interesting, because most of those misspellings were things I would have never thought of myself! Based on the dataset created with terms.teach, you can then create your patterns file using the terms.to-patterns recipe.
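As a rough illustration of what "covering spelling variations" means in practice: besides vector similarity (which is what terms.teach uses), simple edit-distance matching can catch surface-level typos of your seed terms. Here's a stdlib-only sketch – the vocabulary and seed list are made up for the example:

```python
import difflib

# Hypothetical vocabulary extracted from user-generated text
vocab = ["fieber", "fiber", "fieher", "husten", "kopfschmerzen", "aspirin"]
seeds = ["fieber"]

# For each seed, find vocabulary items within a fuzzy-match cutoff –
# these candidates often include common misspellings
variants = {
    seed: difflib.get_close_matches(seed, vocab, n=5, cutoff=0.75)
    for seed in seeds
}
print(variants)  # the matches for "fieber" include "fiber" and "fieher"
```

The candidates you find this way can then go into your seed list or patterns file alongside the correctly spelled terms.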

For German, spaCy doesn’t ship with pre-trained word vectors (yet) – but you can use the FastText vectors and add them to a spaCy model, or train your own using Gensim and Word2Vec. This also leads over to the next question…
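Once you have vectors (whether FastText or your own Word2Vec model), "most similar terms" just means ranking the vocabulary by cosine similarity to your query word. Here's a toy sketch of that ranking step with made-up 3-dimensional vectors standing in for real ones:

```python
import math

# Toy 3-d vectors standing in for real FastText/Word2Vec vectors
vectors = {
    "fieber":  [0.9, 0.1, 0.0],
    "fiber":   [0.8, 0.2, 0.1],   # misspelling, close in vector space
    "aspirin": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, n=2):
    # Rank all other words by cosine similarity to the query word
    return sorted(
        ((w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word),
        key=lambda t: t[1],
        reverse=True,
    )[:n]

print(most_similar("fieber"))  # "fiber" ranks above "aspirin"
```

With real vectors you'd of course use the library's own lookup (e.g. Gensim's `most_similar` or spaCy's `Token.similarity`) rather than rolling your own.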

By default, word vectors are typically trained on single tokens, so it’s difficult to use them out of the box to find similar multi-word phrases. German is a little nicer here, because you end up with more specific nouns (“Baumwollfieber” vs. “brown” + “lung” + “disease” – sorry, I just googled for a random Fieber :wink:). This is also the reason why @honnibal usually advocates for a more German-like tokenization scheme in English that merges noun chunks by default.

If it turns out that you need more specific tokens for more specific noun phrases, you can pre-process your corpus using spaCy and merge all noun phrases, using a similar logic to what you’d expect for your entities. This is also part of the trick we use in sense2vec (demo here in case you haven’t seen it yet).
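Conceptually, this merge step collapses each noun-chunk span into a single token. In spaCy you'd get the spans from `doc.noun_chunks` and merge them with the retokenizer; here's a plain-Python sketch of the same idea with hardcoded chunk spans, just to show what the transformation does:

```python
tokens = ["Ich", "habe", "hohes", "Fieber", "und", "starke", "Kopfschmerzen"]
# Hypothetical noun-chunk spans (start, end) as doc.noun_chunks might yield them
chunks = [(2, 4), (5, 7)]

def merge_spans(tokens, spans):
    # Collapse each (start, end) span into one whitespace-joined token
    merged, i = [], 0
    starts = {start: end for start, end in spans}
    while i < len(tokens):
        if i in starts:
            merged.append(" ".join(tokens[i:starts[i]]))
            i = starts[i]
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_spans(tokens, chunks))
# ['Ich', 'habe', 'hohes Fieber', 'und', 'starke Kopfschmerzen']
```

Training vectors on the output of this step is what lets you look up similarities for whole phrases like “hohes Fieber” instead of only single words.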

Training vectors on merged tokens can be very useful to find more similar expressions that you missed in your patterns. It’s also a good sanity check – if the most similar results don’t turn up anything useful in the vectors, it’s often a good indicator that your intended NER classification scheme won’t work very well either. NER works best for clearly defined categories of “things” that occur in similar contexts.

If you find that your entities are too complex, it’s often a better approach to focus on a simpler definition and then extend the entity span using rules and/or the dependency parse. For example, instead of trying to teach the model that “hohes Fieber” is an entity, you could focus on predicting “Fieber” and then extend the entities to include all adjectives attached to it. This thread and this thread both have some more details and examples of this approach.

In ner.teach, you’d normally provide the seed terms or entity patterns via a patterns.jsonl file, which lets you specify a label:

{"label": "DISEASE", "pattern": [{"lower": "fieber"}]}
{"label": "MEDICATION", "pattern": [{"lower": "aspirin"}]}

This is used by Prodigy to identify which label the pattern refers to. You can use one large patterns file containing all labels if you like, even if you’re only training one label at a time. For example, if you set --label DISEASE on ner.teach, Prodigy should only use the patterns for that label, even if your file contains other patterns as well.
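If you want to sanity-check that logic yourself, a patterns.jsonl file is just one JSON object per line, so filtering it by label is straightforward. This is only an illustrative sketch, not Prodigy's internal code:

```python
import json

# Contents of a hypothetical patterns.jsonl file with two labels
patterns_jsonl = """\
{"label": "DISEASE", "pattern": [{"lower": "fieber"}]}
{"label": "MEDICATION", "pattern": [{"lower": "aspirin"}]}
"""

# Parse one JSON object per line, then keep only the requested label –
# conceptually what happens when you pass --label DISEASE
patterns = [json.loads(line) for line in patterns_jsonl.splitlines()]
disease_patterns = [p for p in patterns if p["label"] == "DISEASE"]
print(disease_patterns)
```

So one big file with all your labels is fine – the label field keeps the patterns for entity A and entity B cleanly separated.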

Hi Ines,

thank you for your detailed reply and I hope EuroPython was good, it sounds so exciting! :blush:

This approach sounds reasonable. What I’m trying to teach the model is to predict a specific "disease (or rather a health problem)" like “intestine problems” and “problems with the intestine” but not “problems” as in “I have problems.” So I only want to recognize specific problems that are either specified by another noun or by a prepositional phrase.
Do you think that this is feasible? And how shall I start then?

If I focused on predicting “problems” every time it occurs, the model would later have to unlearn this and recognise that “problems” without any specification is NOT a disease. I’d imagine this to be rather difficult, or is it not?

Would spaCy be able to merge phrases like “Magen Darm Probleme” as well? On the surface this looks like three nouns, but actually it’s just one noun phrase (spelt correctly, you’d have hyphens between the nouns, like so: “Magen-Darm-Probleme”).

Yes, I think in your case, it definitely makes more sense to focus on the entities that are specific to your domain, like diseases or even body parts / organs. Those are both categories your model could learn to predict – and once you have them, you can add various rules on top of it, even for other things you want to analyse later (medication, treatment, sentiment etc.). For instance, if you know that intestines are a BODY_PART (or however else you want to label it), you can then look at the dependency parse and the surrounding words and find out if it's used in relation to a problem. Or if it's connected to a DISEASE entity, or mentioned in the context of a MEDICATION entity, etc.
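Here's a sketch of what such a rule could look like: starting from a BODY_PART entity, walk up the dependency heads and check whether you hit a "problem" trigger noun. The parse structure and trigger list are made up for the example (in spaCy you'd read `token.head` and `doc.ents`):

```python
# Hypothetical parse of "Ich habe Probleme mit dem Darm";
# "head" is the index of each token's syntactic head
tokens = [
    {"text": "Ich",      "head": 1},
    {"text": "habe",     "head": 1},
    {"text": "Probleme", "head": 1},
    {"text": "mit",      "head": 2},
    {"text": "dem",      "head": 5},
    {"text": "Darm",     "head": 3},  # predicted BODY_PART entity
]
PROBLEM_TRIGGERS = {"Probleme", "Beschwerden", "Schmerzen"}

def body_part_has_problem(tokens, i):
    # Walk up the head chain from the entity, looking for a trigger noun;
    # "seen" guards against cycles (the ROOT is its own head)
    seen = set()
    while i not in seen:
        seen.add(i)
        if tokens[i]["text"] in PROBLEM_TRIGGERS:
            return True
        i = tokens[i]["head"]
    return False

print(body_part_has_problem(tokens, 5))  # True: "Darm" attaches via "mit" to "Probleme"
```

This way, the statistical model only needs to predict the well-defined BODY_PART category, and the "is it a health problem?" decision stays a transparent, adjustable rule.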

Sure! I just tested it using the small German model via our displaCy visualizer and here are two parses – one without merging, and the other one with the "merge phrases" option enabled (that basically just merges all noun chunks in doc.noun_chunks):


It's always important to keep in mind that the German model was trained on general news text, so it might not always perform perfectly on user-generated text with typos out-of-the-box. But this is something you can fix and fine-tune using Prodigy and the pos.teach and dep.teach recipes. Even a few hundred annotations on your text can already make a difference. If you can increase part-of-speech tag accuracy on your texts, you'll also be able to extract and merge noun phrases more efficiently.

This sounds like a good idea, but would probably be a bit too much for my thesis :grin: But we might follow that direction later on! So thanks again, Ines, for your ideas!