NER from user-generated content (spelling mistakes etc.)

Hi, thanks for your detailed questions and sorry for the delay in getting back to you! (I was in Edinburgh for EuroPython and only got back this weekend.) Answers below:

In general, yes – if you're dealing with user-generated content, it'll definitely be more efficient to cover as many spelling variations as possible (especially the most common ones). This is also the reason we use word vectors and terms.teach to extend our seed lists with other similar terms, including common misspellings. I actually found this part really helpful and interesting, because most of those misspellings were things I would never have thought of myself! Based on the dataset created with terms.teach, you can then create your patterns file using terms.to-patterns.
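Just to illustrate what's going on under the hood, here's a rough sketch of querying a vectors model for a seed term's nearest neighbours – terms.teach does this interactively and much more conveniently, and the model name here is just a placeholder:

```python
# Rough sketch: query a vectors model for a seed term's nearest neighbours.
# "de_vectors_model" is a placeholder – any spaCy pipeline with word vectors works.
import spacy

nlp = spacy.load("de_vectors_model")
query = nlp.vocab["fieber"].vector.reshape(1, -1)
# Vectors.most_similar returns a (keys, best_rows, scores) tuple
keys, _, scores = nlp.vocab.vectors.most_similar(query, n=20)
for key, score in zip(keys[0], scores[0]):
    # neighbours often include common misspellings and related terms
    print(nlp.vocab.strings[int(key)], score)
```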

For German, spaCy doesn't ship with pre-trained word vectors (yet) – but you can use the FastText vectors and add them to a spaCy model, or train your own using Gensim and Word2Vec. This also leads into the next question...
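In case it's helpful, here's roughly what adding the FastText vectors to a blank German model could look like – a sketch, assuming you've downloaded one of the .vec archives (the file name is just an example):

```python
# Sketch: read FastText .vec vectors into a blank German pipeline.
# The file name is an example – use whichever FastText archive you downloaded.
import numpy
import spacy

nlp = spacy.blank("de")
with open("cc.de.300.vec", encoding="utf8") as file_:
    file_.readline()  # skip the header line: "<number of vectors> <dimensions>"
    for line in file_:
        pieces = line.rstrip().split(" ")
        word = pieces[0]
        vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
        nlp.vocab.set_vector(word, vector)
nlp.to_disk("./de_vectors_model")  # load it back later via spacy.load
```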

Word vectors are typically trained for single tokens, so it's difficult to use them to find similar multi-word phrases out of the box. German is a little nicer here, because you end up with more specific nouns ("Baumwollfieber" vs. "brown" + "lung" + "disease" – sorry, I just googled for a random Fieber :wink:). This is also the reason @honnibal usually advocates for a more German-like tokenization scheme in English that merges noun chunks by default.

If it turns out that you need more specific tokens for more specific noun phrases, you can pre-process your corpus using spaCy and merge all noun phrases, using a similar logic to what you'd expect for your entities. This is also part of the trick we use in sense2vec (demo here in case you haven't seen it yet).
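A minimal sketch of that pre-processing step could look like this (assumes a German pipeline with a parser, so doc.noun_chunks is available – the model name is just an example):

```python
# Sketch: merge noun chunks into single tokens before training vectors.
# "de_core_news_sm" is just an example – any German pipeline with a parser works.
import spacy

nlp = spacy.load("de_core_news_sm")

def merge_noun_chunks(text):
    doc = nlp(text)
    with doc.retokenize() as retokenizer:
        for chunk in doc.noun_chunks:
            retokenizer.merge(chunk)  # merges are applied when the block exits
    # join multi-word tokens with "_" so they survive whitespace tokenization
    return [token.text.replace(" ", "_") for token in doc]

print(merge_noun_chunks("Der Patient hat hohes Fieber."))
# e.g. ['Der_Patient', 'hat', 'hohes_Fieber', '.']
```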

Training vectors on merged tokens can be very useful for finding similar expressions you might have missed in your patterns. It's also a good sanity check – if querying the vectors for the most similar entries doesn't turn up anything useful, that's often a good indicator that your intended NER classification scheme won't work very well either. NER works best for clearly defined categories of "things" that occur in similar contexts.
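For example, a rough sketch of that step with Gensim (note that Gensim 4 calls the keyword argument vector_size, while older versions call it size):

```python
# Sketch: train Word2Vec on the merged corpus and sanity-check the neighbours.
from gensim.models import Word2Vec

# in practice, merged_corpus comes from running the merging step over your data;
# a couple of toy sentences here just so the sketch runs end-to-end
merged_corpus = [
    ["Der_Patient", "hat", "hohes_Fieber", "."],
    ["hohes_Fieber", "und", "Husten", "seit", "gestern", "."],
]
model = Word2Vec(merged_corpus, vector_size=300, window=5, min_count=1)
# if the most similar entries don't look useful on real data, that's a warning sign
print(model.wv.most_similar("hohes_Fieber"))
```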

If you find that your entities are too complex, it's often a better approach to focus on a simpler definition and then extend the entity span using rules and/or the dependency parse. For example, instead of trying to teach the model that "hohes Fieber" is an entity, you could focus on predicting "Fieber" and then extend the entities to include all adjectives attached to it. This thread and this thread both have some more details and examples of this approach.
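A minimal sketch of that second step – extending predicted entities over attached adjectives – could look like this (it checks pos_ rather than German-specific dependency labels, just to keep the example simple):

```python
# Sketch: extend predicted entities over attached adjectives, so a predicted
# "Fieber" becomes "hohes Fieber". Assumes a pipeline with a parser and your NER.
import spacy
from spacy.tokens import Span

nlp = spacy.load("de_core_news_sm")  # example – swap in your trained pipeline

def extend_entities(doc):
    new_ents = []
    for ent in doc.ents:
        start = ent.start
        for child in ent.root.children:
            # pull in adjectival modifiers sitting directly to the left
            if child.pos_ == "ADJ" and child.i < start:
                start = child.i
        new_ents.append(Span(doc, start, ent.end, label=ent.label))
    doc.ents = new_ents
    return doc
```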

In ner.teach, you'd normally provide the seed terms or entity patterns via a patterns.jsonl file, which lets you specify a label:

{"label": "DISEASE", "pattern": [{"lower": "fieber"}]}
{"label": "MEDICATION", "pattern": [{"lower": "aspirin"}]}

Prodigy uses the label to decide which patterns to apply. You can use one large patterns file containing all labels if you like, even if you're only training one label at a time. For example, if you set --label DISEASE on ner.teach, Prodigy should only use the patterns for that label, even if your file contains other patterns as well.
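To illustrate the filtering, here's a simplified sketch of the logic – not Prodigy's actual internals:

```python
# Simplified sketch of the label filtering – not Prodigy's actual internals.
import srsly

patterns = list(srsly.read_jsonl("patterns.jsonl"))
# with --label DISEASE, only the DISEASE patterns end up being applied
disease_patterns = [p for p in patterns if p["label"] == "DISEASE"]
```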