Seeding text categorization with phrases

I just posted an update on this thread with a new version of textcat.teach that uses the PatternMatcher. (It still includes the entity labels, but you can easily filter them out using a function like yours above.)

You could write a little wrapper for your stream that checks the _input_hash, which will be identical for tasks with the same text, and either merges the spans, or removes the duplicates. (This depends on how you want the tasks to look – i.e. if you want all matches to be highlighted, or just the first one.)
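Here's a rough sketch of what such a wrapper could look like. It's not part of Prodigy: the function names `drop_duplicate_inputs` and `merge_spans_by_input` are made up for illustration, and it assumes each task is a dict that already has an `_input_hash` and (optionally) a `"spans"` list with `start`, `end` and `label` keys. Note that the merging version has to read the whole stream before it yields anything, so it only makes sense for a finite stream.

```python
import copy


def drop_duplicate_inputs(stream):
    """Keep only the first task for each _input_hash (first match wins)."""
    seen = set()
    for eg in stream:
        input_hash = eg["_input_hash"]
        if input_hash in seen:
            continue  # same text was already sent out, skip it
        seen.add(input_hash)
        yield eg


def merge_spans_by_input(stream):
    """Combine tasks with the same _input_hash into one task with all spans.

    Assumption: the stream is finite, because everything is buffered first.
    """
    merged = {}  # _input_hash -> task; dicts keep insertion order
    known = {}   # _input_hash -> set of (start, end, label) already added
    for eg in stream:
        input_hash = eg["_input_hash"]
        if input_hash not in merged:
            task = copy.deepcopy(eg)  # don't modify the incoming task
            task["spans"] = []
            merged[input_hash] = task
            known[input_hash] = set()
        for span in eg.get("spans", []):
            key = (span["start"], span["end"], span.get("label"))
            if key not in known[input_hash]:
                known[input_hash].add(key)
                merged[input_hash]["spans"].append(copy.deepcopy(span))
    for task in merged.values():
        # sort so the highlights appear in text order
        task["spans"].sort(key=lambda s: (s["start"], s["end"]))
        yield task
```

In a recipe you'd then wrap the stream before returning it, e.g. `stream = merge_spans_by_input(stream)`. One thing to keep in mind: after merging, the `_task_hash` of the first task no longer reflects the combined spans, so you may want to rehash the merged tasks.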

Ah yes – sorry if this was confusing. We just ended up using eg because it's short.