Just posted an update on this thread with a new version of `textcat.teach` that uses the `PatternMatcher`. (It still includes the entity labels, but you can easily filter them out using a function like yours above.)
You could write a little wrapper for your stream that checks the `_input_hash`, which will be identical for tasks with the same text, and either merges the spans or removes the duplicates. (This depends on how you want the tasks to look, i.e. whether you want all matches to be highlighted, or just the first one.)
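Here's a rough sketch of what such a wrapper could look like, in plain Python. The function names `drop_duplicate_tasks` and `merge_spans_by_input_hash` are made up for illustration and aren't part of Prodigy. The sketch assumes each task is a dict with an `_input_hash` key and an optional `spans` list whose entries have `start` and `end` offsets.

```python
from copy import deepcopy


def drop_duplicate_tasks(stream):
    """Yield only the first task seen for each _input_hash (first match wins)."""
    seen = set()
    for task in stream:
        key = task["_input_hash"]
        if key not in seen:
            seen.add(key)
            yield task


def merge_spans_by_input_hash(stream):
    """Combine tasks with the same _input_hash into one task with all spans.

    Note: this reads the whole stream before yielding anything, so it only
    suits streams that fit in memory.
    """
    merged = {}  # dicts preserve insertion order, so task order is kept
    for task in stream:
        key = task["_input_hash"]
        if key not in merged:
            # Copy so the incoming task is not modified
            merged[key] = deepcopy(task)
            merged[key]["spans"] = list(task.get("spans", []))
            continue
        existing = merged[key]["spans"]
        for span in task.get("spans", []):
            if span not in existing:  # skip exact duplicate spans
                existing.append(span)
    for task in merged.values():
        # Assumes every span has "start" and "end" character offsets
        task["spans"].sort(key=lambda s: (s["start"], s["end"]))
        yield task
```

You'd wrap the stream in one of these before returning it from the recipe, e.g. `stream = merge_spans_by_input_hash(stream)`. The dedupe version works lazily, while the merge version has to see every task first, so it's only a good fit for finite streams.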
Ah yes, sorry if this was confusing. We just ended up using `eg` because it's short.