Text classification, adding words to bootstrap list after creating a dataset

Hi there! Was wondering if it was possible to add words to your bootstrap list after having already created your dataset. I ask because sometimes while annotating, I stumble upon words that correspond well with the lexical field and feel the urge to add them to the already existing bootstrap list.

This is currently not easily possible while the server is running. By then, the pattern matcher has already been created, and there’s no easy way to update it with new patterns while also updating your text classifier in the loop.

However, you can always add more terms to your terms dataset manually, and then create new patterns. The db-in command lets you import annotations and it supports the same file formats as Prodigy’s loaders. So if you have a text file with one term per line, you could do:

prodigy db-in terms_dataset new_terms.txt
prodigy terms.to-patterns terms_dataset /patterns.jsonl --label LABEL

This will import your new terms, mark them as accepted and create a new patterns list from the updated dataset.
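
If you collect terms as you annotate, a small helper along these lines could produce the text file for db-in. It's only a sketch: write_terms is a hypothetical name, and the case-insensitive de-duplication is my assumption, not something Prodigy requires.

```python
from pathlib import Path


def write_terms(terms, path="new_terms.txt"):
    """Write unique, non-empty terms to a text file, one term per line."""
    seen = set()
    unique = []
    for term in terms:
        cleaned = term.strip()
        # Skip blanks and case-insensitive duplicates (an assumption of this sketch).
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            unique.append(cleaned)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

The resulting file can then be passed to db-in as in the commands above.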

(Btw, I’m now thinking of ways to update the pattern matcher “live” while the server is running… It’s probably possible, but it’d require a custom recipe, some tricks and some experimentation, so I’m not sure if it’s worth it. For example, you could use the ner_manual interface instead of the classification interface, with a single label, ADD_TO_SEEDS or something. If you come across a term you want to add, you highlight it. The update callback in the recipe then checks the answers it receives back for "spans" and, if it finds any, updates the PatternMatcher with a new pattern. This pattern would then already be applied to the next batch. Basically, a similar logic to how Prodigy does the active learning. But this is just an idea, and I haven’t actually tested it yet.)
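
To make the idea a bit more concrete, here is an untested sketch of the part of the update callback that turns highlighted spans into new patterns. It's plain Python and doesn't call Prodigy at all. The function name patterns_from_answers is hypothetical, the whitespace tokenization is a simplification (the real tokenizer may split terms differently), and I'm assuming each answer is a dict with "text", "answer" and "spans" holding character offsets. Handing the returned patterns to the PatternMatcher is left out on purpose, since that's the part that would need experimentation.

```python
def patterns_from_answers(answers, label, seen=None):
    """Build token patterns from spans highlighted in accepted answers.

    answers: list of task dicts with "text", "answer" and optional "spans"
    label:   the label the new patterns should carry (not ADD_TO_SEEDS)
    seen:    optional set of lowercased terms that already have a pattern
    """
    seen = set() if seen is None else seen
    new_patterns = []
    for eg in answers:
        if eg.get("answer") != "accept":
            continue
        text = eg.get("text", "")
        for span in eg.get("spans", []):
            # Recover the highlighted term from its character offsets.
            term = text[span["start"]:span["end"]].strip()
            key = term.lower()
            if not term or key in seen:
                continue
            seen.add(key)
            # Naive whitespace split; one {"lower": ...} entry per token.
            tokens = [{"lower": tok} for tok in key.split()]
            new_patterns.append({"label": label, "pattern": tokens})
    return new_patterns
```

Passing the same seen set on every call keeps a term from being added twice across batches.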