Text classification, adding words to bootstrap list after creating a dataset

paige · June 7, 2018, 1:10pm

Hi there! Was wondering if it was possible to add words to your bootstrap list after having already created your dataset. I ask because sometimes while annotating, I stumble upon words that correspond well with the lexical field and feel the urge to add them to the already existing bootstrap list.

ines · June 7, 2018, 1:38pm

This is currently not easily possible while the server is running, because by then the pattern matcher has already been created, and there’s no easy way to update it with new patterns while also updating your text classifier in the loop.

However, you can always add more terms to your terms dataset manually, and then create new patterns. The db-in command lets you import annotations and it supports the same file formats as Prodigy’s loaders. So if you have a text file with one term per line, you could do:

prodigy db-in terms_dataset new_terms.txt
prodigy terms.to-patterns terms_dataset /patterns.jsonl --label LABEL

This will import your new terms, mark them as accepted and create a new patterns list from the updated dataset.
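If you jot down terms while annotating, a small helper can turn them into the one-term-per-line text file used above. This is only a rough sketch in plain Python: the function name write_terms_file and the case-insensitive de-duplication are my own assumptions, not something Prodigy provides or requires.

```python
from pathlib import Path


def write_terms_file(terms, path):
    """Write unique, non-empty terms to `path`, one term per line.

    Hypothetical helper for illustration; not part of Prodigy.
    Duplicates are detected case-insensitively and the first spelling wins.
    """
    seen = set()
    unique_terms = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique_terms.append(cleaned)
    # One term per line, with a trailing newline after each term.
    Path(path).write_text("".join(t + "\n" for t in unique_terms), encoding="utf-8")
    return unique_terms
```

Calling write_terms_file(["goalkeeper", "penalty kick"], "new_terms.txt") would give you a file you can pass to db-in as shown above.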

(Btw, I’m now thinking of ways to update the pattern matcher “live” while the server is running… It’s probably possible, but it’d require a custom recipe, some tricks and some experimentation, so I’m not sure if it’s worth it. For example, you could use the ner_manual interface instead of the classification interface, with a single label like ADD_TO_SEEDS. If you come across a term you want to add, you highlight it. The update callback in the recipe then checks the answers it receives back for "spans" and if it finds any, it updates the PatternMatcher with a new pattern. This pattern would then already be applied to the next batch. Basically, it's similar to how Prodigy does the active learning. But this is just an idea, and I haven’t actually tested it yet.)
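To make the idea above a bit more concrete, here is an untested sketch of just the part that turns highlighted spans into patterns. It assumes the answers are dicts with "text", "answer" and "spans" (each span having "start" and "end" character offsets), and that a pattern is a label plus a list of lowercase token descriptions. The function name patterns_from_answers is made up, whitespace splitting only approximates real tokenization, and actually passing the result to the PatternMatcher inside an update callback is left out because I haven't tried it.

```python
def patterns_from_answers(answers, label):
    """Build token patterns from spans highlighted in accepted answers.

    Hypothetical helper for illustration; not part of Prodigy.
    """
    patterns = []
    seen = set()
    for eg in answers:
        # Only use examples the annotator accepted.
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            term = eg["text"][span["start"]:span["end"]]
            # Assumption: splitting on whitespace is close enough to tokens.
            tokens = term.lower().split()
            key = tuple(tokens)
            if tokens and key not in seen:
                seen.add(key)
                patterns.append(
                    {"label": label, "pattern": [{"lower": t} for t in tokens]}
                )
    return patterns
```

In a custom recipe, the update callback could call this on each batch of answers and hand any new patterns to the matcher before the next batch is served.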
