Thanks for the report! Are you using the latest version, Prodigy v1.4.1?
And are you using a patterns file with examples of the labels you're trying to annotate? When you start from scratch, Prodigy has no concept of any of those labels, and if you just stream in a lot of raw data, it can take a lot of time for the model to learn.
To pre-select texts based on keyword matches, you can pass in a patterns.jsonl file via the --patterns argument. The patterns file can include token descriptions like the ones used by spaCy's Matcher. For example:
{"label": "CITY", "pattern": [{"lower": "new"}, {"lower": "york"}]}
{"label": "CITY", "pattern": [{"lower": "paris"}]}
If you're classifying whether a text is about a city, the above patterns tell Prodigy to select all texts mentioning "new york" or "paris" and label them "CITY", so you can say yes or no to them. You can find more details on this in your PRODIGY_README.html. Our video tutorial on training a new entity type also shows the usage of patterns. The patterns in textcat.teach work the same way.
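If you have a list of keywords per label, you don't need to write the patterns file by hand. Here's a minimal sketch that generates the two patterns above. The helper names make_patterns and write_jsonl are made up for this example, and the whitespace split is an assumption: spaCy's tokenizer may split some phrases differently (for example around punctuation), so check the output for anything more complex than plain words.

```python
import json


def make_patterns(label, phrases):
    """Build one token-based pattern per phrase, matching on lowercase tokens."""
    patterns = []
    for phrase in phrases:
        # Naive whitespace split (assumption): one token description per word
        tokens = [{"lower": word.lower()} for word in phrase.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


patterns = make_patterns("CITY", ["New York", "Paris"])
write_jsonl("patterns.jsonl", patterns)
```

The resulting patterns.jsonl can then be passed to the recipe via --patterns.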
Your script looks good, but you might not even need it. Prodigy supports loading in data from a .txt file, and the built-in loader includes a few additional checks, too, like making sure your example texts aren't empty strings.
prodigy textcat.teach lpdata3 en_core_web_sm user_says_from_orig_data_23_march_2.txt --label l1,l2,l3,l4,l5 etc.
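If you do want to keep your own script, the empty-string check is easy to add. This is only a rough sketch of the idea, not Prodigy's actual loader code. The function name texts_to_tasks is made up, and it assumes one example per line.

```python
def texts_to_tasks(lines):
    """Yield one {"text": ...} task per non-empty line."""
    for line in lines:
        text = line.strip()
        if not text:
            # Skip blank or whitespace-only lines instead of creating empty tasks
            continue
        yield {"text": text}


# Works with any iterable of lines, including an open file object
sample = ["hello there\n", "\n", "   \n", "book a flight\n"]
tasks = list(texts_to_tasks(sample))
```

In a real script, you'd pass in the opened .txt file instead of the sample list.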
I'm not sure I understand this correctly. Do you mean the total number of examples you're annotating? Prodigy's textcat.teach recipe uses the model in the loop to suggest the most relevant examples for annotation. These are usually the ones that the model is most unsure about, i.e. the ones with a prediction closest to 0.5. This also means that Prodigy won't ask you about all examples, only the most important ones that produce the best possible training data. This is usually faster and more efficient than labelling every single example.
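To illustrate the "closest to 0.5" idea, here's a simplified sketch that ranks a small batch of scored texts by uncertainty. It's not how Prodigy implements it internally (Prodigy works on a stream and its sorting is more involved). The function names and the example scores are made up.

```python
def uncertainty(score):
    """Return a value in [0, 1]: highest when the score is exactly 0.5."""
    return 1.0 - abs(score - 0.5) * 2


def most_uncertain(scored_examples, n):
    """Return the n (score, text) pairs whose score is closest to 0.5."""
    ranked = sorted(scored_examples, key=lambda pair: uncertainty(pair[0]), reverse=True)
    return ranked[:n]


# Hypothetical model scores for one label
scored = [
    (0.97, "flights to new york"),
    (0.52, "the big apple in winter"),
    (0.08, "how do I reset my password"),
    (0.41, "paris hilton interview"),
]
top = most_uncertain(scored, 2)
```

Here, the texts scored 0.52 and 0.41 are selected, and the confident predictions at 0.97 and 0.08 are skipped. That's why you won't see every example from your file.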
If you need to annotate all examples in your dataset to create gold-standard annotations, you might want to use a different recipe instead and skip the active learning component. For example, you could use the choice interface, display each example with your label options and let the annotator select one or more. You can find an example of this in the custom recipes workflow here.
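As a rough sketch, the core of such a recipe could look like this. The function names are made up, and the code leaves out the recipe decorator and the loader so it runs without Prodigy installed. In a real custom recipe, you'd wrap the function with @prodigy.recipe and return the same kind of dictionary. Double-check the exact settings, such as "choice_style", against your PRODIGY_README.html.

```python
def add_options(stream, labels):
    """Attach the same label options to every incoming task."""
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task = dict(task)  # copy so the original task isn't modified
        task["options"] = options
        yield task


def choice_components(dataset, stream, labels):
    """Return the components a custom recipe would hand back to Prodigy."""
    return {
        "dataset": dataset,  # dataset to save annotations to
        "view_id": "choice",  # use the choice interface
        "stream": add_options(stream, labels),
        # Assumption: allow selecting more than one label per example
        "config": {"choice_style": "multiple"},
    }


stream = [{"text": "hello there"}, {"text": "book a flight"}]
components = choice_components("lpdata3", stream, ["l1", "l2", "l3"])
annotation_tasks = list(components["stream"])
```

Since there's no model in the loop here, every example in the stream is shown exactly once.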