Seeds not recognized by textcat.teach

Vincsi · May 3, 2018, 4:38pm

Hi, I’m new to prodigy and I’m successfully running the following code to label some data

python -m prodigy textcat.teach testing en_core_web_sm sentences.txt --loader TXT --label POSITIVE --exclude testing

I want to add some seed terms (seeds are topic specific). My seed list is a txt file that looks like the following:
ABLE
ABUNDANCE
ABUNDANT
ACCLAIMED
ACCOMPLISH
ACCOMPLISHED

I tried the following code
python -m prodigy textcat.teach testing en_core_web_sm sentences.txt --loader TXT --label POSITIVE --exclude testing --seeds positive.txt

but I get the following error:
prodigy textcat.teach: error: unrecognized arguments: --seeds positive.txt

I also tried to add the terms into a dataset and use it as seed but I get the same error
Thanks for the help

Vincenzo

ines · May 3, 2018, 4:48pm

Ah, sorry about that! If you're using the latest version of Prodigy, the --seeds argument has been replaced with a --patterns argument – see here for the available recipe arguments, and here for an example of the patterns. You'll also find more info about this in your PRODIGY_README.html. We should probably add a note about this to our text classification video!

The change makes textcat.teach consistent with ner.teach, and gives you more flexibility for the seed terms. Instead of just terms, you can now also specify token patterns, similar to the patterns for spaCy's rule-based Matcher. For example:

{"label": "POSITIVE", "pattern": [{"lower": "able"}]}

This will match all tokens whose lowercase form equals "able". You can also write more complex rules that take into account part-of-speech tags or dependency labels – which sounds like it might help a lot for your use case.

You can then save the patterns as a .jsonl file and load them in via the --patterns argument:

python -m prodigy textcat.teach testing en_core_web_sm sentences.txt --loader TXT --label POSITIVE --exclude testing --patterns patterns.jsonl
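For a plain word list like the one in the question, the patterns file can also be generated with a few lines of Python. The following is only a minimal sketch, not part of Prodigy: the function name seeds_to_patterns and the file names are made up for illustration, and it assumes one seed term per line. Multi-word terms are split on whitespace, which may not match spaCy's tokenization for terms that contain punctuation.

```python
import json
from pathlib import Path


def seeds_to_patterns(seed_path, out_path, label):
    """Convert a one-term-per-line seed file into a patterns JSONL file.

    Hypothetical helper for illustration; not a Prodigy API.
    """
    text = Path(seed_path).read_text(encoding="utf8")
    patterns = []
    for line in text.splitlines():
        term = line.strip()
        if not term:
            continue  # skip blank lines
        # One token description per whitespace-separated word. Lowercasing
        # the value is required, because "lower" is compared against the
        # lowercase form of each token.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    # Write one JSON object per line (JSONL)
    with Path(out_path).open("w", encoding="utf8") as f:
        for entry in patterns:
            f.write(json.dumps(entry) + "\n")
    return patterns
```

Calling seeds_to_patterns("positive.txt", "patterns.jsonl", "POSITIVE") would turn the seed ABLE into the pattern shown above, and the resulting file can be passed to --patterns.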

To experiment with different match patterns and how to capture different types of phrases, you can also try out our demo:

Vincsi · May 3, 2018, 7:57pm

Excellent! Thank you Ines, I’ll try the new command with the JSON file

SamD · September 5, 2018, 2:50pm

Dear Ines,

I'm currently experiencing a very similar problem. I'm following your text classification video and have created a terms dataset the same way you did:

prodigy dataset names_seed

Successfully added 'names_seed' to database SQLite.

prodigy terms.teach names_seed en_vectors_web_lg --seeds "****, ****, ****, ****"
Initialising with 4 seed terms: ****, ****, ****, ****

Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C
Saved 1000 annotations to database SQLite
Dataset: names_seed
Session ID: 2018-09-05_15-51-30

Now I would like to continue with

prodigy dataset names

Successfully added 'names' to database SQLite.

textcat.teach names en_core_web_sm /Users/student/Downloads/out.txt --label NAME --seeds names_seed

As you said, the --seeds argument has been renamed, but if I just replace it with --patterns, the command doesn't run because it expects a JSON file. How can I use the terms dataset within Prodigy?

Thanks in advance
Sam

ines · September 5, 2018, 7:28pm

You can use the terms.to-patterns recipe to convert your names_seed dataset to a patterns.jsonl file, for example:

prodigy terms.to-patterns names_seed /tmp/patterns.jsonl --label NAME

This will create a file patterns.jsonl, which you can then use by setting --patterns /path/to/patterns.jsonl. See here for an example.

SamD · September 6, 2018, 7:18am

Thanks, got that working now in a couple of minutes. Little question, what’s the difference between the previous seeds and the current patterns?

ines · September 6, 2018, 12:16pm

The patterns are much more flexible, because they allow you to describe multi-word phrases and token attributes, similar to the patterns for spaCy's rule-based matcher.

For example, let's say you're classifying news articles and you want to train a model to detect whether a text is about a company sale. Good indicators of this might be words like "acquired", "acquires", "acquire" etc. Instead of using all possible forms as seed terms, you could write a pattern that matches all words that have the lemma "acquire":

{"label": "COMPANY_SALE", "pattern": [{"lemma": "acquire"}]}

You can also incorporate other token attributes (lowercase form, shape, alphanumeric vs. numbers), as well as statistical predictions like the part-of-speech tags, dependency labels and entity types. The following pattern will match a token with the lemma "be" ("am", "is", "was" etc.), followed by a determiner (an article, e.g. "the", "a"), followed by a noun:

{"label": "SOME_LABEL", "pattern": [{"lemma": "be"}, {"pos": "DET"}, {"pos": "NOUN"}]}

Here are some slides for a talk I gave on this topic that have some more details on the approach, as well as pattern examples:

To explore different types of patterns and test them on your texts, you might also find our interactive Matcher demo useful:

SamD · September 6, 2018, 1:11pm

Great thanks!

I’ll most definitely take a look at those links!

rp-navera · January 23, 2019, 6:45pm

Hi Ines,

My colleague and I are also new to Prodi.gy and following along with the Insults Classifier demo. Your explanation here to replace seeds with patterns helped us keep moving through the demo. Question: in the demo you mention that the “via_seeds” parameter is sometimes on while annotating. Now that we’re using the patterns paradigm instead, is there a way to tell that the patterns/seeds are being used to select stuff to annotate? I ask because it feels like a lot of the annotations I’m doing are not at all related to the original seed terms I created. Thank you!

ines · January 23, 2019, 6:59pm

Yes, it should say "Pattern" and the pattern ID (line number of the pattern in your patterns file) in the bottom right corner now. If you're only seeing the score and no patterns reference, the suggestion comes from the model.

If you're not seeing any pattern matches, the most common explanation is that the patterns don't match, for example because the tokens they describe don't match the actual tokenization. A quick way to test your patterns is to use a recipe like ner.match, which will just stream in the matches with no model suggestions. If nothing is matched, the patterns don't apply. You might also find our matcher demo useful (see link above), which lets you build and test match patterns interactively.
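Before running a recipe, it can also help to scan the patterns file for entries that can never match. The sketch below is not a Prodigy feature and does not replace testing against the real tokenizer; the function name find_pattern_problems is made up for illustration. It only checks two things: that every line is valid JSON, and that "lower" values are actually lowercase and contain no whitespace (assuming spaCy's default tokenization, where a regular token never spans a space). The reported numbers are line numbers in the file, which corresponds to the pattern ID described above.

```python
import json


def find_pattern_problems(lines):
    """Return (line_number, message) tuples for patterns that cannot match.

    `lines` is any iterable of JSONL strings, e.g. an open file.
    Hypothetical helper for illustration; not a Prodigy API.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        pattern = entry.get("pattern") if isinstance(entry, dict) else None
        if not isinstance(pattern, list):
            continue  # only token-based patterns are checked here
        for token in pattern:
            if not isinstance(token, dict):
                continue
            # Attribute names may be written in lowercase or uppercase
            value = token.get("lower", token.get("LOWER"))
            if not isinstance(value, str):
                continue
            if value != value.lower():
                problems.append((number, f"'lower' value {value!r} is not lowercase"))
            if any(ch.isspace() for ch in value):
                problems.append((number, f"'lower' value {value!r} spans several tokens"))
    return problems
```

For example, a line like {"label": "INSULT", "pattern": [{"lower": "ABLE"}]} would be flagged, because the lowercase form of a token never equals "ABLE". A multi-word value should instead be written as one token description per word.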

rp-navera · January 23, 2019, 8:27pm

Ok, thanks for the tips. The patterns are just a list of words translated into patterns by:

prodigy terms.to-patterns insults_seeds /tmp/insults_patterns.jsonl --label INSULT

And now I finally am seeing some pattern matches. Thank you for the super speedy response.
