Seeds not recognized by textcat.teach

solved
textcat
#1

Hi, I’m new to Prodigy and I’m successfully running the following command to label some data:

python -m prodigy textcat.teach testing en_core_web_sm sentences.txt --loader TXT --label POSITIVE --exclude testing

I want to add some seed terms (the seeds are topic-specific). My seed list is a .txt file that looks like the following:
ABLE
ABUNDANCE
ABUNDANT
ACCLAIMED
ACCOMPLISH
ACCOMPLISHED

I tried the following command:
python -m prodigy textcat.teach testing en_core_web_sm sentences.txt --loader TXT --label POSITIVE --exclude testing --seeds positive.txt

but I get the following error:
prodigy textcat.teach: error: unrecognized arguments: --seed positive.txt

I also tried adding the terms to a dataset and using that as the seed, but I get the same error.
Thanks for the help

Vincenzo

(Ines Montani) #2

Ah, sorry about that! If you’re using the latest version of Prodigy, the --seeds argument has been replaced with a --patterns argument – see here for the available recipe arguments, and here for an example of the patterns. You’ll also find more info about this in your PRODIGY_README.html. We should probably add a note about this to our text classification video! :+1:

The change makes textcat.teach consistent with ner.teach, and gives you more flexibility for the seed terms. Instead of just terms, you can now also specify token patterns, similar to the patterns for spaCy’s rule-based Matcher. For example:

{"label": "POSITIVE", "pattern": [{"lower": "able"}]}

This will match all tokens whose lowercase form equals “able”. You can also write more complex rules that take into account part-of-speech tags or dependency labels – which sounds like it might help a lot for your use case.

You can then save the patterns as a .jsonl file and load them in via the --patterns argument:

python -m prodigy textcat.teach testing en_core_web_sm sentences.txt --loader TXT --label POSITIVE --exclude testing --patterns patterns.jsonl
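
If you already have a plain list of seed terms like the one in the first post, a small script can turn it into a patterns file. The following is only a minimal sketch using the Python standard library: the function names `terms_to_patterns` and `write_jsonl` are made up for illustration and are not part of Prodigy, and splitting multi-word terms on whitespace is an assumption that only approximates real tokenization.

```python
import json
from pathlib import Path


def terms_to_patterns(terms, label):
    """Turn plain seed terms into token patterns on the lowercase form."""
    patterns = []
    for term in terms:
        # Assumption: whitespace splitting is close enough to the tokenizer
        # for simple terms. One {"lower": ...} entry describes one token.
        words = term.lower().split()
        if not words:
            continue  # skip blank lines
        patterns.append({"label": label, "pattern": [{"lower": w} for w in words]})
    return patterns


def write_jsonl(patterns, path):
    """Write one JSON object per line (the .jsonl layout)."""
    with Path(path).open("w", encoding="utf8") as f:
        for entry in patterns:
            f.write(json.dumps(entry) + "\n")
```

For example, reading the lines of positive.txt, passing them to `terms_to_patterns(lines, "POSITIVE")` and writing the result with `write_jsonl(..., "patterns.jsonl")` would produce a file that can be passed to --patterns. Note that the values are lowercased, since an uppercase value like "ABLE" can never equal a token's lowercase form.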

To experiment with different match patterns and how to capture different types of phrases, you can also try out our demo:

#3

Excellent! Thank you Ines, I’ll try the new command with the JSON file

(Sam) #4

Dear Ines,

I’m currently experiencing a very similar problem. I’m following your text classification video. I’ve made a significant terms dataset the way you did it.

prodigy dataset names_seed

:sparkles: Successfully added ‘names_seed’ to database SQLite.

prodigy terms.teach names_seed en_vectors_web_lg --seeds "****, ****, ****, ****"
Initialising with 4 seed terms: ****, ****, ****, ****

:sparkles: Starting the web server at http://localhost:8080
Open the app in your browser and start annotating!

^C
Saved 1000 annotations to database SQLite
Dataset: names_seed
Session ID: 2018-09-05_15-51-30

Now I would like to continue with

prodigy dataset names

:sparkles: Successfully added ‘names’ to database SQLite.

prodigy textcat.teach names en_core_web_sm /Users/student/Downloads/out.txt --label NAME --seeds names_seed

As you said, the --seeds argument has been replaced, but if I just swap it for --patterns, the command doesn’t run because it’s looking for a JSON file. How can I use the terms dataset within Prodigy?

Thanks in advance
Sam

(Ines Montani) #5

You can use the terms.to-patterns recipe to convert your names_seed dataset to a patterns.jsonl file, for example:

prodigy terms.to-patterns names_seed /tmp/patterns.jsonl --label NAME

This will create a file patterns.jsonl, which you can then use by setting --patterns /path/to/patterns.jsonl. See here for an example.

(Sam) #6

Thanks, I got that working within a couple of minutes. One small question: what’s the difference between the previous seeds and the current patterns?

(Ines Montani) #7

The patterns are much more flexible, because they allow you to describe multi-word phrases and token attributes, similar to the patterns for spaCy’s rule-based matcher.

For example, let’s say you’re classifying news articles and you want to train a model to detect whether a text is about a company sale. Good indicators of this might be words like “acquired”, “acquires”, “acquire” etc. Instead of using all possible forms as seed terms, you could write a pattern that matches all words that have the lemma “acquire”:

{"label": "COMPANY_SALE", "pattern": [{"lemma": "acquire"}]}

You can also incorporate other token attributes (lowercase form, shape, alphanumeric vs. numbers), as well as statistical predictions like the part-of-speech tags, dependency labels and entity types. The following pattern will match a token with the lemma “be” (“am”, “is”, “was” etc.), followed by a determiner (an article, e.g. “the”, “a”), followed by a noun:

{"label": "SOME_LABEL", "pattern": [{"lemma": "be"}, {"pos": "DET"}, {"pos": "NOUN"}]}
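
To make the "one dictionary per token" idea concrete, here is a toy sketch of how such a pattern lines up with consecutive tokens. This is a simplified stand-in written for illustration, not spaCy's Matcher or Prodigy's implementation: `token_matches` and `find_matches` are hypothetical helpers, and the token annotations are written by hand, whereas in practice the lemmas and part-of-speech tags come from the model's predictions.

```python
def token_matches(token, spec):
    """A token fits a spec if every attribute named in the spec has that value."""
    return all(token.get(attr) == value for attr, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) index pairs where the pattern fits consecutive tokens."""
    n = len(pattern)
    if n == 0:
        return []
    matches = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            matches.append((start, start + n))
    return matches


# Hand-annotated tokens for "Google is a company" (illustrative values only).
tokens = [
    {"lower": "google", "lemma": "Google", "pos": "PROPN"},
    {"lower": "is", "lemma": "be", "pos": "AUX"},
    {"lower": "a", "lemma": "a", "pos": "DET"},
    {"lower": "company", "lemma": "company", "pos": "NOUN"},
]
pattern = [{"lemma": "be"}, {"pos": "DET"}, {"pos": "NOUN"}]
print(find_matches(tokens, pattern))  # [(1, 4)], i.e. "is a company"
```

The point of the sketch is that each entry in the pattern has to line up with exactly one token, in order. If the predicted tag or lemma for any one of those tokens differs from what the pattern describes, the whole pattern does not match at that position.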

Here are some slides for a talk I gave on this topic that have some more details on the approach, as well as pattern examples:

To explore different types of patterns and test them on your texts, you might also find our interactive Matcher demo useful:

(Sam) #8

Great thanks!

I’ll most definitely take a look at those links!

(Robyn P) #9

Hi Ines,

My colleague and I are also new to Prodi.gy and following along with the Insults Classifier demo. Your explanation here to replace seeds with patterns helped us keep moving through the demo. Question: in the demo you mention that the “via_seeds” parameter is sometimes on while annotating. Now that we’re using the patterns paradigm instead, is there a way to tell that the patterns/seeds are being used to select stuff to annotate? I ask because it feels like a lot of the annotations I’m doing are not at all related to the original seed terms I created. Thank you!

(Ines Montani) #10

Yes, it should say “Pattern” and the pattern ID (line number of the pattern in your patterns file) in the bottom right corner now. If you’re only seeing the score and no patterns reference, the suggestion comes from the model.

If you’re not seeing any pattern matches, the most common explanation is that the patterns don’t match – for example, because the tokens they describe don’t match the actual tokenization. A quick way to test your patterns is to use a recipe like ner.match, which will just stream in the matches with no model suggestions. If nothing is matched, the patterns don’t apply. You might also find our matcher demo useful (see link above), which lets you build and test match patterns interactively.
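
Before running a recipe, it can also help to sanity-check the patterns file itself for entries that cannot match. The sketch below is an assumption-laden helper, not a Prodigy feature: `lint_patterns` is a hypothetical name, the checks only cover token patterns on the lowercase form like the ones shown in this thread, and the reported line numbers are counted from 1, which may or may not be how the pattern ID in the app is counted.

```python
import json


def lint_patterns(lines):
    """Yield (line_number, message) for pattern lines that look suspicious."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            yield number, f"not valid JSON: {err.msg}"
            continue
        if not isinstance(entry, dict) or "label" not in entry or "pattern" not in entry:
            yield number, "expected an object with 'label' and 'pattern' keys"
            continue
        pattern = entry["pattern"]
        if not isinstance(pattern, list):
            continue  # only token patterns (lists of dicts) are checked here
        for spec in pattern:
            if not isinstance(spec, dict):
                continue
            for key, value in spec.items():
                if key.lower() != "lower" or not isinstance(value, str):
                    continue
                if value != value.lower():
                    # A lowercase form can never equal a value with capitals.
                    yield number, f"'lower' value {value!r} contains uppercase"
                if any(ch.isspace() for ch in value):
                    # Assumption: the tokenizer splits on whitespace, so a
                    # multi-word value is unlikely to fit a single token.
                    yield number, f"'lower' value {value!r} contains whitespace"
```

Running `list(lint_patterns(open("patterns.jsonl", encoding="utf8")))` on a file would return an empty list if none of these checks fire. That does not prove the patterns will match your texts, only that they avoid a few mistakes that guarantee no matches.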

(Robyn P) #11

Ok, thanks for the tips. The patterns are just a list of words translated into patterns by:

prodigy terms.to-patterns insults_seeds /tmp/insults_patterns.jsonl --label INSULT

And now I finally am seeing some pattern matches. Thank you for the super speedy response.