textcat_teach from a file fails to launch

Hi,

I'm trying to adjust the textcat.teach recipe to use prefer_high_scores.
For this I copied the textcat_teach recipe from GitHub and pasted it into a new *.py file, "custom_teach.py".

The default textcat.teach recipe seems to work fine, and the server starts like so:
$ prodigy textcat.teach nar_5 blank:en all_fin.txt --loader txt --label "nar_5_proration" --patterns anno_patterns.jsonl
Without having changed anything in the code yet, when I call the recipe from the file I created with
$ prodigy textcat.teach nar_5 blank:en all_fin.txt --loader txt --label "nar_5_proration" --patterns anno_patterns.jsonl -F custom_teach.py
I get the error message

usage: prodigy textcat.teach [-h] [-l None] [-p None] [-e None] dataset spacy_model source
prodigy textcat.teach: error: unrecognized arguments: --loader txt

which seems odd, because it should be the exact same code as in the default recipe. At this point, I hadn't even changed anything to prefer_high_scores yet.
I then added a loader for TXT to the recipe in my file.
Now there seems to be an issue with the db-out JSONL file that I passed as patterns (even though this file didn't raise any issues with the default recipe); it already fails at the first line:

`ValueError: Invalid pattern: {'text': 'It is anticipated that the Final Settlement Date will be July 7, 2020, the first business day after the Expiration Time.', '_input_hash': 964872305, '_task_hash': 137696480, 'options': [{'id': 'nar_sum_ins_per_bo', 'text': 'nar_sum_ins_per_bo'}, {'id': 'nar_sum_bo_discl', 'text': 'nar_sum_bo_discl'}, {'id': 'nar_sum_paper', 'text': 'nar_sum_paper'}, {'id': 'nar_sum_withdraw', 'text': 'nar_sum_withdraw'}, {'id': 'nar_1_details', 'text': 'nar_1_details'}, {'id': 'nar_2_consent', 'text': 'nar_2_consent'}, {'id': 'nar_3_instruct', 'text': 'nar_3_instruct'}, {'id': 'nar_4_proceed', 'text': 'nar_4_proceed'}, {'id': 'nar_5_proration', 'text': 'nar_5_proration'}, {'id': 'nar_6_doc', 'text': 'nar_6_doc'}, {'id': 'nar_7_restrictions', 'text': 'nar_7_restrictions'}, {'id': 'settle_date', 'text': 'settle_date'}, {'id': 'miex', 'text': 'miex'}, {'id': 'milt', 'text': 'milt'}, {'id': 'bois', 'text': 'bois'}, {'id': 'accrued', 'text': 'accrued'}, {'id': 'consent_fees', 'text': 'consent_fees'}, {'id': 'early_fees', 'text': 'early_fees'}, {'id': 'offer_price', 'text': 'offer_price'}], '_session_id': None, '_view_id': 'choice', 'config': {'choice_style': 'multiple'}, 'accept': ['settle_date'], 'answer': 'accept'}

So I decided to try a new file "nar_5_patterns.jsonl" that should match the expected patterns syntax, which looks like this:
{"label":"nar_5_proration","pattern":"As a result, no series of Notes accepted for 3 purchase will be prorated."} ...
with
$ prodigy textcat.teach nar_custom blank:en all_fin.txt --loader txt --label "nar_5_proration" --patterns nar_5_patterns.jsonl -F custom_teach.py
and I now get the error message

✘ Failed to load task (invalid JSON on line 1)

What am I missing?
Why would it work when loaded as default but not from a file? Is the recipe I got from GitHub outdated?
Is there another way to access and copy the textcat_teach recipe? I figure it must be stored somewhere on my system, since it's accessed when launching the server, but I haven't been able to find it yet.

Thanks in advance!

Hi! Sorry to hear you were having trouble!

This error happens because the simplified recipe example on GitHub doesn't have a --loader argument (see the function arguments and the recipe argument annotations in the @prodigy.recipe decorator). The example recipe calls the JSONL loader directly – so replacing it with the TXT loader like you did is fine.
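For reference, here's a minimal sketch of what that swap looks like in the simplified recipe (arguments trimmed for brevity – the real recipe on GitHub has a few more):

import prodigy
from prodigy.components.loaders import TXT  # instead of JSONL

@prodigy.recipe(
    "textcat.teach",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline", "positional", None, str),
    source=("Path to the source data", "positional", None, str),
)
def textcat_teach(dataset, spacy_model, source):
    # The loader is hard-coded here, which is why there's no --loader
    # argument on the command line – swap it in the code instead
    stream = TXT(source)  # was: stream = JSONL(source)
    # ... rest of the recipe stays the same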

The db-out command just exports an annotated dataset, which will be in Prodigy's JSON format. The patterns format is a bit different, because patterns should describe a label plus the text/tokens you're looking to match in the documents. See here for an example: Loaders and Input Data · Prodigy · An annotation tool for AI, Machine Learning & NLP. So the error you saw happened because the data you provided as patterns didn't have a "pattern" or "label" key.
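In other words, each line of a patterns file should be a JSON object with exactly those two keys – the "pattern" value can be an exact string or a list of token attribute dicts. For instance (using the label from your commands):

{"label": "nar_5_proration", "pattern": "will be prorated"}
{"label": "nar_5_proration", "pattern": [{"lower": "proration"}]}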

This typically happens when loading the source data as JSON – are you sure you're using the TXT loader in your recipe, and not accidentally loading the file as JSON? That would explain why Python's json.loads raises an error for invalid JSON here (because that's pretty much exactly what this custom error wraps).

Yes, you can also look at the recipes shipped with Prodigy! The easiest way is to run prodigy stats and find the location of your Prodigy installation. You can then look in the recipes directory.
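For example, something like this (the exact path will differ on your machine):

$ prodigy stats
# look for the "Location" line in the output, e.g. .../site-packages/prodigy
# the built-in recipes then live in .../site-packages/prodigy/recipes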

Hi @ines ,

thanks a lot for your detailed answer!
I realized that I didn't change stream = JSONL(source) to stream = TXT(source) :man_facepalming:

I can now start the server. However, with prefer_high_scores, the annotation session ends with "No tasks available" quite quickly every time (after ~30 annotations), with at most one match. (Update: the same happens with prefer_uncertain and prefer_low_scores.)
I have a dataset with about 9,000 sentences, and based on a keyword search I expect around 280 of them to carry the label in question.
My patterns file contains 42 sentences.
Is my patterns file simply too small to yield more highly scored results?

On another note: is there a way to create a patterns file from the db-out output rather than from the dataset name itself, similar to terms.to-patterns? I have such a file from a former colleague who didn't create a patterns file, though...

Thanks again for your help!

No worries, glad you got it working!

Are these full sentences that occur exactly like that in the texts? The patterns select examples based on whether the pattern matches – so maybe you're simply not getting enough matches? In that case, your model will start off knowing nothing, learn very little, and then never predict anything meaningful. Can you make the patterns a bit more general and focus on certain words or expressions instead? It's not a bad thing if the patterns produce false positives, either – those are also examples you want the model to see, so it can learn that a certain expression is an indicator for a label only in some contexts, not others.
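To illustrate with your own example: instead of a whole-sentence pattern like

{"label": "nar_5_proration", "pattern": "As a result, no series of Notes accepted for purchase will be prorated."}

a keyword pattern such as

{"label": "nar_5_proration", "pattern": [{"lower": "prorated"}]}

will match every sentence containing that token, which gives the model many more candidate examples to score.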

The terms.to-patterns command was designed with datasets in mind, so one option would be to use db-in to import the file, and then run terms.to-patterns to create the patterns. A bit inconvenient, though. Another option would be to just adapt the code: if you look at the file recipes/terms.py in your Prodigy installation, you can see how it's implemented. The code that creates the patterns is actually very straightforward:

import spacy
import srsly  # bundled with Prodigy and spaCy, reads/writes JSONL

examples = list(srsly.read_jsonl("db_out_export.jsonl"))  # path to your annotations to convert
nlp = spacy.blank("en")  # or whatever
case_sensitive = False  # or True
label = "nar_5_proration"  # the label to assign to each pattern

def get_pattern(term, label):
    # Build a token-based pattern from the tokenized term,
    # or fall back to an exact string pattern if there's no nlp object
    if nlp is not None:
        if case_sensitive:
            pattern = [{"text": t.text} for t in nlp.make_doc(term)]
        else:
            pattern = [{"lower": t.lower_} for t in nlp.make_doc(term)]
    else:
        pattern = term
    return {"label": label, "pattern": pattern}

# Use the "word" key if present, otherwise the "text", of all accepted examples
terms = [eg.get("word", eg["text"]) for eg in examples if eg["answer"] == "accept"]
patterns = [get_pattern(term, label) for term in terms]
srsly.write_jsonl("patterns.jsonl", patterns)  # one JSON object per line

Thanks again for the detailed explanations!

Are these full sentences that occur exactly like that in the texts?

The sentences in my patterns file are very similar to the ones I'm looking for – some might even be identical.

Can you make the patterns a bit more general, and focus more on certain words or expressions?

I'll definitely try that! Some keywords should appear in the majority of sentences, so I guess this might be a good way to go. Would it make sense to add these keywords to the file I have, so whole sentences can be learned as well, or should I start off with keywords only?
Related to that: is it a good idea to use an empty model, blank:en, as I've been doing? I figured one like en_core_web_sm might still be too big to be affected by the few annotations I'm adding?

Yes, that sounds like a good plan. The pattern matches are only used to select the examples – they won't have an impact on the actual model, so it's fine if they're just keywords. The matcher looks for exact matches, so if you feed in whole sentences, you'd only see a match if the exact sentence occurs.

In your case, it doesn't matter, because you're training a new text classifier from scratch anyway :slightly_smiling_face: There's no pretrained text classifier in the en_core_web_sm pipeline, and it doesn't have any word vectors either, so the result should be the same as starting with blank:en. You could try a package with word vectors, though – that might help the model get accurate faster.

The only other argument for using a trained pipeline instead of blank:en is related to the patterns: if you're using token-based patterns that rely on the model's predictions (e.g. match a word only if it's a verb, or match a word based on its lemma), you'll obviously need a pipeline that can give you those predictions. But if you're only using exact string matches in your patterns, that doesn't matter.
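For example, the first pattern here needs a pipeline with a tagger/lemmatizer, while the second works fine with blank:en (label reused from above for illustration):

{"label": "nar_5_proration", "pattern": [{"lemma": "prorate"}]}
{"label": "nar_5_proration", "pattern": "prorated"}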