Issues with text classification, Invalid Pattern of JSON files for terminology list

Hi there

I am trying to build a text classification model that identifies sentences belonging to the History class. I have done the following steps:

  1. Build a terminology list (seed terms) in Prodigy with the following commands:
python -m prodigy dataset History_seed "Collect seeds for History"
python -m prodigy terms.teach History_seed en_core_web_lg --seeds history.txt
  2. Export the seeds collected in the previous step as a JSONL file with the following command:
python -m prodigy db-out History_seed > History_terms.jsonl
  3. Annotate sentences that belong to a particular class with the following commands:
python -m prodigy dataset History_anno "collect annotations History"
python -m prodigy textcat.teach History_anno en_core_web_lg history_train.jsonl --label History --patterns History_terms.jsonl

(history_train.jsonl is the JSONL file containing the training set, which I created earlier.)

Unfortunately, I get an error when running the last command, which seems to suggest that the pattern format of the terminology JSONL file is not recognised. This is the error:

D:\usersp\admin\Desktop\Prodigy\History>python -m prodigy textcat.teach History_anno en_core_web_lg history_train.jsonl --label HISTORY --patterns History_terms.jsonl
Using 1 labels: HISTORY
Traceback (most recent call last):
  File "D:\usersp\admin\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\usersp\admin\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\usersp\admin\Anaconda3\lib\site-packages\prodigy\__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "D:\usersp\admin\Anaconda3\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "D:\usersp\admin\Anaconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "D:\usersp\admin\Anaconda3\lib\site-packages\prodigy\recipes\textcat.py", line 58, in teach
    matcher = matcher.from_disk(patterns)
  File "cython_src\prodigy\models\matcher.pyx", line 192, in prodigy.models.matcher.PatternMatcher.from_disk
  File "cython_src\prodigy\models\matcher.pyx", line 118, in prodigy.models.matcher.PatternMatcher.add_patterns
  File "cython_src\prodigy\models\matcher.pyx", line 55, in prodigy.models.matcher.create_matchers
  File "cython_src\prodigy\models\matcher.pyx", line 29, in prodigy.models.matcher.parse_patterns
ValueError: Invalid pattern: {'text': 'pmhx', 'answer': 'accept', '_input_hash': -1857482619, '_task_hash': -1534345529}

I have previously done text classification in the same manner and was able to proceed with the annotation. Could you tell me what the issue is?

This is an example of the content of the History_terms.jsonl file:

{"text":"years","answer":"accept","_input_hash":-766644989,"_task_hash":-303167857}
{"text":"history","answer":"accept","_input_hash":-718773650,"_task_hash":-52608476}
{"text":"medical history","answer":"accept","_input_hash":-1321831807,"_task_hash":-2121037943}
{"text":"university","meta":{"score":0.7690472063},"_input_hash":-968060743,"_task_hash":1617902080,"answer":"accept"}
{"text":"education","meta":{"score":0.7611488893},"_input_hash":389341424,"_task_hash":-303253017,"answer":"reject"}
{"text":"student","meta":{"score":0.7539709622},"_input_hash":1449487300,"_task_hash":1808011982,"answer":"accept"}

Thanks in advance!
I'm using Prodigy version 1.6.1.

Hi! Thanks for sharing your detailed workflow and commands. All looks good – I think there's just a small problem here:

The db-out command will export the exact contents of the dataset – so pretty much exactly what you saw on the screen when you annotated. For example:

{"text": "student", "answer": "accept"}

The pattern files, on the other hand, are token-based descriptions of the terms plus a label:

{"label": "HISTORY", "pattern": [{"lower": "student"}]}
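If you want a quick sanity check before passing a file to --patterns, a small script along these lines can flag lines that are not in pattern format. This is only a sketch: it assumes every pattern entry needs both a "label" and a "pattern" key, as in the example above, and the helper name find_non_patterns is made up for illustration and is not part of Prodigy.

```python
import json


def find_non_patterns(lines):
    """Return (line_number, record) pairs that lack a "label" or "pattern" key."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: a valid pattern entry carries both of these keys.
        if "label" not in record or "pattern" not in record:
            problems.append((number, record))
    return problems


# One exported annotation (db-out style) and one pattern entry.
sample = [
    '{"text": "student", "answer": "accept"}',
    '{"label": "HISTORY", "pattern": [{"lower": "student"}]}',
]
for number, record in find_non_patterns(sample):
    print(f"line {number} is not a pattern: {record}")
```

To check a real file, you could pass an open file handle instead of the sample list, for example find_non_patterns(open("History_terms.jsonl", encoding="utf8")).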

To create those, Prodigy provides a handy recipe terms.to-patterns that takes a dataset name and label and creates a patterns file using the accepted terms:

prodigy terms.to-patterns History_seed --label HISTORY > history_terms.jsonl
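To make the conversion more concrete, here is a rough Python sketch of the same idea: keep only the accepted terms and wrap each one in a token pattern with the label. This is not Prodigy's actual implementation. The function name terms_to_patterns is hypothetical, and splitting multi-word terms on whitespace is an assumption (the real recipe may tokenize differently).

```python
import json


def terms_to_patterns(records, label):
    """Turn accepted term annotations into token-based pattern entries."""
    patterns = []
    for record in records:
        if record.get("answer") != "accept":
            continue  # rejected or ignored terms are left out
        # Assumption: one lowercase token per whitespace-separated word.
        tokens = [{"lower": word.lower()} for word in record["text"].split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


# Records shaped like the db-out export shown earlier in this thread.
records = [
    {"text": "medical history", "answer": "accept"},
    {"text": "education", "answer": "reject"},
    {"text": "student", "answer": "accept"},
]
for pattern in terms_to_patterns(records, "HISTORY"):
    print(json.dumps(pattern))
```

With these sample records, the rejected term "education" is dropped and "medical history" becomes a two-token pattern.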

Thanks a lot :)