Help Using Patterns

Hello,

I am trying to create a pattern that finds a combination of words separated by an unknown number of whitespaces and/or tokens. So, for example, I have multiple pieces of text, each being fed as an annotation to textcat.teach:

“Hey, can you please verify your zipcode.”

“Could you verify your name, address and zipcode.”

“To verify your account, can you please confirm your zipcode?”

In the above example, I want to create a pattern that matches on the combination of the words “verify” and “zipcode”, regardless of the amount of whitespace and tokens that separate them within each given text. Is this possible?

Prodigy's token match patterns follow the same style as spaCy's Matcher patterns and are based on the already tokenized text (so you usually don't have to worry about whitespace). You can also use operators and quantifiers like "OP": "*" to match a token zero or more times, or use an empty dictionary {} for "any token".
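For the "verify ... zipcode" case, a minimal sketch of such a pattern could look like the following. It only builds the pattern and prints it as one JSONL line of the kind you would put in a patterns file. The label name `VERIFY_ZIPCODE` and the helper `to_patterns_line` are made up for illustration, and the sketch assumes the label matches one you pass via `--label`.

```python
import json

# Hypothetical label name, for illustration only.
LABEL = "VERIFY_ZIPCODE"

# "verify", then any number of arbitrary tokens, then "zipcode".
# A token dict with only {"OP": "*"} means "any token, zero or more times".
pattern = [
    {"lower": "verify"},
    {"OP": "*"},
    {"lower": "zipcode"},
]


def to_patterns_line(label, token_pattern):
    """Serialize one patterns-file entry (label + token pattern) as JSON."""
    return json.dumps({"label": label, "pattern": token_pattern})


line = to_patterns_line(LABEL, pattern)
print(line)
```

Keep in mind that the wildcard is not limited to a single sentence, so in a longer text "verify" and "zipcode" can be far apart and still match.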

If possible, I would still recommend keeping the patterns as short as possible and focusing on the most important words and phrases. You'll still click accept/reject later on anyways, and keeping the patterns open also helps you identify and annotate false positives. For example, you might want to use a pattern that's only [{"lower": "verify"}]. This will probably match a lot of other stuff, too, but doing a few hundred of those annotations often helps with getting a better feeling for the data, and identifying more patterns. (Plus, you get to teach your text classifier that "verify" is an important keyword – but only in certain contexts.)

You could also experiment with including token attributes like part-of-speech tags or dependencies. Here's an example of one of your sentences and the POS tags and dependencies predicted by the small English model:

import spacy

nlp = spacy.load("en_core_web_sm")  # the small English model
doc = nlp("Hey, can you please verify your zipcode")
# Print each token with its predicted part-of-speech tag and dependency label
print([(t.text, t.pos_, t.dep_) for t in doc])
# [('Hey', 'INTJ', 'intj'), (',', 'PUNCT', 'punct'),
# ('can', 'VERB', 'aux'), ('you', 'PRON', 'nsubj'),
# ('please', 'INTJ', 'intj'), ('verify', 'VERB', 'ROOT'),
# ('your', 'ADJ', 'poss'), ('zipcode', 'NOUN', 'dobj')]

You could look for tokens "zipcode" that are direct objects ({"lower": "zipcode", "dep": "dobj"}) – this would match "verify your zipcode" but not "my zipcode is". Another one would be the token "verify", but only if it's the root of a sentence ({"lower": "verify", "dep": "ROOT"}). Combinations of "verify"/"confirm", "your" plus noun also seem like good candidates – for example:

[{"lower": "confirm"}, {"lower": "your"}, {"pos": "NOUN"}]

To test your patterns, you might also find our new matcher demo useful:

You can create token patterns on the left (one box stands for one token) and enter your text on the right. The matches are highlighted, and checking "Show tokens" lets you see how spaCy tokenizes the text and what matches (and what doesn't).

Thank you so much! This is very helpful! :grin:

@ines Hey, thanks for your reply. I got some useful information from it, but I still have some questions. I want to use active learning to help train a sentence classifier that recognizes how machine learning (ML) algorithms are used in sentences. For example:
Simply mention: One of them is that the SVM is very sensitive to outliers or noises because of over-fitting problems.
Use but do not compare: A support vector machines (SVM) model achieved the best overall performance and was selected to conduct a data-based sensitivity analysis.
Use and Compare: K-nearest neighbors (KNN), support vector machine (SVM), and multinomial logistic regression (MLR) were applied for classification modeling.

First, I need to collect some verbs (see below) from sentences that contain at least one ML algorithm. Second, I will create patterns from those verbs to match/annotate the sentences in the corpus.
A (Simply mention): be
B (Use): use, apply, and so on.
C (Compare): outperform, underperform, and so on.
So, my question is: how can I create a pattern that matches sentences containing two verbs at the same time, one from B and one from C (Use and Compare)? Thanks!

I have the same question. How did you solve it, @DanielG?

Hi! Did you see my comment above? I explained how matcher patterns can be used to solve this exact use case.

You might find the spaCy documentation on rule-based matching and creating matcher patterns helpful. This should give you an overview of how the patterns work, the different operators and how you can write custom patterns to express your logic:

In your example, you could create a pattern that matches a token that's a verb with the lemma "use", zero or more tokens in between, and a token that's a verb with the lemma "compare". Or the other way around. For example:

[{"lemma": "use"}, {"OP": "*"}, {"lemma": "compare"}]

Or, to include multiple lemmas:

[{"lemma": {"IN": ["use", "apply"]}}, {"OP": "*"}, {"lemma": {"IN": ["compare", "outperform"]}}]
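Since the two verbs can appear in either order, one option is to generate both orderings from the same lemma lists and write them as patterns-file entries. This is only a sketch: the helper names `gap_pattern` and `both_orders` are made up, the lemma lists are just the examples from this thread, and the label `USE+COMPARE` is assumed to be one of the labels you pass via `--label`.

```python
import json

# Example lemma lists from this thread; extend them with your own verbs.
USE_LEMMAS = ["use", "apply"]
COMPARE_LEMMAS = ["compare", "outperform"]


def gap_pattern(first, second):
    """One lemma from `first`, any tokens in between, one lemma from `second`."""
    return [
        {"lemma": {"IN": first}},
        {"OP": "*"},
        {"lemma": {"IN": second}},
    ]


def both_orders(label, a, b):
    """Build two entries so the match does not depend on verb order."""
    return [
        {"label": label, "pattern": gap_pattern(a, b)},
        {"label": label, "pattern": gap_pattern(b, a)},
    ]


entries = both_orders("USE+COMPARE", USE_LEMMAS, COMPARE_LEMMAS)
for entry in entries:
    # One JSON object per line, as expected in a .jsonl patterns file
    print(json.dumps(entry))
```

You can redirect the printed lines into a `.jsonl` file and pass that file via `--patterns`.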

Thanks for your reply, @ines. I saw your comments above and was able to solve the questions I asked. However, I am now facing a new issue when I run the command in the PyCharm terminal. The output is below. How could this happen? Do you have any idea what causes it? Thanks!

(venv) C:\Users\Jayshow\PycharmProjects\prodigy>python -m prodigy textcat.teach sentscat_algoFunc en_core_web_trf ./algorithmSents_demo.jsonl --label MENTION,USE-COMPARE,USE+COMPARE,EXTEND-COMPARE,EXTEND+COMPARE --patterns ./CUE_USE+EXTEND-COMPARE_REGX_.jsonl
Using 5 label(s): MENTION, USE-COMPARE, USE+COMPARE, EXTEND-COMPARE, EXTEND+COMPARE
C:\Users\Jayshow\PycharmProjects\prodigy\venv\lib\site-packages\toolz\itertoolz.py:242: UserWarning: [W036] The component 'matcher' does not have
 any patterns defined.
  yield next(itr)

✨  Starting the web server at http://localhost:8082 ...
Open the app in your browser and start annotating!

⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.
⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.
⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.
⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.

Below are the python libraries installed in the env of my laptop.

(venv) C:\Users\Jayshow\PycharmProjects\prodigy>pip list
Package             Version
------------------- ---------
0x-json-schemas     2.1.0
aiofiles            0.7.0
argon2-cffi         20.1.0
arxivcheck          0.3.2
async-generator     1.10
attrs               21.2.0
backcall            0.2.0
bibtexparser        1.2.0
bleach              3.3.0
blis                0.7.4
boto3               1.18.18
botocore            1.21.18
cachetools          4.2.2
catalogue           2.0.6
certifi             2021.5.30
cffi                1.14.5
chardet             4.0.0
click               7.1.2
colorama            0.4.4
contextvars         2.4
cymem               2.0.5
dataclasses         0.6
decorator           5.0.9
defusedxml          0.7.1
doi2bib             0.4.0
en-core-web-lg      2.3.0
en-core-web-md      2.3.0
en-core-web-sm      2.2.0
en-core-web-trf     3.1.0
entrypoints         0.3
et-xmlfile          1.1.0
fastapi             0.68.2
feedparser          6.0.8
filelock            3.0.12
ftfy                5.9
future              0.18.2
h11                 0.9.0
huggingface-hub     0.0.12
idna                2.10
immutables          0.15
importlib-metadata  4.6.0
ipykernel           5.5.5
ipython             7.16.1
ipython-genutils    0.2.0
ipywidgets          7.6.3
jedi                0.18.0
Jinja2              3.0.1
jmespath            0.10.0
joblib              1.0.1
jsonschema          3.2.0
jupyter             1.0.0
jupyter-client      6.1.12
jupyter-console     6.4.0
jupyter-core        4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets  1.0.0
lxml                4.6.3
MarkupSafe          2.0.1
mistune             0.8.4
murmurhash          1.0.5
mypy-extensions     0.4.3
nbclient            0.5.3
nbconvert           6.0.7
nbformat            5.1.3
nest-asyncio        1.5.1
notebook            6.4.0
numpy               1.19.5
openpyxl            3.0.7
packaging           20.9
pandas              1.1.5
pandocfilters       1.4.3
parso               0.8.2
pathy               0.6.0
pdfminer            20191125
peewee              3.14.4
pickleshare         0.7.5
Pillow              8.3.0
pip                 21.3.1
plac                1.1.3
preshed             3.0.5
prodigy             1.11.6
prometheus-client   0.11.0
prompt-toolkit      3.0.19
pycparser           2.20
pycryptodome        3.10.1
pydantic            1.7.4
Pygments            2.9.0
PyJWT               2.3.0
pyparsing           2.4.7
pypdf2xml           0.2
pyrsistent          0.18.0
python-dateutil     2.8.1
pytokenizations     0.7.2
pytz                2021.1
pywin32             301
pywinpty            1.1.3
PyYAML              6.0
pyzmq               22.1.0
qtconsole           5.1.0
QtPy                1.9.0
regex               2021.8.3
requests            2.25.1
s3transfer          0.5.0
sacremoses          0.0.45
scihub2pdf          0.4.2
selenium            3.141.0
Send2Trash          1.7.1
sense2vec           2.0.0
sentencepiece       0.1.96
setuptools          57.0.0
sgmllib3k           1.0.0
six                 1.16.0
smart-open          5.1.0
spacy               3.1.4
spacy-alignments    0.8.3
spacy-legacy        3.0.8
spacy-transformers  1.0.6
srsly               2.4.2
starlette           0.14.2
stringcase          1.2.0
terminado           0.10.1
testpath            0.5.0
thinc               8.0.12
title2bib           0.4.1
tokenizers          0.10.3
toolz               0.11.1
torch               1.9.0
torchcontrib        0.0.2
tornado             6.1
tqdm                4.61.1
traitlets           4.3.3
transformers        4.9.2
typer               0.3.2
typing-extensions   3.10.0.0
Unidecode           1.2.0
urllib3             1.26.6
uvicorn             0.13.4
wasabi              0.8.2
wcwidth             0.2.5
webencodings        0.5.1
websockets          8.1
widgetsnbextension  3.5.1
xlrd                2.0.1
zipp                3.4.1

Here is the logging output from running the command. BTW, I got a research license in summer 2021, and it might have expired by now. Could that be the cause of the issue? Sorry, I am new to Prodigy.

(venv) C:\Users\Jayshow\PycharmProjects\prodigy>python -m prodigy textcat.manual sentscat_algoFunc_demo3 ./algorithmSents_demo.jsonl --label MENTION,USE-COMPARE
21:38:40: INIT: Setting all logging levels to 10
21:38:40: RECIPE: Calling recipe 'textcat.manual'
Using 2 label(s): MENTION, USE-COMPARE
21:38:40: RECIPE: Starting recipe textcat.manual
{'exclude': None, 'exclusive': False, 'label': ['MENTION', 'USE-COMPARE'], 'loader': None, 'source': './algorithmSents
_demo.jsonl', 'dataset': 'sentscat_algoFunc_demo3'}

21:38:40: RECIPE: Annotating with 2 labels
['MENTION', 'USE-COMPARE']

21:38:40: LOADER: Using file extension 'jsonl' to find loader
./algorithmSents_demo.jsonl

21:38:40: LOADER: Loading stream from jsonl
21:38:40: LOADER: Rehashing stream
21:38:40: CONFIG: Using config from global prodigy.json
C:\Users\Jayshow\.prodigy\prodigy.json

21:38:40: VALIDATE: Validating components returned by recipe
21:38:40: CONTROLLER: Initialising from recipe
{'before_db': None, 'config': {'labels': ['MENTION', 'USE-COMPARE'], 'choice_style': 'multiple', 'choice_auto_accept':
 False, 'exclude_by': 'input', 'auto_count_stream': True, 'dataset': 'sentscat_algoFunc_demo3', 'recipe_name': 'textca
t.manual', 'theme': 'basic', 'custom_theme': {}, 'buttons': ['accept', 'reject', 'ignore', 'undo'], 'batch_size': 10,
'history_size': 10, 'port': 8082, 'host': 'localhost', 'cors': True, 'db': 'sqlite', 'db_settings': {}, 'api_keys': {}
, 'validate': True, 'auto_exclude_current': True, 'instant_submit': False, 'feed_overlap': True, 'ui_lang': 'en', 'pro
ject_info': ['dataset', 'session', 'lang', 'recipe_name', 'view_id', 'label'], 'show_stats': False, 'hide_meta': False
, 'show_flag': False, 'instructions': False, 'swipe': False, 'split_sents_threshold': False, 'html_template': False, '
global_css': None, 'javascript': None, 'writing_dir': 'ltr', 'show_whitespace': False}, 'dataset': 'sentscat_algoFunc_
demo3', 'db': True, 'exclude': None, 'get_session_id': None, 'metrics': None, 'on_exit': None, 'on_load': None, 'progr
ess': <prodigy.components.progress.ProgressEstimator object at 0x0000029E8594BF98>, 'self': <prodigy.core.Controller o
bject at 0x0000029E8594BAC8>, 'stream': <generator object at 0x0000029E85971340>, 'update': None, 'validate_answer': N
one, 'view_id': 'choice'}

21:38:40: VALIDATE: Creating validator for view ID 'choice'
21:38:40: VALIDATE: Validating Prodigy and recipe config
21:38:40: CONFIG: Using config from global prodigy.json
C:\Users\Jayshow\.prodigy\prodigy.json

21:38:40: DB: Initializing database SQLite
21:38:40: DB: Connecting to database SQLite
21:38:40: DB: Creating dataset '2022-01-01_21-38-40'
{'created': datetime.datetime(2022, 1, 1, 15, 55, 37)}

21:38:40: FEED: Initializing from controller
{'auto_count_stream': True, 'batch_size': 10, 'dataset': 'sentscat_algoFunc_demo3', 'db': <prodigy.components.db.Datab
ase object at 0x0000029E8594BFD0>, 'exclude': ['sentscat_algoFunc_demo3'], 'exclude_by': 'input', 'max_sessions': 10,
'overlap': True, 'self': <prodigy.components.feeds.Feed object at 0x0000029E85963D30>, 'stream': <generator object at
0x0000029E85971340>, 'target_total_annotated': None, 'timeout_seconds': 3600, 'total_annotated': 0, 'total_annotated_b
y_session': Counter(), 'validator': <prodigy.components.validate.Validator object at 0x0000029E8594B898>, 'view_id': '
choice'}

21:38:40: PREPROCESS: Add multiple choice options for 2 labels
21:38:40: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x0000029E859710E0>, 'warn_fn': <bound method Print
er.warn of <wasabi.printer.Printer object at 0x0000029E84D64240>>, 'warn_threshold': 0.4}

21:38:40: FILTER: Filtering out empty examples for key 'text'
21:38:40: CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at http://localhost:8082 ...
Open the app in your browser and start annotating!

INFO:     Started server process [2956]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8082 (Press CTRL+C to quit)
INFO:     ::1:62230 - "GET / HTTP/1.1" 200 OK
INFO:     ::1:62230 - "GET /bundle.js HTTP/1.1" 200 OK
21:42:39: GET: /project
{'labels': ['MENTION', 'USE-COMPARE'], 'choice_style': 'multiple', 'choice_auto_accept': False, 'exclude_by': 'input',
 'auto_count_stream': True, 'dataset': 'sentscat_algoFunc_demo3', 'recipe_name': 'textcat.manual', 'theme': 'basic', '
custom_theme': {}, 'buttons': ['accept', 'reject', 'ignore', 'undo'], 'batch_size': 10, 'history_size': 10, 'port': 80
82, 'host': 'localhost', 'cors': True, 'db': 'sqlite', 'validate': True, 'auto_exclude_current': True, 'instant_submit
': False, 'feed_overlap': True, 'ui_lang': 'en', 'project_info': ['dataset', 'session', 'lang', 'recipe_name', 'view_i
d', 'label'], 'show_stats': False, 'hide_meta': False, 'show_flag': False, 'instructions': False, 'swipe': False, 'spl
it_sents_threshold': False, 'html_template': False, 'global_css': None, 'javascript': None, 'writing_dir': 'ltr', 'sho
w_whitespace': False, 'view_id': 'choice', 'version': '1.11.6'}

INFO:     ::1:62230 - "GET /project HTTP/1.1" 200 OK
21:42:39: POST: /get_session_questions
21:42:39: CONTROLLER: Getting batch of questions for session: None
21:42:39: FEED: Finding next batch of questions in stream
⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.
INFO:     ::1:62230 - "POST /get_session_questions HTTP/1.1" 400 Bad Request
INFO:     ::1:62333 - "GET / HTTP/1.1" 200 OK
INFO:     ::1:62333 - "GET /bundle.js HTTP/1.1" 200 OK
21:48:11: GET: /project
{'labels': ['MENTION', 'USE-COMPARE'], 'choice_style': 'multiple', 'choice_auto_accept': False, 'exclude_by': 'input',
 'auto_count_stream': True, 'dataset': 'sentscat_algoFunc_demo3', 'recipe_name': 'textcat.manual', 'theme': 'basic', '
custom_theme': {}, 'buttons': ['accept', 'reject', 'ignore', 'undo'], 'batch_size': 10, 'history_size': 10, 'port': 80
82, 'host': 'localhost', 'cors': True, 'db': 'sqlite', 'validate': True, 'auto_exclude_current': True, 'instant_submit
': False, 'feed_overlap': True, 'ui_lang': 'en', 'project_info': ['dataset', 'session', 'lang', 'recipe_name', 'view_i
d', 'label'], 'show_stats': False, 'hide_meta': False, 'show_flag': False, 'instructions': False, 'swipe': False, 'spl
it_sents_threshold': False, 'html_template': False, 'global_css': None, 'javascript': None, 'writing_dir': 'ltr', 'sho
w_whitespace': False, 'view_id': 'choice', 'version': '1.11.6'}

INFO:     ::1:62333 - "GET /project HTTP/1.1" 200 OK
21:48:11: POST: /get_session_questions
21:48:11: CONTROLLER: Getting batch of questions for session: None
21:48:11: FEED: Finding next batch of questions in stream
⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.
INFO:     ::1:62333 - "POST /get_session_questions HTTP/1.1" 400 Bad Request
INFO:     ::1:62333 - "GET /favicon.ico HTTP/1.1" 200 OK
INFO:     ::1:61782 - "GET / HTTP/1.1" 200 OK
INFO:     ::1:61782 - "GET /bundle.js HTTP/1.1" 200 OK
22:07:10: GET: /project
{'labels': ['MENTION', 'USE-COMPARE'], 'choice_style': 'multiple', 'choice_auto_accept': False, 'exclude_by': 'input',
 'auto_count_stream': True, 'dataset': 'sentscat_algoFunc_demo3', 'recipe_name': 'textcat.manual', 'theme': 'basic', '
custom_theme': {}, 'buttons': ['accept', 'reject', 'ignore', 'undo'], 'batch_size': 10, 'history_size': 10, 'port': 80
82, 'host': 'localhost', 'cors': True, 'db': 'sqlite', 'validate': True, 'auto_exclude_current': True, 'instant_submit
': False, 'feed_overlap': True, 'ui_lang': 'en', 'project_info': ['dataset', 'session', 'lang', 'recipe_name', 'view_i
d', 'label'], 'show_stats': False, 'hide_meta': False, 'show_flag': False, 'instructions': False, 'swipe': False, 'spl
it_sents_threshold': False, 'html_template': False, 'global_css': None, 'javascript': None, 'writing_dir': 'ltr', 'sho
w_whitespace': False, 'view_id': 'choice', 'version': '1.11.6'}

INFO:     ::1:61782 - "GET /project HTTP/1.1" 200 OK
22:07:11: POST: /get_session_questions
22:07:11: CONTROLLER: Getting batch of questions for session: None
22:07:11: FEED: Finding next batch of questions in stream
⚠ The running recipe is configured for multiple annotators using named sessions
with feed_overlap=True, but a client is requesting questions using the default
session. For this recipe, open the app with ?session=name added to the URL or
set feed_overlap to False in your configuration.
INFO:     ::1:61782 - "POST /get_session_questions HTTP/1.1" 400 Bad Request
INFO:     ::1:61782 - "GET /favicon.ico HTTP/1.1" 200 OK

Based on your screenshot, it looks like the server isn't running correctly, but there don't seem to be any problems in your logs and the server is active. Are you serving Prodigy locally on localhost or on a remote server? Does the server keep running, or does it exit at the end? If it exits, did you double-check that you're not running out of memory (since you're using a transformer model in the loop)?

If you only got your license in 2021, it'll still be active, and it won't have any effect on runtime usage. Prodigy never connects to our servers at runtime; the license only affects whether you can download it or not.
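One more thing that may be worth trying: the warning repeated in the logs above names two remedies itself, either opening the app with `?session=name` added to the URL or setting `feed_overlap` to false in the configuration. The sketch below just spells out both. The helper `session_url` and the session name "alex" are placeholders, and whether this resolves the 400 responses in this particular setup is an assumption.

```python
import json
from urllib.parse import urlencode


def session_url(host, port, session_name):
    """Build the app URL with a named session in the query string."""
    return f"http://{host}:{port}/?{urlencode({'session': session_name})}"


# Option 1: open the app with a named session ("alex" is a placeholder).
print(session_url("localhost", 8082, "alex"))

# Option 2: the setting to add to prodigy.json instead.
print(json.dumps({"feed_overlap": False}))
```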

Thanks for your quick response @ines . I am serving Prodigy locally on localhost on my laptop.

Sorry, how can I tell whether the server is still running or has exited?

I think there is enough memory to run the code. I changed the model to en_core_web_sm and got the same issue as the one I posted above. So weird!

Hi @ines. I have some good news. I created a new virtualenv and installed spaCy and Prodigy from scratch, and now it works well. I've attached the version information below; I hope it helps people who run into the same issue.

python==3.8
Prodigy==1.10.3
spaCy==2.3.2

That's great to know, maybe you just ended up in a weird environment state!

Just answering this for completeness: if your command line prompt (where you can type in stuff, usually indicated by a blinking cursor) comes back, this usually indicates that the server has stopped. If not, this means the process is still running.
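If you'd rather check from a second terminal, a generic TCP check can tell you whether anything is still listening on the port. This is not a Prodigy API, just a small sketch using Python's standard library. The function name `is_listening` is made up, and the default host and port are taken from the logs above.

```python
import socket


def is_listening(host="localhost", port=8082, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable
        return False


print(is_listening())
```

A result of True only means that some process accepts connections on that port; it doesn't confirm that the process is Prodigy.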