Error while using ner.match for pattern matching

Hi there
I am trying to match a list of patterns to text. My pattern file looks like this:

{"label": "Action", "pattern": [{"lower": "RIH"}]}
{"label": "Equipment", "pattern": [{"lower": "ps-21"}, {"lower": "slips"}]}

My input CSV looks like this:

id,text
65ff62f85f222e98ac292682c4f7eee8,Installed PS21 slips. RU auto pipe handling system.
50fb94effa05495d39636894121086bc,Installed RST toolstring into lubricator.
466dd1bf0dd8ed7661ff4bcb4cf5fde9,Installed Riser centralizer in tension deck.
cabadddd5a88eaf89ba6f134f52fcca2,Installed Slick joint and Diverter.
256fae0b8ee0410824d15f9d62deb3f4,"Installed Spider. Time Wind speed Wind direction Sea Wave direction Knots deg m deg 21:30 4/8 160° 1,7/2,7 350°"
093e2f523b1b01d54a731cf725ec5b19,"Installed PS-21 slips. Continued RIH w/ 7"" liner on 5½"" DP from 487m to 3036m, filling every 5th stand. Entered 9 5/8"" liner @ 2404m without any obstructions."

I used this command:
prodigy ner.match sample_dataset en_core_web_sm my_csv.csv --patterns sample_patterns.jsonl

I end up with these problems:

  1. It starts with the last input sentence and tags only “PS-21 slips” as Equipment. RIH is not identified even though it occurs in the text. (I also changed the case and checked; it still does not come up.)
  2. It finishes after that one sentence, and I get “no tasks available”. If I delete that line from the file and run it again, it gives an error. The error message looks like this:

Traceback (most recent call last):
  File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 60, in prodigy.core.Controller.__init__
ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.

I get the same error when I try to use any other files. I am stuck here. Could you help me with how to proceed?

By changed case, did you mean you also tested it with "lower": "rih"? With your current patterns, it makes sense that the following will never match:

{"label": "Action", "pattern": [{"lower": "RIH"}]}

This pattern is looking for a token whose lowercase form matches "RIH". This will never be true, since that string is uppercase.
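To make that concrete, here is a simplified pure-Python model of the comparison. It is only a sketch, not spaCy's actual implementation, and `lower_matches` is a hypothetical helper introduced for illustration:

```python
def lower_matches(token_text: str, pattern_value: str) -> bool:
    """Simplified stand-in for a {"lower": ...} token check:
    the token's lowercase form is compared to the pattern value as-is."""
    return token_text.lower() == pattern_value


token_variants = ("RIH", "rih", "Rih")

# The pattern value "RIH" is uppercase, so no lowercased token can equal it.
uppercase_value_hits = [lower_matches(t, "RIH") for t in token_variants]

# The pattern value "rih" matches every casing of the token.
lowercase_value_hits = [lower_matches(t, "rih") for t in token_variants]

print(uppercase_value_hits)
print(lowercase_value_hits)
```

The pattern value is never lowercased for you, so it has to be written in lowercase already.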

Based on the data you posted and the two patterns, this would make sense: Since the first pattern can never match, and the second one ("PS-21 slips") only occurs once in the data, you’ll only see one example. If you delete the example containing that span, you’ll get no matches at all.

That error message means that there are no examples in your stream that can be sent out for annotation. In your case, that’s because ner.match will look for pattern matches only, and if there are none, Prodigy will raise an error because the stream is empty and there’s nothing to annotate.
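The mechanism can be sketched roughly like this. It is a simplified model under stated assumptions, not Prodigy's code: `match_stream` and `first_batch_or_error` are hypothetical names, and a plain substring check stands in for the real pattern matching:

```python
def match_stream(texts, terms):
    """Yield only the texts that contain at least one term.
    Hypothetical stand-in for a stream that emits pattern matches only."""
    for text in texts:
        if any(term in text.lower() for term in terms):
            yield text


def first_batch_or_error(stream):
    """Mimic the validation step: fetch the first item or fail loudly."""
    try:
        return next(iter(stream))
    except StopIteration:
        raise ValueError("stream is empty: no first batch")


# The only line containing "PS-21 slips" has been deleted from the input.
texts = [
    "Installed RST toolstring into lubricator.",
    "Installed Slick joint and Diverter.",
]

try:
    first_batch_or_error(match_stream(texts, ["ps-21 slips"]))
    outcome = "got a task"
except ValueError as err:
    outcome = str(err)

print(outcome)
```

As in the traceback, the `StopIteration` from `next(iter(...))` is what gets turned into the `ValueError`.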

If you want to use ner.match, I’d suggest writing more patterns and trying out different variations and maybe more general token descriptions, for example using the SHAPE or IS_UPPER attribute. You know the data and domain best, so maybe you can find attributes that commonly indicate that tokens might be something you’re interested in. For example, maybe you’ll find that most ACTION entities are single tokens consisting of only capital letters, so your pattern could be [{"IS_UPPER": true}]. It’s fine if your patterns produce false positives, since you’ll be accepting / rejecting the suggestions anyway.
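As a small sketch of what such entries could look like in a patterns file, the snippet below builds two general patterns and serializes them as JSONL. The two patterns are illustrative assumptions (especially the "XXX" shape, which is just a guess at what a three-letter all-caps abbreviation like RIH looks like), and the label is taken from your example file:

```python
import json

# Hypothetical, more general patterns for the "Action" label.
general_patterns = [
    # Any token written entirely in capital letters.
    {"label": "Action", "pattern": [{"IS_UPPER": True}]},
    # Any token whose shape is three capital letters (assumed example).
    {"label": "Action", "pattern": [{"SHAPE": "XXX"}]},
]

# One JSON object per line; json.dumps writes Python's True as JSON true.
jsonl_lines = [json.dumps(p) for p in general_patterns]
print("\n".join(jsonl_lines))
```

Generating the file with `json.dumps` also avoids hand-typed JSON mistakes such as a misspelled `true` or curly quotes.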

Yes. I had also tried “RIH” as a lowercase entry, as well as the {"upper": "RIH"} pattern; basically all the combinations :slight_smile: I have more patterns and will try them out too.

What I am basically trying to do is a case-insensitive match for a list of terms. I am using spaCy’s PhraseMatcher directly after tokenizing. I have a list of 3000 terms. I created upper-case, lower-case, title-case and capitalized versions of all the terms and ran the phrase match on those. But terms like “VamTop” still do not match, because they have a capital letter in the middle of the word.

I tried to find out whether Prodigy can do a case-insensitive phrase match. I have a list of 3000 phrases to match case-insensitively (I created the list of patterns too); the above is just a couple of examples. Can I do a ‘real’ case-insensitive match without creating every case variant of each word?

upper won’t work, because that’s not an existing token attribute. But "lower": "rih" should match as expected. You can also test this in the interactive demo.

You could do that by using the lower token attribute with the lowercased form of each word and the regular rule-based Matcher. The only thing to keep in mind is that one dictionary represents one token, so if your terminology list includes multi-token phrases, you should ideally use spaCy to tokenize them to make sure they’ll match. For example:

import spacy

nlp = spacy.load('en_core_web_sm')

TERMINOLOGY_LIST = ['term', 'another term']  # etc.
patterns = []

# Tokenize each term with spaCy so a multi-token phrase gets one dict per token
for doc in nlp.pipe(TERMINOLOGY_LIST):
    tokens = [token.lower_ for token in doc]
    pattern = [{'LOWER': token} for token in tokens]
    patterns.append(pattern)
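To get from token lists to the lines of a patterns file, something along these lines might work. It is only a sketch: `to_pattern_entry` is a hypothetical helper, and the token lists are hard-coded here so it runs without a model, whereas in practice they would come from spaCy's tokenizer as above:

```python
import json


def to_pattern_entry(label, tokens):
    """Build one pattern-file entry with one {"lower": ...} dict per token."""
    return {"label": label, "pattern": [{"lower": tok.lower()} for tok in tokens]}


# (label, tokens) pairs; the tokenization shown is an assumption.
labelled_terms = [
    ("Action", ["RIH"]),
    ("Equipment", ["PS-21", "slips"]),
]

entries = [to_pattern_entry(label, tokens) for label, tokens in labelled_terms]

# One JSON object per line, ready to be saved as a .jsonl file.
jsonl_text = "\n".join(json.dumps(entry) for entry in entries)
print(jsonl_text)
```

Lowercasing the value inside the helper means the original casing in the terminology list (e.g. “VamTop”) no longer matters.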

Yah, if you look at my previous example: this is what my multi word token looks like:
I hope this is ok?

Yes, that’s perfect! :+1:

Thank you very much.
Your responses are real quick and very helpful. I appreciate it :slight_smile:

Yey! This works :woman_genie: . Thank you @ines.
I have 2 questions following this:

  1. Is there a way to write these matches to a JSONL file without going through the interface? (I do not see an “--output” argument.)

  2. I would like to remove overlapping spans and keep only the longest one. For example, in the interface I get “PS-21” suggested once, “PS-21 slips” once and “slips” once. I just want to keep “PS-21 slips”.

With spaCy matching, I checked the output JSONL file and removed the overlapping spans by writing a function. Is there something that will consider only the longest match?
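For illustration, a function of that kind could look roughly like the sketch below. `keep_longest_spans` is a hypothetical helper, and it assumes each span is a dict with "start" and "end" character offsets (end exclusive):

```python
def keep_longest_spans(spans):
    """Drop every span that overlaps a longer span that was already kept."""
    # Longest first; ties are broken by the earliest start offset.
    ordered = sorted(spans, key=lambda s: (-(s["end"] - s["start"]), s["start"]))
    kept = []
    for span in ordered:
        overlaps = any(
            span["start"] < k["end"] and k["start"] < span["end"] for k in kept
        )
        if not overlaps:
            kept.append(span)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s["start"])


text = "Installed PS-21 slips."
spans = [
    {"start": 10, "end": 15, "label": "Equipment"},  # "PS-21"
    {"start": 10, "end": 21, "label": "Equipment"},  # "PS-21 slips"
    {"start": 16, "end": 21, "label": "Equipment"},  # "slips"
]

longest = keep_longest_spans(spans)
print([text[s["start"]:s["end"]] for s in longest])
```

Depending on the spaCy version, `spacy.util.filter_spans` may also be available; it does a similar longest-first filtering on `Span` objects.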

Glad it's working! I'm moving this to the next thread; see the answer here:
