Compatibility of versions

Hi @ines,
I am doing preliminary research into methodologies for an experiment we are running, and since we want to tag phrases as custom entities, it looks like the EntityRuler is the way to go. We JUST downloaded Prodigy and have begun digging in. One thing I noticed is that this build of Prodigy (1.5.1) comes with spaCy 2.0.12. However, from the spaCy GitHub releases, I noticed that v2.1.0 is still in prerelease (a1). Is that prerelease fully compatible (minus any unknown bugs/features) with Prodigy? Or, to keep things moving in the meantime, would you suggest just using the new pipeline component directly through spaCy?

Also, is there a documentation page somewhere for EntityRuler? I have not been able to find one.

I am trying to consolidate the methodology we will need to use, so I will be asking some follow-up questions about training on a gold standard corpus we have in the works using the EntityRuler (specifically involving your response here: NER or PhraseMatcher?).

As always, much appreciated!

Greg–

Hi,
I started digging into PhraseMatcher and have a question.

We are bringing in annotations for our project (annotating EMS motor vehicle crash reports) from another program (brat) into Prodigy. As a test, I converted all our entity annotations from a single report into seed terms, as in these examples:

{"label": "EMSRunNumber", "text": "AC71231"}
{"label": "Age", "text": "52 Years"}
{"label": "Gender", "text": "Male"}
{"label": "InsuranceStatus", "text": "UNKNOWN"}
{"label": "Subject", "text": "PT"}
{"label": "DriverPassengerStatus", "text": "DRIVING"}
{"label": "VehicleSpeed", "text": "HWY SPEEDS"}
{"label": "Negation", "text": "NOT"}
{"label": "SeatbeltPresence", "text": "BELTED"}
{"label": "OtherSevere", "text": "UNKNOWN WHAT HAPPENED BUT PT WENT INTO A YARD ENDED UP ELEVATED ON DEBRIB"}
{"label": "Rollover", "text": "DRIVERSIDE TOWARDS GROUND ON ITS SIDE AGAINST A TREE"}
{"label": "SeverityIntrusion", "text": "MAJOR DAMAGE TO VEHICLE"}
{"label": "LocIntrusion", "text": "ESPECIALLY DRIVERS SIDE"}
{"label": "SeverityIntrusion", "text": "LARGE AMOUNT"}
{"label": "LocIntrusion", "text": "COMPARTMENT INTRUSION ON DRIVERS SIDE"}
{"label": "Negation", "text": "NO"}
{"label": "AirbagPresence", "text": "AIRBAG DEPLOYMENT"}
{"label": "ProvidersScene", "text": "EMS ON SCENE"}

As you can see, many of these texts span multiple words, so I followed this thread: train-a-new-ner-entity-with-multi-word-tokens. As @ines suggested, I read these in using db-in and then used terms.to-patterns to write the data out to a JSONL file, which looks like:

{"label":null,"pattern":[{"lower":"AC71231"}]}
{"label":null,"pattern":[{"lower":"52 Years"}]}
{"label":null,"pattern":[{"lower":"Male"}]}
{"label":null,"pattern":[{"lower":"UNKNOWN"}]}
{"label":null,"pattern":[{"lower":"PT"}]}
{"label":null,"pattern":[{"lower":"DRIVING"}]}
{"label":null,"pattern":[{"lower":"HWY SPEEDS"}]}
{"label":null,"pattern":[{"lower":"NOT"}]}
{"label":null,"pattern":[{"lower":"BELTED"}]}
{"label":null,"pattern":[{"lower":"UNKNOWN WHAT HAPPENED BUT PT WENT INTO A YARD ENDED UP ELEVATED ON DEBRIB"}]}
{"label":null,"pattern":[{"lower":"DRIVERSIDE TOWARDS GROUND ON ITS SIDE AGAINST A TREE"}]}
{"label":null,"pattern":[{"lower":"MAJOR DAMAGE TO VEHICLE"}]}
{"label":null,"pattern":[{"lower":"ESPECIALLY DRIVERS SIDE"}]}
{"label":null,"pattern":[{"lower":"LARGE AMOUNT"}]}
{"label":null,"pattern":[{"lower":"COMPARTMENT INTRUSION ON DRIVERS SIDE"}]}
{"label":null,"pattern":[{"lower":"NO"}]}
{"label":null,"pattern":[{"lower":"AIRBAG DEPLOYMENT"}]}
{"label":null,"pattern":[{"lower":"EMS ON SCENE"}]}

I understand that these won't have labels, since I did not specify the --label switch when using terms.to-patterns.

I guess my question is, when I have multiple labels like this, is there a hack I can do to just pull the label from the database? The labels are there, as per the output from db-out:

{"label":"EMSRunNumber","text":"AC71231","_input_hash":728392859,"_task_hash":2944968,"answer":"accept"}
{"label":"Age","text":"52 Years","_input_hash":286403082,"_task_hash":-1910933207,"answer":"accept"}
{"label":"Gender","text":"Male","_input_hash":1541676315,"_task_hash":540860021,"answer":"accept"}
{"label":"InsuranceStatus","text":"UNKNOWN","_input_hash":1398767958,"_task_hash":-2052141552,"answer":"accept"}

So wanting terms.to-patterns to carry over multiple labels from the data does not seem like such an edge case.

Is the above one of the use cases that EntityRuler will cover?

I have a slight time crunch to get this part of my experiment done by mid-October (specifically, extracting patterns from a hundred or so annotated reports, then refining and testing them in Prodigy/spaCy; I'll then compare the results to those from several other NLP engines), so I am looking for the easiest route to get results.

So far, Prodigy has been fairly straightforward to use, but if circumventing this by scripting out my own pattern files and then using them in spaCy 2.1.x to take advantage of the EntityRuler would yield quicker results, then I will certainly do that.

Thank you for your input!

Greg--

In general, we make sure that Prodigy is always compatible with stable spaCy versions. You can obviously try using the v2.1.0 alpha with Prodigy, but I'd only recommend that for experimental purposes. (Also, remember that spacy-nightly versions usually require new models.)

But for your use case, I'm not even sure you need to use Prodigy with the alpha version of spaCy? You can still collect your annotations with the current stable version, and then use the match patterns or data to train

Patterns like {"label":null,"pattern":[{"lower":"HWY SPEEDS"}]} are problematic because, as I've explained in the thread you linked, they will never match. That pattern looks for one token whose lowercase form matches "HWY SPEEDS". This will never be the case, since the string will be split into two tokens: ['HWY', 'SPEEDS'].

Instead, your patterns can either reflect the tokenization, or you can write exact string match patterns instead:

{"label":null,"pattern":"HWY SPEEDS"}
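For the other option, patterns that reflect the tokenization, here is a minimal sketch in plain Python. It assumes that splitting on whitespace approximates how spaCy tokenizes simple phrases like these, which won't hold once punctuation or contractions are involved. The helper name `phrase_to_token_pattern` is made up for illustration. The values are lowercased because the `lower` attribute is compared against the lowercase form of each token.

```python
import json


def phrase_to_token_pattern(label, text):
    """Build a token-based match pattern with one dict per word.

    Assumption: whitespace splitting stands in for spaCy's tokenizer,
    which is only a rough approximation for simple phrases.
    """
    # "lower" is matched against the lowercased token text,
    # so the pattern value itself has to be lowercase.
    tokens = [{"lower": word.lower()} for word in text.split()]
    return {"label": label, "pattern": tokens}


pattern = phrase_to_token_pattern("VehicleSpeed", "HWY SPEEDS")
print(json.dumps(pattern))
# {"label": "VehicleSpeed", "pattern": [{"lower": "hwy"}, {"lower": "speeds"}]}
```

A pattern built this way also matches "Hwy Speeds" or "hwy speeds", whereas the exact string pattern only matches the text as written.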

For your use case, it sounds like you probably just want to write your own converter script that takes your annotations and outputs the patterns. Basically, something similar to the script I describe at the bottom of this post. This will also let you incorporate the patterns automatically. If you look at the source of terms.to-patterns, you'll see that it doesn't really do anything magical at all; it's just a convenience helper function. All you want to do here is take one data format and convert it to a different one. How you do this is up to you. (You don't even have to use Python if there's a different language you prefer!)
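As a rough idea of what such a converter could look like, here is a sketch that reads records in the db-out format shown earlier in the thread and writes exact string match patterns that keep each record's label. The function name `annotations_to_patterns`, the choice to skip anything not marked "accept", and the de-duplication step are all assumptions made for illustration, not something Prodigy does for you.

```python
import json


def annotations_to_patterns(lines):
    """Convert db-out style JSONL lines into labelled string patterns."""
    patterns = []
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Assumption: only accepted annotations should become patterns.
        if record.get("answer", "accept") != "accept":
            continue
        key = (record["label"], record["text"])
        if key in seen:
            continue  # the same term can occur in many reports
        seen.add(key)
        patterns.append({"label": record["label"], "pattern": record["text"]})
    return patterns


# Records copied from the db-out output above; the first one is repeated
# on purpose to show the de-duplication.
sample = [
    '{"label":"EMSRunNumber","text":"AC71231","_input_hash":728392859,"_task_hash":2944968,"answer":"accept"}',
    '{"label":"Age","text":"52 Years","_input_hash":286403082,"_task_hash":-1910933207,"answer":"accept"}',
    '{"label":"Gender","text":"Male","_input_hash":1541676315,"_task_hash":540860021,"answer":"accept"}',
    '{"label":"EMSRunNumber","text":"AC71231","_input_hash":728392859,"_task_hash":2944968,"answer":"accept"}',
]

patterns = annotations_to_patterns(sample)
for entry in patterns:
    print(json.dumps(entry))
```

In practice you would read the lines from the exported file and write each printed line to a patterns.jsonl file instead of the console.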

Just to make sure I understand your use case correctly: Do you want to just find exact string matches in your text and label them, or also train a model to generalise based on those strings and find similar occurrences in context?

Hi @ines,
Thanks for the detailed answer and confirming what I suspected. I’ll work on the converter and move on directly with spaCy.

Would the patterns for EntityRuler be similar to those discussed here: https://github.com/explosion/spaCy/issues/1971 or https://spacy.io/usage/linguistic-features#adding-patterns-attributes, but with a label for the entity name?

Right now, we are only looking for exact string matches, plus some fuzzier matches using more complex patterns, but at some point we will definitely want to train a model, along the lines of your response in this thread: NER or PhraseMatcher?.

Since this is evolving into a spaCy usage thread, I will post relevant questions to StackOverflow.

And Python is my language of choice… so all is good! :grinning:

Cheers!

Now that I understand how tokens and dictionaries are used in pattern matching, I’m trying more complex match patterns, like {"label": "Age", "pattern":[{"IS_DIGIT":true}, {"LOWER": "years"}]}, but I'm not getting any results, even though the text 52 Years is in my document. {"IS_DIGIT":true} and {"IS_DIGIT":true, "IS_SHAPE":"dd"} both work as expected.

Using ner.match I tried the pattern {"LOWER":"years"} and got no results.

I am not sure why this is not working. And oddly, an exact match, à la {"label": "Age", "pattern": "Years"}, does not work either.

Never mind! It’s all good. In my document there was a pipe symbol, |, after 52 Years and no space. Ha! :rofl:
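For anyone who runs into the same thing, here is a rough sketch of why the missing space mattered. It uses whitespace splitting as a stand-in for the real tokenizer, which is an assumption (spaCy's tokenizer has its own punctuation rules), and the helper name `has_lower_token` and the text after the pipe are made up for illustration.

```python
def has_lower_token(text, target):
    """Return True if any whitespace-separated token lowercases to target."""
    # Assumption: whitespace tokens approximate the tokens a matcher sees.
    return any(token.lower() == target for token in text.split())


without_space = has_lower_token("52 Years|Male", "years")
with_space = has_lower_token("52 Years | Male", "years")

print(without_space)  # False: the token is "Years|Male", not "Years"
print(with_space)     # True: "Years" stands alone as a token
```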


Ha, no worries – that’s good to know though! I was really puzzled by this, because it’s one of those issues that’s not impossible, but really really unlikely.