Hi,
I started digging into PhraseMatcher and have a question.
We are bringing annotations for our project (annotating EMS motor vehicle crash reports) from another program (brat) into Prodigy. As a test, I converted all the entity annotations from a single report into seed terms, as in these examples:
{"label": "EMSRunNumber", "text": "AC71231"}
{"label": "Age", "text": "52 Years"}
{"label": "Gender", "text": "Male"}
{"label": "InsuranceStatus", "text": "UNKNOWN"}
{"label": "Subject", "text": "PT"}
{"label": "DriverPassengerStatus", "text": "DRIVING"}
{"label": "VehicleSpeed", "text": "HWY SPEEDS"}
{"label": "Negation", "text": "NOT"}
{"label": "SeatbeltPresence", "text": "BELTED"}
{"label": "OtherSevere", "text": "UNKNOWN WHAT HAPPENED BUT PT WENT INTO A YARD ENDED UP ELEVATED ON DEBRIB"}
{"label": "Rollover", "text": "DRIVERSIDE TOWARDS GROUND ON ITS SIDE AGAINST A TREE"}
{"label": "SeverityIntrusion", "text": "MAJOR DAMAGE TO VEHICLE"}
{"label": "LocIntrusion", "text": "ESPECIALLY DRIVERS SIDE"}
{"label": "SeverityIntrusion", "text": "LARGE AMOUNT"}
{"label": "LocIntrusion", "text": "COMPARTMENT INTRUSION ON DRIVERS SIDE"}
{"label": "Negation", "text": "NO"}
{"label": "AirbagPresence", "text": "AIRBAG DEPLOYMENT"}
{"label": "ProvidersScene", "text": "EMS ON SCENE"}
As you can see, many of these texts are multi-word phrases, so I followed the thread train-a-new-ner-entity-with-multi-word-tokens. As @ines suggested, I read these in using db-in and then, using terms.to-patterns, wrote the data out to a JSONL file, which looks like:
{"label":null,"pattern":[{"lower":"AC71231"}]}
{"label":null,"pattern":[{"lower":"52 Years"}]}
{"label":null,"pattern":[{"lower":"Male"}]}
{"label":null,"pattern":[{"lower":"UNKNOWN"}]}
{"label":null,"pattern":[{"lower":"PT"}]}
{"label":null,"pattern":[{"lower":"DRIVING"}]}
{"label":null,"pattern":[{"lower":"HWY SPEEDS"}]}
{"label":null,"pattern":[{"lower":"NOT"}]}
{"label":null,"pattern":[{"lower":"BELTED"}]}
{"label":null,"pattern":[{"lower":"UNKNOWN WHAT HAPPENED BUT PT WENT INTO A YARD ENDED UP ELEVATED ON DEBRIB"}]}
{"label":null,"pattern":[{"lower":"DRIVERSIDE TOWARDS GROUND ON ITS SIDE AGAINST A TREE"}]}
{"label":null,"pattern":[{"lower":"MAJOR DAMAGE TO VEHICLE"}]}
{"label":null,"pattern":[{"lower":"ESPECIALLY DRIVERS SIDE"}]}
{"label":null,"pattern":[{"lower":"LARGE AMOUNT"}]}
{"label":null,"pattern":[{"lower":"COMPARTMENT INTRUSION ON DRIVERS SIDE"}]}
{"label":null,"pattern":[{"lower":"NO"}]}
{"label":null,"pattern":[{"lower":"AIRBAG DEPLOYMENT"}]}
{"label":null,"pattern":[{"lower":"EMS ON SCENE"}]}
I understand that these won't have labels, since I did not specify the --label switch when using terms.to-patterns.
I guess my question is: when I have multiple labels like this, is there a hack I can do to just pull the label from the database? The labels are there, as per the output from db-out:
{"label":"EMSRunNumber","text":"AC71231","_input_hash":728392859,"_task_hash":2944968,"answer":"accept"}
{"label":"Age","text":"52 Years","_input_hash":286403082,"_task_hash":-1910933207,"answer":"accept"}
{"label":"Gender","text":"Male","_input_hash":1541676315,"_task_hash":540860021,"answer":"accept"}
{"label":"InsuranceStatus","text":"UNKNOWN","_input_hash":1398767958,"_task_hash":-2052141552,"answer":"accept"}
So, wanting multiple labels extracted from the data by terms.to-patterns does not seem to be such an edge case.
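In case it helps illustrate what I'm after, here is a minimal sketch of the kind of script I imagine, assuming db-out records with label/text fields as shown above. The function name is mine, and I'm using a naive whitespace split purely for illustration; for exact Matcher behavior the text would presumably need to be split with the spaCy pipeline's own tokenizer instead:

```python
import json

def annotation_to_pattern(line):
    """Turn one db-out JSONL record into a labeled token-match pattern.

    Naive whitespace split stands in for real tokenization; multi-word
    texts become one {"lower": ...} dict per token.
    """
    record = json.loads(line)
    return {
        "label": record["label"],
        "pattern": [{"lower": tok.lower()} for tok in record["text"].split()],
    }

# Example: one accepted annotation as exported by db-out
line = '{"label": "VehicleSpeed", "text": "HWY SPEEDS", "answer": "accept"}'
print(json.dumps(annotation_to_pattern(line)))
# {"label": "VehicleSpeed", "pattern": [{"lower": "hwy"}, {"lower": "speeds"}]}
```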
Is the above one of the use cases that EntityRuler will cover?
I have a slight time crunch to get this part of my experiment done by mid-October (specifically, extracting patterns from a hundred or so annotated reports, then refining and testing them in Prodigy/spaCy; I'll then compare the results to those from several other NLP engines), so I am looking for the easiest route to get results.
So far, Prodigy has been fairly straightforward to use, but if circumventing this by scripting out my own pattern files and then using them in spaCy 2.1.x to take advantage of the EntityRuler would yield quicker results, then I will certainly do that.
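For reference, my rough understanding of what the EntityRuler route would look like, once I have labeled, per-token patterns. The patterns and sentence here are made-up examples from my data above, and the version check is just to cover both the 2.x and 3.x add_pipe APIs, so treat this as a sketch rather than tested code:

```python
import spacy

nlp = spacy.blank("en")

# The add_pipe API changed between spaCy 2.x and 3.x; handle both.
if spacy.__version__.startswith("2."):
    from spacy.pipeline import EntityRuler
    ruler = EntityRuler(nlp)
    nlp.add_pipe(ruler)
else:
    ruler = nlp.add_pipe("entity_ruler")

# Labeled patterns, one {"lower": ...} dict per token
ruler.add_patterns([
    {"label": "VehicleSpeed", "pattern": [{"lower": "hwy"}, {"lower": "speeds"}]},
    {"label": "SeatbeltPresence", "pattern": [{"lower": "belted"}]},
])

doc = nlp("PT WAS BELTED AND TRAVELING AT HWY SPEEDS")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('BELTED', 'SeatbeltPresence'), ('HWY SPEEDS', 'VehicleSpeed')]
```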
Thank you for your input!
Greg--