manually highlighting using NER manual

Hello. I’m new to Prodigy (and NER) and relatively new to Python as well. I’m mostly an R programmer, so don’t hold that against me!

In short, I’m reading in some safety-related text descriptions. The tab-delimited input file contains a column labeled “text” that holds the recorded safety event descriptions. From this, I want to manually highlight terms associated with the entity BUGBITE. I can’t find a lot of information on this so this is what I did…

  1. create a SQLite table called “my_table”
  2. prodigy ner.manual my_table en_core_web_lg bug_testing.txt --label BUGBITE
  3. prodigy terms.to-patterns my_table my_patterns.jsonl --label BUGBITE

I look at the JSON file (excerpt below) and it does not look how I would expect… Each entity is associated with the entire record, not the individual manually highlighted terms. I assume it is user error on my part, but I can’t tell what I did wrong.

{“label”:“BUGBITE”,“pattern”:[{“lower”:“While heating up bolts on GT9 in preparation to break loose with a Hytorc, the employee noticed irritation on his right knee. The next day the employee reported the potential of receiving a spider or insect bite the previous night.”}]}
{“label”:“BUGBITE”,“pattern”:[{“lower”:“2011-04-05T00:00:00Z\t"While performing a task in the Lube Oil Shed. Technician needed a hand tool from his tool bag. While reaching into his tool bag, he noticed a large black widow nesting in the tools. The tool bag was used the day before and the insect was not there. The Black Widow had entered the tool bag with in a 14 hour span. Not sure if spider came from laying the tool bag on the lube shed floor for a short time or the locker room, were the tool bag is daily stored.”"}]}

Note that when I run ‘’ then I get the below, which is what I would expect from the annotating process (at least based on your tutorials / videos):

{“label”:“BUGBITE”,“pattern”:[{“lower”:“bite”}]}
{“label”:“BUGBITE”,“pattern”:[{“lower”:“bug”}]}
{“label”:“BUGBITE”,“pattern”:[{“lower”:“insect”}]}

Needless to say, nothing seems to work right after these steps, probably due to the incorrect formatting of the JSON patterns…

Any help is much appreciated!

Thanks,
-rich

Hi! Don’t worry, we know we’ve introduced quite a few new concepts with Prodigy, so sorry if this has been confusing.

It looks like the problem here is that terms.to-patterns was designed to convert a terms dataset to patterns – so basically, a dataset of single words created with a recipe like terms.teach (where you use word vectors to find similar terms). So all it really does is take the "text" of each example and use that as the pattern value. Since your dataset was created with ner.manual, the "text" is the whole record, which is why each pattern contains the entire description.

There’s no built-in recipe for converting highlighted spans to patterns, but you can write your own script that takes the exported dataset and creates patterns in the format you need. You can totally do this in R btw – in fact, that type of conversion might actually be easier to do in R than it is in Python. But I’m too bad at R, so I can only give you a Python example :stuck_out_tongue_winking_eye:

The following command will export your dataset as a JSONL file:

prodigy db-out my_table > your_data.jsonl

Here’s a simple version that assumes your highlighted terms only consist of single words:

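import json

# Load the examples exported with db-out (one JSON object per line)
with open('your_data.jsonl', encoding='utf8') as f:
    examples = [json.loads(line) for line in f if line.strip()]

patterns = []

for eg in examples:
    spans = eg.get('spans', [])  # the annotated entities (empty if none)
    text = eg['text']  # the original text
    for span in spans:
        start = span['start']  # start of the highlighted span
        end = span['end']  # end of the highlighted span
        span_text = text[start:end]  # span text
        # 'lower' is compared against the lowercase token text, so lowercase the value
        pattern = {'label': 'BUGBITE', 'pattern': [{'lower': span_text.lower()}]}
        patterns.append(pattern)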
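
To get a patterns file out of this, the list still has to be written to disk. Here is a minimal sketch, assuming you want one JSON object per line (the same layout as the terms.to-patterns output shown in the question). The helper name `write_patterns`, the sample list and the file name `my_patterns.jsonl` are only placeholders for illustration:

```python
import json

# Hypothetical sample; in practice use the `patterns` list built above.
patterns = [
    {'label': 'BUGBITE', 'pattern': [{'lower': 'bite'}]},
    {'label': 'BUGBITE', 'pattern': [{'lower': 'insect'}]},
]

def write_patterns(patterns, path):
    """Write one JSON object per line (JSONL), skipping exact duplicates."""
    seen = set()
    with open(path, 'w', encoding='utf8') as f:
        for pattern in patterns:
            line = json.dumps(pattern, sort_keys=True)
            if line not in seen:  # the same term is often highlighted many times
                seen.add(line)
                f.write(line + '\n')

write_patterns(patterns, 'my_patterns.jsonl')
```

Skipping duplicates is optional, but it keeps the file short when the same word was highlighted in many records.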

If your highlighted spans consist of more than one word, you also need to convert them accordingly, since every dictionary in a pattern describes one token. For example, the result for “cute bug” should look like this:

{"label": "BUGBITE", "pattern": [{"lower": "cute"}, {"lower": "bug"}]}

In most cases, splitting on whitespace should be fine – for example, "cute bug".split(' ') will give you ['cute', 'bug'], which you can then convert to the pattern above. However, for more complex cases, you might want to use spaCy to tokenize the text for you, so you know that your pattern will definitely match.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("cute-bug!")
token_texts = [token.text for token in doc]
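
Putting the multi-word case together, here is a rough sketch of a helper that turns a highlighted span into a pattern with one dictionary per token. It assumes whitespace splitting is good enough for your data, and the name `span_to_pattern` is made up for this example:

```python
def span_to_pattern(span_text, label):
    """Build a pattern with one {'lower': ...} dict per whitespace-separated word."""
    words = span_text.split()  # splits on any run of whitespace
    return {'label': label, 'pattern': [{'lower': word.lower()} for word in words]}

print(span_to_pattern('Cute bug', 'BUGBITE'))
# {'label': 'BUGBITE', 'pattern': [{'lower': 'cute'}, {'lower': 'bug'}]}
```

For the more complex cases, you could swap `span_text.split()` for the spaCy version, `[token.text for token in nlp(span_text)]`, so that the pattern tokens line up with spaCy's tokenization.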

Ines. I appreciate the timely response. I’ll definitely give your recommendation a go!!! The difficulty is compounded by my suckiness in python…but I’m getting there.

-rich