Patterns with dependency annotation

Hello,

I am using Prodigy to train a custom NER pipeline, and am having difficulty using the EntityRuler to write a pattern to pre-annotate product information in a dataset stored as a JSONL file like the below:

Product Description: Light Blue Jeans
Brand: Blue Lion
Size: 32x34
Color: Stonewash

Terminal Command:

prodigy ner.manual desc_data blank:en ./productdescription.jsonl --highlight-chars --label ./labels.txt --patterns ./patterns.jsonl

patterns.jsonl (This is the file that needs to be corrected):

{"label":"COLOR","pattern":[{"lower":"Color"},{"ORTH":":"}]}

Is there a way to write a pattern in this workflow that identifies the information following each product attribute (Brand, Size, Color, etc.), based on the identifier that precedes it?

For example, I would like to tag "Stonewash" as a "COLOR" entity (and not include the "Color:" text that precedes it). I believe this can be accomplished using dependencies, but I was unsure how to implement that in this workflow.

Thank you!

Hi there!

One comment before diving deeper

There may be a typo in your pattern.

{"label":"COLOR","pattern":[{"lower":"Color"},{"ORTH":":"}]}

Notice that part with lower in it? It's saying that the lowercased text must be equal to Color. Note that capital C there. Lowercased text can never contain an uppercase character, so this pattern will never match. You probably meant this pattern:

{"label":"COLOR","pattern":[{"lower":"color"},{"ORTH":":"}]}

On to the issue

I'd like to know a bit more about your problem here. Does the data follow a specific structure all of the time? If so, you might not need the pattern matcher and you can just use string matching techniques directly. If you read examples like this line by line:

Product Description: Light Blue Jeans
Brand: Blue Lion
Size: 32x34
Color: Stonewash

then a regex might just suffice.
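As a minimal sketch of that line-by-line string matching (assuming the attributes always appear as "<Attribute>: <value>" on their own lines; the file name here is hypothetical):

import re

# One capture group for the attribute name, one for its value
line_pattern = re.compile(r"^(Brand|Size|Color):\s*(.+)$")

with open("productdescriptions.txt") as f:
    for line in f:
        match = line_pattern.match(line.strip())
        if match:
            attribute, value = match.groups()
            print(attribute.upper(), "->", value)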

Another avenue to consider is that you can also write patterns by hand that ought to work. The size seems pretty well structured, so that could be matched via a regex, something like:

{"label":"size", "pattern": [{"TEXT": {"REGEX": "((\d)*x(\d)*)\w+"}]}

This should catch the pattern of <DIGITS>x<DIGITS>. The other patterns don't seem like great candidates for a regex, but instead feel like they could be enumerated. That feels like a great use-case for the parse library. You can have it parse the substrings of interest quite easily. Here's an example:

text = """
Product Description: Light Blue Jeans
Color: Red
Brand: Blue Lion
Size: 32x34
Color: Stonewash
Color: Yellow
"""

from parse import compile

# Define what we are interested in parsing
p = compile("Color: {color}\n")

# Define a generator that can generate texts. 
colors = ({"text": r.named['color']} for r in p.findall(text))

# These are the results
list(colors) # [{'text': 'Red'}, {'text': 'Stonewash'}, {'text': 'Yellow'}]

The idea here is that you could run such a script over all the files that you have to get a list of known colors, and then use those in the pattern matcher. You could use the same trick for "Product Description" and "Brand" too.
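As a sketch of that idea (the color list here is a hypothetical stand-in for whatever the parse script collects), you could turn the extracted strings into a Prodigy patterns file with srsly:

import srsly

# Hypothetical output of running the parse script over all files
known_colors = ["Red", "Stonewash", "Yellow", "Navy Blue"]

# One case-insensitive token per word, so multi-word colors work too
patterns = [
    {"label": "COLOR", "pattern": [{"LOWER": word.lower()} for word in color.split()]}
    for color in known_colors
]
srsly.write_jsonl("patterns.jsonl", patterns)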

This is just a suggestion though, based on a first impression of this ticket. It seems like a reasonable approach because this way you'll end up with a set of patterns that can also detect the entities even if they don't appear in such a structured format. If there are worthwhile details that I am glossing over, I'll gladly hear them, so feel free to follow up :slightly_smiling_face:!

Thanks. I had a similar thought with the regex approach, and I think a regex would be the most robust option for handling data and text that the matcher has not seen. The parse library method and creating a dictionary of patterns may not be able to handle unknown examples, especially for attributes like Brand, where new brands could be added, for example.

With the regex approach, I think that is on the right track, but there may be some issues that need to be addressed. My understanding is that spaCy will search for the patterns on a token-by-token basis, so the \w+ component will not return anything (since spaCy does not see the next token when searching). Would there be a way to customize the solution to address this? Ideally, the solution would search for a limited universe of tokens (e.g. "Color:", "color:", etc.) and then return the token immediately to the right of it. I am finding this difficult to accomplish using just patterns.

Thanks again for the assistance.

Ideally, the solution searches for a more limited universe of tokens (e.g. Color: , color:, etc), and then returns the token immediately to the right of it.

Are there colors that require two tokens? I can imagine "Navy Blue", which would require you to grab tokens until you spot a newline. Maybe this is less of a concern for colors, but I can certainly see this being an issue for the brand.

That said, with the parse approach you are able to use more than one search string. You might do something like:

results = []
searches = [
    "Color: {color}\n",
    "color: {color}\n"
]

# Use more than one search pattern over all texts
for search in searches:
    p = compile(search)
    for text in texts:
        for r in p.findall(text):
            results.append(r)
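From there, a small follow-up step (a sketch building on the results list above) can reduce everything to a set of unique color strings:

# Deduplicate the captured color strings
unique_colors = sorted({r.named["color"] for r in results})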

I am finding this difficult to accomplish using just patterns.

Patterns allow you to define which tokens to include, but defining tokens that are required for a match yet excluded from the matched span isn't supported. So alternatively, you could also consider using some custom spaCy code instead.

I've taken the liberty of writing a small demo of a custom component that uses patterns together with some custom code to select the color.

import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span

@Language.factory("color_detector")
def create_color_detector(nlp, name):
    return ColorDetector(nlp.vocab)

class ColorDetector:
    def __init__(self, vocab):
        patterns = [
            [{"LOWER": "color"}, {"ORTH": ":"}, {"IS_ALPHA": True, "OP": "+"}],
        ]
        self.matcher = Matcher(vocab)
        # greedy="LONGEST" keeps only the longest of any overlapping matches
        self.matcher.add("COLOR", patterns, greedy="LONGEST")

    def __call__(self, doc):
        spans = []
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            # notice how I'm skipping over two tokens here?
            # that drops "color" and ":" from the entity span
            spans.append(
                Span(doc, start + 2, end, label="color")
            )
        doc.ents = tuple(spans)
        return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("color_detector", last=True)
doc = nlp("Color: blue")
doc.ents
# (blue,)

There's a big caveat with this approach though: you must feed the nlp pipeline individual lines instead of the whole text. That's because spaCy ignores newline \n characters. You can still use spaCy as a method to extract colors that can populate a list of strings to detect later, though.
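A minimal sketch of that line-by-line feeding, assuming the nlp pipeline defined above and the multi-line product text from earlier:

# Feed the pipeline one non-empty line at a time
lines = [line for line in text.splitlines() if line.strip()]
for doc in nlp.pipe(lines):
    for ent in doc.ents:
        print(ent.label_, ent.text)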

Some lateral thinking

Part of me wonders if there's perhaps an easier way for your use-case though.

  • For sizes you can use a regex.
  • The colors seem enumerable and known upfront. For colors, you can scrape this list from Wikipedia and try to use that as a starting set.
  • For brands, it also happens to be the case that we host a spaCy project about fashion brands here. It even comes with a patterns file!

Wouldn't these be appropriate starting points?
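To make those starting points concrete, a hypothetical patterns.jsonl combining them might look like the below (note the doubled backslashes that valid JSON requires inside the regex):

{"label": "SIZE", "pattern": [{"TEXT": {"REGEX": "^\\d+x\\d+$"}}]}
{"label": "COLOR", "pattern": [{"LOWER": "stonewash"}]}
{"label": "COLOR", "pattern": [{"LOWER": "navy"}, {"LOWER": "blue"}]}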

For the product descriptions I might recommend using the parsing trick, if only in the beginning. Once you have enough annotations you might be able to train your first model to pre-highlight the data. I can also imagine that the NER models might work quite well for sizes, colors and brands. But the product description might be better modelled with the spancat model. The overview on the Prodigy docs might help explain why.
