Ideally, the solution searches for a more limited universe of tokens (e.g. "Color:", "color:", etc.) and then returns the token immediately to the right of it.
Are there colors that require two tokens? I can imagine "Navy Blue", which would require you to grab tokens until you spot a newline. Maybe this is less of a concern for colors, but I can certainly see this being an issue for the brand.
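To make the "grab everything until the newline" idea concrete, here is a rough sketch using only Python's standard `re` module. The field label, the sample text and the helper name `find_colors` are all made up for illustration; treat it as a starting point rather than a finished solution.

```python
import re

# Matches "Color:" in any casing at the start of a line and captures
# everything up to the end of that line, so multi-word values survive.
COLOR_LINE = re.compile(
    r"^[ \t]*color[ \t]*:[ \t]*(?P<color>[^\n]+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_colors(text):
    """Return the value of every "Color: ..." line found in text."""
    return [m.group("color") for m in COLOR_LINE.finditer(text)]


# Hypothetical product text, one field per line.
sample = "Brand: Acme Co\nColor: Navy Blue\nSize: M\ncolor: red\n"
print(find_colors(sample))  # ['Navy Blue', 'red']
```

The same shape should carry over to brands by swapping the label, assuming each field sits on its own line.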
That said, with the parse approach you are able to use more than one search string. You might do something like:
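from parse import compile

# `texts` is assumed to be your list of raw product strings
results = []
searches = [
    "Color: {color}\n",
    "color: {color}\n",
]
# Use more than one search pattern over all texts
for search in searches:
    # parse matches case-insensitively by default, so ask for exact case
    # to keep the two patterns from returning the same hits twice
    p = compile(search, case_sensitive=True)
    for text in texts:
        for r in p.findall(text):
            # r["color"] holds the captured value
            results.append(r)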
I am finding this difficult to accomplish using just patterns.
Patterns let you define which tokens to include in a match. Defining a set of tokens that anchors a pattern but isn't part of the match itself isn't supported. So alternatively, you could consider using some custom spaCy code instead.
I've taken the liberty of writing a small demo of a custom component that uses patterns together with some custom code to select the color.
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans


@Language.factory("color_detector")
def create_color_detector(nlp, name):
    return ColorDetector(nlp.vocab)


class ColorDetector:
    def __init__(self, vocab):
        patterns = [
            # "color", ":", then one or more alphabetic tokens
            [{"LOWER": "color"}, {"ORTH": ":"}, {"IS_ALPHA": True, "OP": "+"}],
        ]
        self.matcher = Matcher(vocab)
        self.matcher.add("COLOR", patterns)

    def __call__(self, doc):
        spans = []
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            # notice how I'm skipping over two tokens here?
            spans.append(
                Span(doc, start + 2, end, label="color")
            )
        # keep only the longest of any overlapping spans; this replaces
        # whatever entities earlier components (e.g. the NER) had set
        doc.ents = tuple(filter_spans(spans))
        return doc


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("color_detector", last=True)
doc = nlp("Color: blue")
doc.ents
# (blue,)
There's a big caveat with this approach though: you must feed the nlp pipeline lines instead of the whole text. That's because spaCy ignores newline \n characters. You can still use spaCy as a method to extract colors that can populate a list of strings to detect later though.
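Feeding lines one at a time and collecting the results into a list could look roughly like the sketch below. The extractor is passed in as a plain callable so the sketch runs without spaCy; `collect_values` and `fake_extract` are hypothetical names, and the stand-in extractor is only there to make the example executable.

```python
def collect_values(text, extract):
    """Run extract on each non-empty line and gather unique results in order."""
    seen = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for value in extract(line):
            if value not in seen:
                seen.append(value)
    return seen


# Stand-in extractor so the sketch has no spaCy dependency. With the nlp
# pipeline you might instead pass something like:
#   lambda line: [ent.text for ent in nlp(line).ents]
def fake_extract(line):
    label, _, value = line.partition(":")
    value = value.strip()
    return [value] if label.strip().lower() == "color" and value else []


text = "Color: blue\nSize: M\n\nColor: Navy Blue\nColor: blue\n"
colors = collect_values(text, fake_extract)
print(colors)  # ['blue', 'Navy Blue']
```

The resulting list of strings is the sort of thing you could later reuse as a set of terms to detect.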
Some lateral thinking
Part of me wonders if there's perhaps an easier way for your use-case though.
- For sizes you can use a regex.
- The colors seem enumerable and known upfront. For colors, you can scrape this list from Wikipedia and try to use that as a starting set.
- For brands, it also happens to be the case that we host a spaCy project about fashion brands here. It even comes with a patterns file!
Wouldn't these be appropriate starting points?
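As a rough illustration of the first two starting points, here is a standard-library sketch. The size vocabulary and the short color list are placeholders I made up; a real color list would presumably be much longer.

```python
import re

# Hypothetical size vocabulary: letter sizes plus simple numeric sizes
# such as "42" or "9.5". Case-sensitive on purpose, so "size" is ignored.
SIZE = re.compile(r"\b(?:XXS|XS|S|M|L|XL|XXL|\d{1,2}(?:\.5)?)\b")

# Hypothetical starter list of known color names.
KNOWN_COLORS = ["navy blue", "blue", "red", "forest green"]

# Longest names first, so "navy blue" wins over plain "blue".
_alternatives = "|".join(
    re.escape(c) for c in sorted(KNOWN_COLORS, key=len, reverse=True)
)
COLOR = re.compile(r"\b(?:" + _alternatives + r")\b", re.IGNORECASE)


def find_sizes(text):
    """Return every size token found in text."""
    return SIZE.findall(text)


def find_known_colors(text):
    """Return every known color found in text, lowercased."""
    return [m.group(0).lower() for m in COLOR.finditer(text)]


description = "Navy Blue cotton shirt, size XL. Also available in red, size 42."
print(find_sizes(description))         # ['XL', '42']
print(find_known_colors(description))  # ['navy blue', 'red']
```

Matches like these could also be turned into match patterns for pre-highlighting, if that fits your workflow better.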
For the product descriptions I might recommend using the parsing trick, if only in the beginning. Once you have enough annotations you might be able to train your first model to pre-highlight the data. I can also imagine that the NER models might work quite well for sizes, colors and brands. But the product description might be better modelled with the spancat model. The overview on the Prodigy docs might help explain why.