What would be a good approach to train a NER model to recognize random strings

Hello there,

I am currently evaluating Prodigy (thanks, Walter!) and testing whether it can help me build a NER model that recognizes entities in my specific domain.

The sentences are on average 75 characters long and read more like a "product description" than a grammatically complete sentence with nouns, objects, adjectives, etc.

To start the experiment, I created a patterns file that contains both specific terms and wildcard rules that match measurements (e.g. 100 kg, 50 grams, etc.).

Example entries from the patterns file:

  • {"label": "COLOR", "pattern": "green"}
  • {"label": "COLOR", "pattern": "blue"}
  • ...
  • {"label": "KG", "pattern": [{"LIKE_NUM": true, "OP": "?"}, {"LOWER": "kg"}]}

Then I started the ner.teach recipe like this:

prodigy ner.teach prodigy_test7 en_core_web_md ./prodigy_test_products.jsonl --label COLOR,KG --patterns ./prodigy_test_patterns.jsonl

The question
I would also like to train the model on SKUs. These kinds of tokens are more or less random strings that do not follow any pattern (manufacturers define them however they see fit).

These tokens can be "anything" and they can be anywhere in a sentence.

Token examples

  • 204A-XR-888
  • 475 SSIG 06
  • USR8IBNU4-SE.333K
  • 063 890 4 210

I have thousands of SKUs available as a list, so I can create a patterns file. But will the model be able to generalize from this? It would be great to know whether this is realistic before I start annotating.

Just a few minutes ago, I read this thread https://support.prodi.gy/t/help-first-process-of-annotation/4156/2 where Ines writes:

When using NER, make sure that your entity types still follow the same conceptual idea of "named entities", otherwise your model might struggle to learn them efficiently. They don't have to be PERSON or ORG, but they should work in a similar way and describe distinct expressions like proper nouns with clear boundaries that can be determined from the local context. If that's not the case, a named entity recognition model might not be the right fit for what you're trying to do. Instead, you might want to experiment with a hybrid pipeline of more generic and classic NER labels + a text classification model.

In case this also applies to my NER problem, what does Ines's advice (a hybrid pipeline) mean in practice? I am not that deep into the topic yet, and her advice is still a bit abstract to me.

Thanks for all your help

Hi Jens.

My first response to seeing the SKUs was to consider a regex. The pattern files that Prodigy uses are based on the matcher patterns in spaCy and these totally allow for regex patterns as well! It feels much more pragmatic to use these kinds of pattern techniques than to hope that a NER model can take care of it.

Now about the "hybrid" approach: it seems like some of the entities that you're interested in can be picked up with an ML-based pipeline. In the case of the "color" label, I can imagine that word embeddings can totally help out there. But for your SKUs this may not be the case. I can't imagine that word embeddings trained on Wikipedia data will have learned anything that's relevant to detecting an SKU. That suggests that, in order to be pragmatic, it's fine to split up the problem. You can have a neural network for some of the labels and a pattern matcher for some of the other ones in a single spaCy pipeline.
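To make the split concrete, here is a minimal, library-agnostic sketch in plain Python. It assumes entities are (start, end, label) character spans; `SKU_RULE`, `rule_spans` and `merge_spans` are hypothetical names, and `model_spans` is a hard-coded stand-in for whatever a trained NER model would predict. In spaCy itself, the rule-based half would more likely be an entity ruler component added to the pipeline, so treat this only as an illustration of the idea.

```python
import re

# Hypothetical rule for one SKU shape: 3 to 5 space-separated digit groups.
SKU_RULE = re.compile(r"\b\d+(?: \d+){2,4}\b")


def rule_spans(text):
    """Return (start, end, label) spans found by the rule-based matcher."""
    return [(m.start(), m.end(), "SKU") for m in SKU_RULE.finditer(text)]


def merge_spans(model_spans, rule_based):
    """Combine both sources; a rule span wins over any model span it overlaps."""
    kept = list(rule_based)
    for start, end, label in model_spans:
        overlaps = any(
            start < r_end and r_start < end for r_start, r_end, _ in rule_based
        )
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


text = "Brake pad 063 890 4 210 green"

# Stand-in for the output of a trained NER model (hard-coded for illustration).
color_start = text.index("green")
model_spans = [(color_start, color_start + len("green"), "COLOR")]

merged = merge_spans(model_spans, rule_spans(text))
entities = [(text[start:end], label) for start, end, label in merged]
print(entities)
```

The design choice here is that rules take priority on overlap, which seems reasonable when the rules are precise and the model has never seen anything SKU-like. If your rules are noisy, you may want the opposite priority.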

Does this help?

Hi Vincent, yes, this clarifies it for me. Thanks for your explanation. Still, I find it quite hard to write a regex that will actually cover all possible variations of a SKU (a vendor's imagination is endless).

Thinking of a creative way to identify those SKU tokens...

Do you have a long list of pre-existing SKUs? If so, you can use it to test the effectiveness of your patterns.
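A rough sketch of such a check, using only Python's `re` module: it reports which share of a known SKU list is matched in full by at least one candidate pattern, and which SKUs slip through. The two regexes in `CANDIDATES` and the `coverage` helper are made up for illustration, and the four SKUs are the examples from the question.

```python
import re

# Hypothetical candidate patterns; each one must match a whole SKU string.
CANDIDATES = {
    "digit_groups": r"\d+(?: \d+){2,4}",
    "dash_separated": r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+){1,4}",
}


def coverage(patterns, skus):
    """Return the fraction of SKUs matched by any pattern, plus the misses."""
    compiled = [re.compile(p) for p in patterns.values()]
    misses = [s for s in skus if not any(c.fullmatch(s) for c in compiled)]
    return 1 - len(misses) / len(skus), misses


known_skus = ["204A-XR-888", "475 SSIG 06", "USR8IBNU4-SE.333K", "063 890 4 210"]
score, misses = coverage(CANDIDATES, known_skus)
print(f"coverage: {score:.0%}")
print("missed:", misses)
```

The list of misses tells you which shape to write a pattern for next. Note that this only measures how many known SKUs are caught; to see how often a pattern fires on things that are not SKUs, you would also need to run it over real product descriptions.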

Note that patterns in spaCy allow you to use a regex, but they offer a bit more than just that: they can also match on token attributes and on sequences of tokens. A few patterns come to mind that feel simple enough to consider:

  • Do we have 3-5 tokens after each other that are all numbers? If so, it's probably a SKU.
  • Do we have 3-5 tokens after each other that are all separated by a dash? If so, it's probably a SKU.
  • Do we have a sequence of 3-4 tokens that start and end with number tokens of size at least 2? If so, it's probably a SKU.
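The three heuristics above can be sketched in plain Python as follows. This is not spaCy Matcher code: it uses whitespace tokens instead of spaCy's tokenizer, so a dash-separated SKU counts as one token that is split on "-", and the function names are hypothetical.

```python
def is_number(token, min_len=1):
    """True if the token consists only of digits and has at least min_len of them."""
    return token.isdigit() and len(token) >= min_len


def match_at(tokens, i):
    """Return the length (in tokens) of a SKU candidate starting at i, or 0."""
    # Heuristic 1: a run of 3-5 consecutive all-digit tokens.
    run = 0
    while run < 5 and i + run < len(tokens) and tokens[i + run].isdigit():
        run += 1
    if run >= 3:
        return run
    # Heuristic 2: 3-5 alphanumeric parts joined by dashes (one whitespace token).
    parts = tokens[i].split("-")
    if 3 <= len(parts) <= 5 and all(p.isalnum() for p in parts):
        return 1
    # Heuristic 3: 3-4 tokens that start and end with a number of 2+ digits.
    if is_number(tokens[i], 2):
        for n in (4, 3):
            if i + n <= len(tokens) and is_number(tokens[i + n - 1], 2):
                return n
    return 0


def find_sku_candidates(text):
    """Scan left to right and collect non-overlapping SKU candidates."""
    tokens = text.split()
    found, i = [], 0
    while i < len(tokens):
        length = match_at(tokens, i)
        if length:
            found.append(" ".join(tokens[i:i + length]))
            i += length
        else:
            i += 1
    return found


for description in [
    "Brake pad 063 890 4 210 green",
    "Filter 204A-XR-888 blue",
    "Valve 475 SSIG 06 steel",
    "Hose USR8IBNU4-SE.333K 2 m",
]:
    print(description, "->", find_sku_candidates(description))
```

The last description comes back empty, which shows that a SKU like USR8IBNU4-SE.333K needs a pattern of its own. The third heuristic is also loose: it does not look at the middle tokens, so a phrase such as "100 kg of 50" would be flagged too.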

You'll likely have a lot of these patterns. But it feels like the simplest starting point. If you're looking for an interactive environment to toy around with spaCy patterns, you may appreciate this interactive demo.