NER with commas in the word through ner.correct

Hi!

I'm using the web tool to annotate data and it's great.

Our data can be provided in the following format:
"unit 1,32 fake road"

So I want to pick out the unit and house number and the road as "unit 1", "32", "fake road" for the labels.

But when selecting it in the Web UI, the label is selected for both.

Can we also match partial words through the tool as well? Like if the word starts with/contains? Or is this better to do through a pre-processing/cleaning step?

hi @joshx!

Thanks for your question and welcome to the Prodigy community :wave:

There are a few options:

Option 1: Use character-based highlighting

The ner.manual recipe provides an option for character-based highlighting of your initial labels: add the --highlight-chars argument.

There are two problems with this approach. First, if you didn't do character-based highlighting the first time, you'd need to redo your annotations.

However, the second problem is worse: even if you do character-based annotations, NER models are built for token-based tags, not characters. Here are more details:

But as the docs show, this capability is really for languages (e.g., Chinese) where characters represent tokens; it isn't meant for token-based languages like English when training NER models.

Also, the same post explains why --highlight-chars isn't available for ner.correct or ner.teach:

Therefore, I would not recommend this approach.

Option 2: Create a modified tokenizer

I would recommend modifying your tokenizer so that you keep your annotations as tokens, but the tokenizer splits items the way you want. This requires a little knowledge of spaCy's tokenizer, but there is documentation. What I would recommend is: find a handful of examples where you've noticed the current tokenizer doesn't behave as you'd like, then create a modified tokenizer that performs to your liking. Save that tokenizer, and use it for all parts of your labeling workflow: ner.manual, ner.correct, etc.
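For example, here's a minimal sketch of extending spaCy's default English infix rules. It assumes the only change you need is splitting on commas that sit between digits (as in "1,32"), which the default English tokenizer keeps together:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Assumption: the only fix needed is splitting on commas between digits.
# The default English infix rules split commas between letters but not
# between digits (so numbers like "1,000" stay whole).
infixes = nlp.Defaults.infixes + [r"(?<=[0-9]),(?=[0-9])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("unit 1,32 fake road")
print([t.text for t in doc])
# ['unit', '1', ',', '32', 'fake', 'road']
```

With "1,32" split into separate tokens, "unit 1", "32", and "fake road" can each be highlighted as whole-token spans. You can then save the pipeline with nlp.to_disk() (the path "./address_tokenizer" below is just illustrative) and pass that directory as the model argument to ner.manual, ner.correct, etc., so every recipe sees the same tokenizer.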

The downside is you'll need to redo your initial annotations. While you may be tempted to reuse your original annotations that used the default tokenizer, you'll likely run into a problem, as a small % of annotations will be tokenized inconsistently (e.g., ner.manual used the default tokenizer while your ner.correct uses your modified tokenizer). Alternatively, if you're good with Python, you could identify which annotations the two tokenizers treat differently, relabel only those in ner.manual, then proceed.
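That check can be sketched by running each text through both tokenizers and flagging disagreements. This again assumes the modified tokenizer only adds a hypothetical digit-comma infix rule:

```python
import spacy
from spacy.util import compile_infix_regex

default_nlp = spacy.blank("en")

modified_nlp = spacy.blank("en")
# Assumption: the modified tokenizer just adds a digit-comma infix rule.
infixes = modified_nlp.Defaults.infixes + [r"(?<=[0-9]),(?=[0-9])"]
modified_nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

def needs_relabel(text: str) -> bool:
    # An example only needs relabeling if the two tokenizers disagree.
    return [t.text for t in default_nlp(text)] != [t.text for t in modified_nlp(text)]

texts = ["unit 1,32 fake road", "32 fake road"]
print([t for t in texts if needs_relabel(t)])
# ['unit 1,32 fake road']
```

Only the flagged texts would need a second pass through ner.manual; everything else can be reused as-is.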

The key message is to keep one and only one tokenizer throughout your entire annotation/training process. This is echoed in the docs:

When using character-based highlighting, annotation may be slower and there’s no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used.

Option 3: Pre-process/clean data and still use default tokenizer

The other option is to "pre-process" the text, e.g., add in whitespace manually, to "trick" spaCy's default tokenizer into performing as you want. For example, "unit 1,32 fake road" -> "unit 1, 32 fake road". I would caution against this, as ideally you embed pre-processing steps into spaCy's pipeline (e.g., through the tokenizer) so that your pipeline always works on raw data.
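If you do go this route, the cleaning step could be as simple as a regex that inserts a space after a comma sitting between two digits (the function name is just illustrative):

```python
import re

def space_after_digit_comma(text: str) -> str:
    # Insert a space after a comma that sits between two digits,
    # so the default tokenizer splits "1,32" into separate tokens.
    return re.sub(r"(?<=[0-9]),(?=[0-9])", ", ", text)

print(space_after_digit_comma("unit 1,32 fake road"))
# unit 1, 32 fake road
```

Just remember that the same function would then have to run on every text you ever feed the model, at annotation time and at inference time, which is exactly why baking the rule into the tokenizer is the safer option.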

Let me know if this helps or if you have other questions!