Assign one token with two labels

sigitpurnomo · April 20, 2021, 4:25am

Hello Ines and Matthew,

I have been using Prodigy to do NER annotation in the Indonesian language. I am facing a problem when I need to assign two labels to one token. For example, in English, the format of a quotation is like this: ["direct-quote," said him], so I can easily label "said" as REPORTING-VERB and "him" as PERSON-COREF. But in Indonesian, the reporting verb and the person coreference are written as one word or token, for example "ujarnya", so I cannot assign the label REPORTING-VERB to "ujar" and PERSON-COREF to "nya".

Do you have any suggestions for this problem?

Thank you
Sigit

ines · April 21, 2021, 1:01am

Hi! So if I understand the question correctly, the goal here isn't necessarily that one token should have two labels, but that one whitespace-delimited chunk that's split into a single token ("ujarnya") should actually be two tokens, right?

One way to solve this would be to introduce additional tokenizer exceptions for splitting words like this. Of course, this works best if there's a more or less finite list of these types of expressions, or some other pattern that you can use. If it's not possible to express the logic with tokenizer exceptions and/or you need more annotations to make the decision (e.g. POS tags or dependency labels), you could use Matcher rules and Doc.retokenize to split the tokens further. This means you'll be able to train a model to predict the correct token-based tags for these expressions, and assign the labels more easily.
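As a rough illustration of the two approaches described above, here is a minimal sketch assuming spaCy v3 and a blank Indonesian pipeline. The example sentence, the `REPORTING_FORMS` list and the `split_reporting_forms` helper are made up for illustration and are not from the thread; a real project would need its own word list or pattern.

```python
import spacy
from spacy.matcher import Matcher
from spacy.symbols import ORTH

TEXT = "Kami akan datang, ujarnya."  # illustrative sentence (assumption)

# Option 1: a tokenizer exception (special case) for a known word.
# The ORTH values must concatenate to exactly the original string.
nlp_exc = spacy.blank("id")
nlp_exc.tokenizer.add_special_case("ujarnya", [{ORTH: "ujar"}, {ORTH: "nya"}])
tokens_exc = [t.text for t in nlp_exc(TEXT)]

# Option 2: find candidates with the Matcher, then split them with
# Doc.retokenize. REPORTING_FORMS is an illustrative, non-exhaustive list.
REPORTING_FORMS = ["ujarnya", "katanya", "tuturnya"]

nlp_ret = spacy.blank("id")
matcher = Matcher(nlp_ret.vocab)
matcher.add("REPORTING_VERB_NYA", [[{"LOWER": {"IN": REPORTING_FORMS}}]])


def split_reporting_forms(doc):
    """Split each matched token into the verb stem and the trailing "nya"."""
    matches = matcher(doc)
    with doc.retokenize() as retokenizer:
        for _match_id, start, _end in matches:
            token = doc[start]
            stem, clitic = token.text[:-3], token.text[-3:]
            # Attach both new subtokens to the first subtoken (the stem).
            heads = [(token, 0), (token, 0)]
            retokenizer.split(token, [stem, clitic], heads=heads)
    return doc


tokens_ret = [t.text for t in split_reporting_forms(nlp_ret(TEXT))]

print(tokens_exc)
print(tokens_ret)
```

Special cases are matched on the exact string, so capitalized variants such as "Ujarnya" would need their own entries. The Matcher version uses `LOWER` and is therefore case-insensitive. A broad pattern such as "any word ending in nya" would also hit ordinary words like "hanya" or "tanya", which is why the sketch uses an explicit list.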

sigitpurnomo · April 21, 2021, 2:48am

Hi Ines, thank you for your response.

So for now, it is still not possible in Prodigy to highlight the word "ujarnya" as two separate labels, "ujar" as REPORTING-VERB and "nya" as PERSON-COREF? When I try to annotate the word "ujarnya", the UI always highlights the whole word, even if I select only "ujar". The same happens when I select only "nya": the UI automatically highlights "ujarnya" and assigns the label that I have already chosen.

Update:
Ines, I have found a solution to this problem by adding the --highlight-chars flag.

Thanks

ines · April 21, 2021, 5:17am

This isn't really related to Prodigy – under the hood, it just uses the tokenization provided by whichever model you use. So ideally, you want to change the tokenization, or annotate with the tokenization you want to apply during training. Otherwise (e.g. if you're just highlighting characters), you'll be creating annotations for tokens that your model never produces, so you won't be able to learn from them.
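To make the mismatch concrete, here is a small sketch, assuming spaCy v3 and a blank Indonesian pipeline, that checks whether a character-level annotation lines up with token boundaries. The sentence and the variable names are illustrative assumptions. `Doc.char_span` returns `None` when the character offsets do not map onto whole tokens.

```python
import spacy
from spacy.symbols import ORTH

TEXT = "Kami akan datang, ujarnya."  # illustrative sentence (assumption)
start = TEXT.index("ujarnya")
end = start + len("ujar")  # character offsets covering only "ujar"

# Default tokenization: "ujarnya" stays a single token, so a span over
# "ujar" alone does not line up with any token boundary.
nlp_default = spacy.blank("id")
doc_default = nlp_default(TEXT)
misaligned = doc_default.char_span(start, end, label="REPORTING-VERB")

# With a tokenizer exception that splits "ujarnya", the same character
# offsets now correspond to a whole token.
nlp_split = spacy.blank("id")
nlp_split.tokenizer.add_special_case("ujarnya", [{ORTH: "ujar"}, {ORTH: "nya"}])
doc_split = nlp_split(TEXT)
aligned = doc_split.char_span(start, end, label="REPORTING-VERB")

print("default tokenizer:", misaligned)
print("with exception:", aligned)
```

A check like this can be run over character-highlighted annotations before training to find spans the tokenizer would never produce.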

sigitpurnomo · April 21, 2021, 6:05am

Hi Ines.

OK, thank you for the information and suggestions. I really appreciate it.
