Assign one token with two labels

Hello Ines and Matthew,

I have been using Prodigy to do NER annotation in the Indonesian language. I am facing a problem when I need to assign two labels to one token. For example, in English, the format of a quotation is like this: ["direct-quote," said he], so I can easily label "said" as REPORTING-VERB and "he" as PERSON-COREF. But in Indonesian, the reporting verb and the person coreference are written as one word or token, for example "ujarnya", so I cannot assign the label REPORTING-VERB to "ujar" and PERSON-COREF to "nya".

Do you have any suggestions for this problem?

Thank you
Sigit

Hi! So if I understand the question correctly, the goal here isn't necessarily that one token should have two labels, but that one whitespace-delimited chunk that's split into a single token ("ujarnya") should actually be two tokens, right?

One way to solve this would be to introduce additional tokenizer exceptions for splitting words like this. Of course, this works best if there's a more or less finite list of these types of expressions, or some other pattern that you can use. If it's not possible to express the logic with tokenizer exceptions and/or you need more annotations to make the decision (e.g. POS tags or dependency labels), you could use Matcher rules and Doc.retokenize to split the tokens further. This means you'll be able to train a model to predict the correct token-based tags for these expressions, and assign the labels more easily.
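As a rough sketch of the two options described above (not the only way to do it): the first part registers tokenizer exceptions with add_special_case, and the second splits already-tokenized words with Doc.retokenize. The blank Indonesian pipeline, the example sentence and the CLITIC_WORDS list are assumptions made for illustration, and a plain list lookup stands in for Matcher rules here.

```python
import spacy
from spacy.symbols import ORTH

# Hypothetical list of words that should be split into stem + "nya".
CLITIC_WORDS = ["ujarnya", "katanya"]
TEXT = "Dia sudah pergi, ujarnya."

# Option 1: tokenizer exceptions.
# Assumption: a blank Indonesian pipeline; use whichever model you annotate with.
nlp = spacy.blank("id")
for word in CLITIC_WORDS:
    stem = word[: -len("nya")]
    # The ORTH pieces must join back to exactly the original string.
    nlp.tokenizer.add_special_case(word, [{ORTH: stem}, {ORTH: "nya"}])

doc = nlp(TEXT)
tokens = [t.text for t in doc]
print(tokens)

# Option 2: split tokens after tokenization with Doc.retokenize.
plain_nlp = spacy.blank("id")
doc2 = plain_nlp(TEXT)
with doc2.retokenize() as retokenizer:
    for token in doc2:
        if token.text in CLITIC_WORDS:
            stem = token.text[: -len("nya")]
            # No parser is involved, so each subtoken is attached to itself.
            heads = [(token, 0), (token, 1)]
            retokenizer.split(token, [stem, "nya"], heads=heads)

retokenized = [t.text for t in doc2]
print(retokenized)
```

If the words to split cannot be listed up front, the membership check in the second part could be replaced by whatever rule identifies them, for example Matcher patterns over POS tags.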

Hi Ines, thank you for your response.

So, for now, is it still not possible in Prodigy to split the word "ujarnya" into two separately labeled parts, "ujar" as REPORTING-VERB and "nya" as PERSON-COREF? When I try to annotate the word "ujarnya", the UI always highlights the whole word, even though I only select "ujar". The same happens when I select only "nya": the UI automatically highlights "ujarnya" and assigns the label I have already chosen.

Update Note:
Ines, I have found a solution to this problem: adding the --highlight-chars flag :pray:

Thanks

This isn't really related to Prodigy. Under the hood, it just uses the tokenization provided by whichever model you use. So ideally, you want to either change the tokenization or annotate with the same tokenization you intend to apply during training. Otherwise (e.g. if you're just highlighting characters), you'll be creating annotations for tokens that your model never produces, so it won't be able to learn from them.
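To make the mismatch concrete, here is a small sketch using Doc.char_span, which returns None when character offsets do not line up with token boundaries. The blank Indonesian pipeline and the example sentence are assumptions for illustration; the labels are the ones from the question.

```python
import spacy
from spacy.symbols import ORTH

text = "Dia sudah pergi, ujarnya."
start = text.index("ujarnya")

# Default tokenization: "ujarnya" is one token, so a character-level
# annotation covering only "ujar" cannot be mapped onto any token.
default_nlp = spacy.blank("id")
default_doc = default_nlp(text)
misaligned = default_doc.char_span(start, start + 4, label="REPORTING-VERB")
print(misaligned)  # None: the offsets fall inside a token

# With a tokenizer exception, the same offsets align with real tokens.
custom_nlp = spacy.blank("id")
custom_nlp.tokenizer.add_special_case("ujarnya", [{ORTH: "ujar"}, {ORTH: "nya"}])
custom_doc = custom_nlp(text)
verb = custom_doc.char_span(start, start + 4, label="REPORTING-VERB")
pronoun = custom_doc.char_span(start + 4, start + 7, label="PERSON-COREF")

# Both spans are valid, so they can be stored as entities on the Doc.
custom_doc.ents = [verb, pronoun]
entities = [(ent.text, ent.label_) for ent in custom_doc.ents]
print(entities)
```

A check like this can be run over character-level annotations before training, to see how many of them the tokenizer in use can actually represent.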

Hi Ines.

Ok, thank you for the information and suggestions. I really appreciate it.