Disambiguate POS Tags

Hi :slight_smile:

thanks for activating my research licence - I'm so excited!

I want to use Prodigy for disambiguating part-of-speech tags (for a low-resource European language). That means I already have a nicely POS-tagged dataset, but some tokens are not disambiguated. Here's one example:

tokens = ["Das", "ist", "ein", "Testsatz", "."]

might, for example, lead to the following tags:

["ART/PDS", "VAFIN", "ART/PDS", "NN", "."]

What would the training data for that kind of problem look like?

Thanks in advance + regards,

Stefan

Yay :tada: My first idea would be to try using the "choice" interface to select which of the tags applies. See here for a demo. The task could display the highlighted token (without a label), and two (or more) options of the tags. You could then click through and select the correct label. The interface also supports keyboard shortcuts, so once you’re in a good flow, annotation should be super fast.

You could pre-process your data and only create an annotation task for the ambiguous examples. A task could look like this:

{
    "text": "Das ist ein Testsatz",
    "spans": [{"start": 0, "end": 3}],
    "options": [
        {"id": "ART", "text": "ART"},
        {"id": "PDS", "text": "PDS"}
    ]
}

In this example, the "spans" property is mostly used to highlight the token in question. The options will be displayed underneath the text. Prodigy also supports passing in arbitrary metadata, which is preserved with the task – so you could add any other custom properties like references to your corpus or dataset, which will help you relate the annotations back to the original data later on.
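For example, a task with a "meta" property might look like this. The sent_id and token_index fields here are made up for illustration – use whatever fields let you find the token in your corpus again:

```python
import json

# A task with custom metadata. The "sent_id" and "token_index" keys are
# hypothetical – any fields that let you map the annotation back to your
# corpus will work, and Prodigy will pass them through untouched.
task = {
    "text": "Das ist ein Testsatz",
    "spans": [{"start": 0, "end": 3}],
    "options": [
        {"id": "ART", "text": "ART"},
        {"id": "PDS", "text": "PDS"},
    ],
    "meta": {"sent_id": 1, "token_index": 0},
}

print(json.dumps(task))
```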

Here’s an example of a simple data conversion script. To highlight a span in your text (e.g. the current token), Prodigy expects the character offsets into the text. So if you don’t have this in your original corpus, you’d have to write a little function that does this.

examples = []  # export this later

for tags, tokens in YOUR_TAGGED_CORPUS:
    text = ' '.join(tokens)  # or maybe you do have the original text?
    for i, tag in enumerate(tags):
        if '/' not in tag:
            continue  # skip tags that aren't ambiguous
        tag_options = tag.split('/')
        options = [{'id': t, 'text': t} for t in tag_options]
        # calculate the character offsets of the current token 
        start, end = CALCULATE_OFFSETS(tokens, i)
        spans = [{'start': start, 'end': end}]
        task = {'text': text, 'spans': spans, 'options': options}
        examples.append(task)

Here’s an example of using spaCy to calculate the character offsets if you don’t have them in your data. You might also make that function return the doc.text, to make sure it matches the offsets. If your data includes information on whether each token is followed by whitespace, you can include that using the spaces keyword argument – for example, spaces=[True, True, True, False].

from spacy.tokens import Doc
from spacy.vocab import Vocab

def calculate_offsets(tokens, i):
    doc = Doc(Vocab(), words=tokens)
    token = doc[i]
    return token.idx, token.idx + len(token)
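If you'd rather avoid the spaCy dependency, the same calculation can be sketched in plain Python – assuming the text is produced by joining the tokens with single spaces, as in the conversion script above:

```python
def calculate_offsets(tokens, i):
    # Character offset of token i in ' '.join(tokens): the lengths of all
    # preceding tokens, plus one space after each of them.
    start = sum(len(token) + 1 for token in tokens[:i])
    return start, start + len(tokens[i])

print(calculate_offsets(["Das", "ist", "ein", "Testsatz", "."], 3))  # → (12, 20)
```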

Once you’ve converted the data, you can then export the examples to a JSON or JSONL file and load that into Prodigy. I’d recommend using the mark recipe, which will simply show you whatever you load in, in order:

prodigy mark pos_dataset your_exported_data.jsonl --view-id choice
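To produce that JSONL file in the first place, Python's standard library is enough – a minimal sketch, where examples stands in for the list built by the conversion script above:

```python
import json

# Minimal sketch: `examples` stands in for the list of tasks built by
# the conversion script above.
examples = [
    {
        "text": "Das ist ein Testsatz",
        "spans": [{"start": 0, "end": 3}],
        "options": [{"id": "ART", "text": "ART"}, {"id": "PDS", "text": "PDS"}],
    }
]

with open("your_exported_data.jsonl", "w", encoding="utf8") as f:
    for task in examples:
        f.write(json.dumps(task) + "\n")
```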

If you set "choice_auto_accept": true in your prodigy.json, the choice answer will automatically be “locked in” when you select it, and you won’t have to click the accept button explicitly. You can still always go back and click reject or ignore – for example, if it turns out that all tags are incorrect or the example is bad etc.
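For reference, that setting goes into your prodigy.json like this:

```json
{
    "choice_auto_accept": true
}
```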

When you’re done annotating, you can export and convert the annotations to whichever format you need. Here’s how to export a dataset:

prodigy db-out pos_dataset > pos_annotations.jsonl

The annotations will include an "accept" property, which is a list of all selected option IDs. For example, "accept": ["NN"]. So you could loop over the data, match it up with your original corpus and overwrite the ambiguous tags with the correct ones.
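A minimal sketch of that last step – assuming you stored the token's position in the task's "meta" so you can find it again (the token_index field is made up here, as is the helper name):

```python
def apply_annotation(tags, annotation):
    # tags: the original tag list for one sentence, e.g. ["ART/PDS", "VAFIN", ...]
    # annotation: one record from the db-out JSONL
    if annotation.get("answer") == "accept" and annotation.get("accept"):
        i = annotation["meta"]["token_index"]  # hypothetical meta field
        tags = list(tags)
        tags[i] = annotation["accept"][0]  # the selected option ID
    return tags

tags = ["ART/PDS", "VAFIN", "ART/PDS", "NN", "."]
annotation = {
    "answer": "accept",
    "accept": ["ART"],
    "meta": {"token_index": 0},
}
print(apply_annotation(tags, annotation))  # → ['ART', 'VAFIN', 'ART/PDS', 'NN', '.']
```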