My first idea would be to try using the "choice" interface to select which of the tags applies. See here for a demo. The task could display the highlighted token (without a label) and two (or more) options for the tags. You could then click through and select the correct label. The interface also supports keyboard shortcuts, so once you’re in a good flow, annotation should be super fast.
You could pre-process your data and only create an annotation task for the ambiguous examples. A task could look like this:
{
  "text": "Dies ist ein Testsatz",
  "spans": [{"start": 0, "end": 4}],
  "options": [
    {"id": "ART", "text": "ART"},
    {"id": "PDS", "text": "PDS"}
  ]
}
In this example, the "spans" property is mostly used to highlight the token in question. The options will be displayed underneath the text. Prodigy also supports passing in arbitrary metadata, which is preserved with the task – so you could add any other custom properties, like references to your corpus or dataset, which will help you relate the annotations back to the original data later on.
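For instance, a task carrying extra metadata might look like the sketch below. The "corpus_id" and "sent_id" fields are hypothetical names I made up for illustration – Prodigy doesn't require any particular schema here, it just passes the extra properties through:

```python
# A task dict with extra metadata; "corpus_id" and "sent_id" are
# made-up field names, not anything Prodigy prescribes.
task = {
    "text": "Dies ist ein Testsatz",
    "spans": [{"start": 0, "end": 4}],
    "options": [
        {"id": "ART", "text": "ART"},
        {"id": "PDS", "text": "PDS"},
    ],
    "meta": {"corpus_id": "my-corpus", "sent_id": 42},
}
```

Whatever you put in there comes back out with the annotation, so pick keys that let you look the example up in your original corpus.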
Here’s an example of a simple data conversion script. To highlight a span in your text (e.g. the current token), Prodigy expects the character offsets into the text. So if you don’t have this in your original corpus, you’d have to write a little function that does this.
examples = []  # export this later
for tags, tokens in YOUR_TAGGED_CORPUS:
    text = ' '.join(tokens)  # or maybe you do have the original text?
    for i, tag in enumerate(tags):
        if '/' not in tag:
            continue  # skip tags that aren't ambiguous
        tag_options = tag.split('/')
        options = [{'id': t, 'text': t} for t in tag_options]
        # calculate the character offsets of the current token
        start, end = CALCULATE_OFFSETS(tokens, i)
        spans = [{'start': start, 'end': end}]
        task = {'text': text, 'spans': spans, 'options': options}
        examples.append(task)
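Writing the examples out as JSONL afterwards is just one JSON object per line – a minimal sketch using the standard library (the file name is arbitrary):

```python
import json

# A couple of converted tasks, as produced by the loop above
examples = [
    {
        "text": "Dies ist ein Testsatz",
        "spans": [{"start": 0, "end": 4}],
        "options": [{"id": "ART", "text": "ART"}, {"id": "PDS", "text": "PDS"}],
    },
]

# JSONL = one JSON-serialized task per line
with open("pos_tasks.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")
```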
Here’s an example of using spaCy to calculate the character offsets if you don’t have them in your data. You might also make that function return the doc.text, to make sure it matches the offsets. If your data includes information on whether each token is followed by whitespace, you can include that using the spaces keyword argument – for example, spaces=[True, True, True, False].
from spacy.tokens import Doc
from spacy.vocab import Vocab

def calculate_offsets(tokens, i):
    doc = Doc(Vocab(), words=tokens)
    token = doc[i]
    return token.idx, token.idx + len(token)
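If you’d rather not depend on spaCy for this step, the same offsets can be computed by hand – assuming, as in the text = ' '.join(tokens) line above, that tokens are joined with single spaces:

```python
def calculate_offsets(tokens, i):
    # Offset of token i in ' '.join(tokens): the lengths of all
    # preceding tokens, plus one joining space after each of them.
    start = sum(len(t) for t in tokens[:i]) + i
    return start, start + len(tokens[i])

tokens = ["Dies", "ist", "ein", "Testsatz"]
start, end = calculate_offsets(tokens, 2)
# ' '.join(tokens)[start:end] gives back "ein"
```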
Once you’ve converted the data, you can export the examples to a JSON or JSONL file and load that into Prodigy. I’d recommend using the mark recipe, which will simply show you whatever you load in, in order:
prodigy mark pos_dataset your_exported_data.jsonl --view-id choice
If you set "choice_auto_accept": true in your prodigy.json, the choice answer will automatically be “locked in” when you select it, and you won’t have to click the accept button explicitly. You can still always go back and click reject or ignore – for example, if it turns out that all tags are incorrect, the example is bad, etc.
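For reference, that setting is just a top-level key in the config file, so a minimal prodigy.json could look like this (any other settings you already have would sit alongside it):

```json
{
  "choice_auto_accept": true
}
```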
When you’re done annotating, you can export and convert the annotations to whichever format you need. Here’s how to export a dataset:
prodigy db-out pos_dataset > pos_annotations.jsonl
The annotations will include an "accept" property, which is a list of all selected option IDs – for example, "accept": ["NN"]. So you could loop over the data, match it up with your original corpus, and overwrite the ambiguous tags with the correct ones.
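That last step could be sketched roughly like this. It assumes one selected option per task and that the task’s text plus the span offset are enough to find the token again – in practice you’d match on whatever IDs you stored in the task’s metadata:

```python
# Toy stand-in for your corpus: text mapped to its tag sequence
corpus = {"Dies ist ein Testsatz": ["ART/PDS", "VAFIN", "ART", "NN"]}

# One annotation as exported by db-out; "answer" is Prodigy's
# accept/reject/ignore decision, "accept" the selected option IDs
annotation = {
    "text": "Dies ist ein Testsatz",
    "spans": [{"start": 0, "end": 4}],
    "accept": ["PDS"],
    "answer": "accept",
}

if annotation["answer"] == "accept" and annotation["accept"]:
    tags = corpus[annotation["text"]]
    # recover the token index from the character offset: with
    # space-joined tokens, it's the number of spaces before it
    start = annotation["spans"][0]["start"]
    token_idx = annotation["text"][:start].count(" ")
    # overwrite the ambiguous tag with the selected one
    tags[token_idx] = annotation["accept"][0]
```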