UI for correcting tokens (to create data for training a tokenizer)

Hi, I'm looking for a way to correct a piece of tokenized text in order to retrain a tokenizer. The UI would need a way to both split and merge adjacent tokens.

Is there a way to customize Prodigy to do this?

Hi! One option to implement something like this could be via the --highlight-chars setting, which lets you highlight individual characters: https://prodi.gy/docs/named-entity-recognition#highlight-chars

Under the hood, this works by setting use_chars=True on the add_tokens preprocessor, which in turn will add one entry in the "tokens" for each character. The "spans" you annotate would then reflect the actual tokens, and you could pre-populate them using your tokenization.
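To make the data shape a bit more concrete, here is a minimal sketch in plain Python of what such a character-level task could look like. It does not call Prodigy at all: `char_task` and the `TOKEN` label are hypothetical names, and the `"ws": False` value for character entries is an assumption rather than something taken from the `add_tokens` output.

```python
def char_task(text, token_offsets, label="TOKEN"):
    """Build a task with one "tokens" entry per character and one span per token.

    token_offsets: list of (start, end) character offsets from your tokenizer.
    """
    tokens = []
    for i, ch in enumerate(text):
        tokens.append({
            "text": ch,
            "start": i,
            "end": i + 1,
            "id": i,
            # Assumption: spaces are entries of their own here, so no
            # character entry is marked as followed by whitespace.
            "ws": False,
        })
    spans = []
    for start, end in token_offsets:
        spans.append({
            "start": start,
            "end": end,
            # With one entry per character, token index == character offset.
            "token_start": start,
            "token_end": end - 1,  # inclusive index of the last character
            "label": label,
        })
    return {"text": text, "tokens": tokens, "spans": spans}


# Pre-populate the spans with an existing tokenization of the text.
task = char_task("This is fine", [(0, 4), (5, 7), (8, 12)])
print([task["text"][s["start"]:s["end"]] for s in task["spans"]])
```

Each pre-populated span then stands for one token, and correcting the tokenization means adjusting the span boundaries.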

However, annotating every single character and token can easily become pretty tedious and messy, so this might be a good use case for some automation and pre-selection so you can primarily focus on the tokens that matter the most and that your tokenizer is more likely to get wrong. There are likely a lot of tokens that your tokenizer will always get right, no matter what, and those are likely also going to be the most common words (e.g. "the"), which will be extremely common in your data (see Zipf's law). So you could set the "tokens" characters of these words to "disabled": true in your data to make them unselectable and only keep the more complex tokens for annotation. This will be much faster to annotate, and the data you get from it will be just as accurate. If the tokenizer you're training gives you confidence scores, that's also something you could exploit: start by focusing on examples with lower-confidence tokens, which will help you select data that your model can learn the most from.
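As a rough illustration of the pre-selection idea, the sketch below marks the characters of a few frequent words as disabled. The `disable_common_words` helper and the hand-picked word set are assumptions made for the example; in practice you would probably derive the set from frequency counts over your own data.

```python
def disable_common_words(task, common_words):
    """Mark the characters of frequent, low-risk words as unselectable."""
    text = task["text"]
    for span in task["spans"]:
        word = text[span["start"]:span["end"]].lower()
        if word in common_words:
            # One "tokens" entry per character, so character offsets can be
            # used directly as indices into the token list.
            for token in task["tokens"][span["start"]:span["end"]]:
                token["disabled"] = True
    return task


text = "the U.K. is"
task = {
    "text": text,
    "tokens": [
        {"text": ch, "start": i, "end": i + 1, "id": i}
        for i, ch in enumerate(text)
    ],
    # Spans reflect the tokenizer's current output.
    "spans": [{"start": 0, "end": 3}, {"start": 4, "end": 8}, {"start": 9, "end": 11}],
}
disable_common_words(task, {"the", "is"})
print([t["id"] for t in task["tokens"] if t.get("disabled")])
```

Only the characters of "U.K." stay selectable here, which is the kind of token a tokenizer is more likely to get wrong.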

Hi Ines, thanks so much for your prompt and elaborate answer!

As you describe this can become very complex, something which I'd like to prevent. Instead of using a workaround by using existing features of Prodigy, I'm thinking of making something new. Would it be possible to create my own UI component (Javascript) and corresponding Python class by extending Prodigy, to simplify the process? I can't really find any examples where Prodigy is extended in both frontend and backend.


I was thinking of adding a "breaker" and a "merger" annotation respectively between characters and tokens in the UI. These annotations are then used to merge and create new tokens with the correct start and end positions.

What do you think?

In theory, yes – you could have a custom recipe using the html interface and a html_template and then provide custom "javascript" that lets you interact with the HTML elements. For instance, you could update properties in your JSON task when you click on a certain element.

How easy/complex this will be really depends on what you're trying to do and how you want the interactions to work in the UI. I'm actually pretty curious how you envision the "breaker" and "merger" annotations to work and what the interaction pattern could look like in the UI – also because I'm always interested in ideas for new potential interfaces. I guess one option could be to visualise all "breaks" (i.e. token boundaries) in a subtle way and then allow adding/removing them on click. In the underlying JSON, those breaks could then be represented as a list of character offsets, indicating where to split.
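A minimal sketch of that representation, assuming a hypothetical `"breaks"` key in the task JSON (this is not an existing Prodigy field), could look like this. Each break is a character offset at which a new token starts, and the helper turns the list into `(start, end)` pairs.

```python
def breaks_to_offsets(text, breaks):
    """Convert a list of split positions into (start, end) character offsets."""
    # Ignore duplicates and positions outside the text.
    inner = sorted({b for b in breaks if 0 < b < len(text)})
    bounds = [0] + inner + [len(text)]
    return list(zip(bounds, bounds[1:]))


# "breaks" is a made-up key name used only for this example.
task = {"text": "This is an example", "breaks": [5, 8, 11]}
offsets = breaks_to_offsets(task["text"], task["breaks"])
chunks = [task["text"][start:end] for start, end in offsets]
print(offsets)
print(chunks)
```

Because the chunks cover the text without gaps or overlaps, joining them always gives back the original string.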

Implementing this via custom HTML and JS would definitely be a bit more involved, though, because you'd have to re-implement the whole rendering of the text with the inserted token breaks, and find a way to determine the character offset of the characters you're clicking on in the UI :thinking: I can't immediately think of a straightforward way to make this work efficiently, but I'll definitely keep thinking because this is potentially very interesting!

This is my idea of the interface:

  1. By default you see a green cursor when hovering over the text. This is the token breaker/splitter.
  2. You can join 2 tokens by clicking on a token, and the arc (merger/joiner) will appear to the nearest adjacent token on the left or the right, whichever is closest to your cursor.
  3. By dragging over characters, you can select them for deletion. The reason I added this option is because of hyphen characters, which can often appear when words are broken at the end of sentences.

I have included a rough sketch of what it should do.

Thanks so much for the detailed and thoughtful suggestion :blush::pray:

One additional idea I just had that could potentially simplify the interaction pattern and make annotation more efficient: ultimately, the "merger" action is just the absence of a splitter, right? So you might be able to achieve the same result by just adding/removing splitters, which could be indicated in the UI using a subtle symbol (represented by | here):

This | is | a | n | examplesentence

Removing one splitter and adding one (e.g. by clicking at the respective character position) will give you the desired tokenization:

This | is | an | example | sentence
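The two clicks in this example can be sketched as toggling offsets in a set of splitter positions. This is only an illustration of the data side, not of any UI code: `toggle` and `render` are hypothetical helpers, and the raw text is assumed to be "This is an examplesentence".

```python
def toggle(breaks, offset):
    """Add the splitter at `offset` if it is missing, remove it if present."""
    return breaks ^ {offset}


def render(text, breaks):
    """Show the tokenization with | as the splitter symbol."""
    bounds = [0] + sorted(breaks) + [len(text)]
    return " | ".join(text[s:e].strip() for s, e in zip(bounds, bounds[1:]))


text = "This is an examplesentence"
breaks = {5, 8, 9, 11}           # the tokenizer's (wrong) splits
before = render(text, breaks)

breaks = toggle(breaks, 9)       # click: remove the splitter inside "an"
breaks = toggle(breaks, 18)      # click: add one between "example" and "sentence"
after = render(text, breaks)

print(before)
print(after)
```

A merge never needs its own action here: it is just a toggle on a splitter that already exists.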

I think it's important to do this on the raw texts with character offsets only, and to make sure you preserve trailing whitespace so you'll always be able to reconstruct the original text from the tokens (for instance, spaCy preserves this in the Token.whitespace_ and Token.text_with_ws attributes and it's also something Prodigy records as a token's "ws" attribute).
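Here is a small sketch of why the whitespace flag matters, again in plain Python. The helper names are made up, and the code assumes each token is followed by at most one plain space and that no chunk consists of whitespace only.

```python
def to_tokens(text, breaks):
    """Turn split positions into token dicts with offsets and a "ws" flag."""
    bounds = [0] + sorted(breaks) + [len(text)]
    tokens = []
    for i, (start, end) in enumerate(zip(bounds, bounds[1:])):
        chunk = text[start:end]
        word = chunk.rstrip(" ")
        tokens.append({
            "text": word,
            "start": start,
            "end": start + len(word),
            "id": i,
            # True if the token is followed by a space in the raw text.
            "ws": len(word) < len(chunk),
        })
    return tokens


def reconstruct(tokens):
    """Rebuild the raw text from token texts and their "ws" flags."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


text = "This is an examplesentence"
tokens = to_tokens(text, [5, 8, 11, 18])
print([(t["text"], t["ws"]) for t in tokens])
print(reconstruct(tokens) == text)
```

Note that "example" and "sentence" both have `"ws": False`, so the corrected tokenization still maps back to the exact original string.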

As a general feature of the same interface, I think deleting characters could potentially be problematic: we always want to make sure to promote non-destructive tokenization, because otherwise, you end up with data that doesn't let you reconstruct the original text. So if we had a tokenization UI like this, I'd be reluctant to make this an official feature because for most use cases, this would be counter-productive and lead to problems down the line.

That said, if this is something you want to do as part of your pre-processing logic, you could assign a specific label to the tokens and then remove them later in a separate step.
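A minimal sketch of that two-step approach, assuming a hypothetical `REMOVE` label stored on the token dicts: the annotated data keeps every token, and the labelled ones are only dropped in a later pre-processing step.

```python
REMOVE = "REMOVE"  # hypothetical label name, pick whatever fits your scheme


def reconstruct(tokens):
    """Rebuild text from token texts and their "ws" flags."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


def strip_labelled(tokens, label=REMOVE):
    """Separate clean-up step: drop tokens that were labelled for removal."""
    return [t for t in tokens if t.get("label") != label]


# A word that was hyphenated at a line break in the source document.
annotated = [
    {"text": "exam", "ws": False},
    {"text": "-", "ws": False, "label": REMOVE},
    {"text": "ple", "ws": True},
    {"text": "sentence", "ws": False},
]

original = reconstruct(annotated)                 # annotation stays lossless
cleaned = reconstruct(strip_labelled(annotated))  # hyphen removed afterwards
print(original)
print(cleaned)
```

This keeps the annotation itself non-destructive, and the removal can be changed or undone later without re-annotating.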