BIO tagging of character data

I'd like to confirm that Prodigy can tag character data and export it in BIO format.

For example, given the input "ZnT-SP", it would produce the BIO tags "Bzone Izone Btemperature O Bsetpoint Isetpoint".

I see there's a demo of "Named Entities (character-based)" which seems to do what I want, but it's not 100% clear since it's in Chinese (I think). I also don't see a way to view the BIO result in the demo.

And is it possible for the annotator to add tags? Or do they have to be created in advance? For example, in the case above, if "Zone" wasn't a predefined tag, can it be added and then applied?

And if I wanted to integrate this into an active learning pipeline: is it possible via an API to pre-select the most likely tags, and to sort or filter the set of available tags by importance (i.e. based on the context, put the more likely tags at the front of the list)?

Here's a character-based example with English text: https://prodi.gy/docs/named-entity-recognition#highlight-chars

The annotations you can export include the start and end character offset of the span, as well as the start and end token index the span refers to. You can also convert character offsets to BILUO/IOB tags programmatically – see here for an example.
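For the purely character-based case, here's a minimal sketch of that conversion, treating every character as its own token. It assumes each exported span is a dictionary with "start", "end" and "label" keys, where "end" is exclusive. The function name spans_to_char_bio is made up for this example and isn't part of Prodigy.

```python
def spans_to_char_bio(text, spans):
    """Return one BIO tag per character of `text`.

    Assumes spans are dicts with "start", "end" (exclusive) and "label".
    """
    tags = ["O"] * len(text)
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end, label = span["start"], span["end"], span["label"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {span}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {span}")
        tags[start] = f"B-{label}"          # first character of the span
        for i in range(start + 1, end):     # remaining characters
            tags[i] = f"I-{label}"
    return tags


# Example task in the shape described above (the labels are illustrative).
example = {
    "text": "ZnT-SP",
    "spans": [
        {"start": 0, "end": 2, "label": "zone"},
        {"start": 2, "end": 3, "label": "temperature"},
        {"start": 4, "end": 6, "label": "setpoint"},
    ],
}
char_tags = spans_to_char_bio(example["text"], example["spans"])
for char, tag in zip(example["text"], char_tags):
    print(char, tag)
```

For "ZnT-SP" this gives B-zone, I-zone, B-temperature, O, B-setpoint, I-setpoint, one tag per character.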

However, since those tags always refer to tokens, you'd have to decide on what the tokens are in your data – for instance, if you want "ZnT" in "ZnT-SP" to be an entity span, your tokenization needs to produce a separate token for it, otherwise you won't be able to create a tag for that token and your model won't be able to learn from it.
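To make the tokenization point concrete, here's a small sketch that maps character-offset spans onto token-level BIO tags and reports any span that doesn't line up with token boundaries. The token offsets are written out by hand for illustration, and spans_to_token_bio is a hypothetical helper, not a Prodigy or spaCy function.

```python
def spans_to_token_bio(tokens, spans):
    """Map character-offset spans onto token-level BIO tags.

    tokens: list of (start, end) character offsets, end exclusive.
    Returns (tags, misaligned), where misaligned holds the spans whose
    boundaries don't coincide with token boundaries.
    """
    starts = {start: i for i, (start, _) in enumerate(tokens)}
    ends = {end: i for i, (_, end) in enumerate(tokens)}
    tags = ["O"] * len(tokens)
    misaligned = []
    for span in spans:
        first = starts.get(span["start"])
        last = ends.get(span["end"])
        if first is None or last is None or last < first:
            misaligned.append(span)         # no token boundary here
            continue
        tags[first] = "B-" + span["label"]
        for i in range(first + 1, last + 1):
            tags[i] = "I-" + span["label"]
    return tags, misaligned


spans = [
    {"start": 0, "end": 2, "label": "zone"},
    {"start": 2, "end": 3, "label": "temperature"},
    {"start": 4, "end": 6, "label": "setpoint"},
]

# Whitespace tokenization: "ZnT-SP" is a single token, so nothing aligns.
coarse_tags, coarse_bad = spans_to_token_bio([(0, 6)], spans)

# A finer, hand-written tokenization: "Zn", "T", "-", "SP".
fine_tags, fine_bad = spans_to_token_bio([(0, 2), (2, 3), (3, 4), (4, 6)], spans)

print(coarse_tags, len(coarse_bad))
print(fine_tags, len(fine_bad))
```

With the single whitespace token, all three spans end up in the misaligned list and the only tag is "O". With the finer tokens, every span gets a tag.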

Prodigy expects you to provide the label scheme upfront and it's typically not something you'd want the annotator to decide at runtime. The presence and absence of a label can make a big difference for the model and if the labels change during annotation, this can easily lead to very inconsistent data.

You'll also need to know the label scheme if you want to take advantage of a model in the loop, because the model should be initialized with all available labels that it's going to predict.

Do you mean the labels displayed at the top of the annotation card? In theory, that's possible – you could just look at the top X possible analyses for the given text, take all entity labels of the predicted spans and then override the "labels" via the "config" setting of each individual annotation task that gets sent out.
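A rough sketch of what that per-task override could look like. Here score_labels is a stand-in for whatever your model provides (a hypothetical callable that returns a label-to-score dictionary for a text), and add_ranked_labels is just an illustrative name. The only thing taken from the description above is that each task gets its own "labels" entry under "config".

```python
def add_ranked_labels(stream, score_labels, all_labels):
    """Yield tasks whose "config" lists the labels in order of model score.

    score_labels is a hypothetical callable: text -> {label: score}.
    Labels the model doesn't score keep their original relative order
    at the end of the list (sorted() is stable).
    """
    for task in stream:
        scores = score_labels(task["text"])
        ranked = sorted(all_labels, key=lambda label: scores.get(label, 0.0), reverse=True)
        task = dict(task)                       # don't mutate the input task
        config = dict(task.get("config", {}))   # keep any existing settings
        config["labels"] = ranked
        task["config"] = config
        yield task


def toy_scores(text):
    # Hard-coded scores standing in for real model predictions.
    return {"setpoint": 0.9, "zone": 0.7} if "SP" in text else {}


ALL_LABELS = ["zone", "temperature", "setpoint"]
ranked_tasks = list(add_ranked_labels([{"text": "ZnT-SP"}, {"text": "RmT"}], toy_scores, ALL_LABELS))
for task in ranked_tasks:
    print(task["text"], task["config"]["labels"])
```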

I'm not sure this will really help with efficiency, though, since it means that the order of labels can change with every example and the annotator has to reorient themselves constantly. To me it seems more useful to just have the model pre-highlight the most confident predictions in the text, as the ner.correct recipe does. Changing the order of labels is something I can see working better for text classification, where you have labels for the whole text and could pre-select the most confident predictions and move the more uncertain labels further up.

Streams that queue up data for annotation are just Python generators under the hood that yield dictionaries – so you can implement any custom logic for selecting the examples, pre-highlighting entities using your model, sorting examples, skipping texts etc. Here's an example for active learning with a custom model.
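As a minimal sketch of such a stream: the generator below skips empty texts and pre-highlights spans predicted by a model. The predict_spans callable and its (start, end, label, score) output format are assumptions made for this example, and the lexicon lookup is only a toy stand-in for a real model.

```python
def make_stream(texts, predict_spans, threshold=0.5):
    """Yield task dicts with confident predictions pre-highlighted.

    predict_spans is a hypothetical callable:
    text -> list of (start, end, label, score) tuples.
    """
    for text in texts:
        if not text.strip():
            continue                        # skip empty texts
        spans = [
            {"start": start, "end": end, "label": label}
            for start, end, label, score in predict_spans(text)
            if score >= threshold           # keep confident predictions only
        ]
        yield {"text": text, "spans": spans}


# Toy "model": a lexicon lookup with a fixed score, for illustration only.
LEXICON = {"Zn": "zone", "SP": "setpoint"}


def toy_predict_spans(text):
    predictions = []
    for surface, label in LEXICON.items():
        start = text.find(surface)
        if start != -1:
            predictions.append((start, start + len(surface), label, 0.9))
    return sorted(predictions)


tasks = list(make_stream(["ZnT-SP", "   ", "RmT"], toy_predict_spans))
for task in tasks:
    print(task)
```

Sorting or filtering by model uncertainty would slot into the same generator, for example by only yielding texts whose scores fall in a chosen range.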