BIO tagging of character data

I'd like to confirm that Prodigy can tag character data and export it in BIO format.

i.e. given the input "ZnT-SP" it would produce the BIO tags Bzone Izone Btemperature O Bsetpoint Isetpoint

I see there's a demo of "Named Entities (character-based)" which seems to do what I want but it's not 100% clear since it's in Chinese (I think). Plus I don't see a way of seeing the BIO result using the demo?

And is it possible for the annotator to add tags? Or do they have to be created in advance? For example, in the case above, if "Zone" wasn't a predefined tag, can it be added and then applied?

And if I wanted to integrate this into an Active Learning pipeline - is it possible via an API to pre-select the most likely tags, and sort / filter the set of available tags into some sort of order based on importance (i.e. based on the context, put the more likely tags at the front of the list)?

Here's a character-based example with English text: https://prodi.gy/docs/named-entity-recognition#highlight-chars

The annotations you can export include the start and end character offset of the span, as well as the start and end token index the span refers to. You can also convert character offsets to BILUO/IOB tags programmatically – see here for an example.

However, since those tags always refer to tokens, you'd have to decide on what the tokens are in your data – for instance, if you want "ZnT" in "ZnT-SP" to be an entity span, your tokenization needs to produce a separate token for it, otherwise you won't be able to create a tag for that token and your model won't be able to learn from it.
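
To make the character case concrete, here's a minimal sketch that treats every character as its own token and turns character-offset spans into one BIO tag per character. It's plain Python, not a Prodigy API: the function name spans_to_char_bio and the lowercase label names are made up for illustration, and it assumes the spans don't overlap.

```python
def spans_to_char_bio(text, spans):
    """Return one BIO tag per character of `text`.

    `spans` are dicts with "start", "end" (exclusive) and "label",
    in the same shape as the exported annotations. Assumes no overlaps.
    """
    tags = ["O"] * len(text)
    for span in spans:
        start, end, label = span["start"], span["end"], span["label"]
        tags[start] = f"B-{label}"          # first character of the span
        for i in range(start + 1, end):     # remaining characters
            tags[i] = f"I-{label}"
    return tags


text = "ZnT-SP"
spans = [
    {"start": 0, "end": 2, "label": "zone"},
    {"start": 2, "end": 3, "label": "temperature"},
    {"start": 4, "end": 6, "label": "setpoint"},
]
for char, tag in zip(text, spans_to_char_bio(text, spans)):
    print(char, tag)
```

If your model uses a different tokenization (e.g. word pieces), the same idea applies, but you'd loop over token offsets instead of single characters.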

Prodigy expects you to provide the label scheme upfront and it's typically not something you'd want the annotator to decide at runtime. The presence and absence of a label can make a big difference for the model and if the labels change during annotation, this can easily lead to very inconsistent data.

You'll also need to know the label scheme if you want to take advantage of a model in the loop, because the model should be initialized with all available labels that it's going to predict.

Do you mean the labels displayed at the top of the annotation card? In theory, that's possible – you could just look at the top X possible analyses for the given text, take all entity labels of the predicted spans and then override the "labels" via the "config" setting of each individual annotation task that gets sent out.

I'm not sure this will really help with efficiency, though, since it means that the order of labels can change with every example and the annotator has to reorient themselves constantly. To me it seems more useful to just have the model pre-highlight the most confident predictions in the text, e.g. like the ner.correct recipe does it. Changing the order of labels is something I can see working better for text classification where you have labels for the whole text and could pre-select the most confident predictions and move the more uncertain labels further up.
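
If you do want to experiment with per-task label ordering, here's a rough sketch of the idea. It only wraps a stream and adds a "config" with "labels" to each task, as described above. The names with_ranked_labels and toy_rank_labels are hypothetical, and the ranking function is a stand-in for whatever your model predicts – it just moves labels that literally occur in the text to the front.

```python
def with_ranked_labels(stream, rank_labels, top_n=10):
    """Attach a per-task label list, most likely labels first.

    `rank_labels` is any callable mapping a text to an ordered list of
    labels (hypothetical: in practice it would come from your model).
    """
    for eg in stream:
        labels = rank_labels(eg["text"])[:top_n]
        # Keep any existing per-task config and only override "labels"
        config = {**eg.get("config", {}), "labels": labels}
        yield {**eg, "config": config}


def toy_rank_labels(text):
    # Stand-in for a model: labels whose name appears in the text go first.
    all_labels = ["AHU", "ID", "ZONE", "SETPOINT"]
    return sorted(all_labels, key=lambda label: label not in text.upper())


stream = [{"text": "zone setpoint"}, {"text": "AHU 1"}]
for task in with_ranked_labels(stream, toy_rank_labels, top_n=2):
    print(task["text"], task["config"]["labels"])
```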

Streams that queue up data for annotation are just Python generators under the hood that yield dictionaries – so you can implement any custom logic for selecting the examples, pre-highlighting entities using your model, sorting examples, skipping texts etc. Here's an example for active learning with a custom model.
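
As a very simplified illustration of that kind of custom logic, the generator below only lets through examples that a scoring function is unsure about, and stores the score in the task's "meta". The function names and the toy scorer (based on text length) are invented for the example. In a real setup the score would come from your model.

```python
def filter_uncertain(stream, score, threshold=0.2):
    """Yield only examples whose score is within `threshold` of 0.5.

    `score` is a hypothetical callable returning a probability in [0, 1].
    """
    for eg in stream:
        p = score(eg["text"])
        if abs(p - 0.5) <= threshold:
            meta = {**eg.get("meta", {}), "score": p}
            yield {**eg, "meta": meta}


def toy_score(text):
    # Placeholder for a model: longer strings get a higher "probability".
    return min(len(text) / 20, 1.0)


examples = [
    {"text": "AHU"},                              # score 0.15, skipped
    {"text": "ZnT-SP 123"},                       # score 0.5, kept
    {"text": "NAE45-1/FC-1.AHU 1.SAF-C extra"},   # score 1.0, skipped
]
for eg in filter_uncertain(examples, toy_score):
    print(eg)
```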

Thanks Ines

I'm still working through different approaches. I'm not quite following what you mean by "if you want "ZnT" in "ZnT-SP" to be an entity span, your tokenization needs produce a separate token for it, otherwise you won't be able to create a tag for that token".

I'm using the --highlight-chars option with ner_manual as I need to treat the input as a sequence of characters. It seems to work in the sense that I get back spans which identify the start and end characters - i.e.

"spans":[{"start":13,"end":16,"label":"AHU"},{"start":17,"end":18,"label":"ID"}]

I've not been able to get --patterns to work with --highlight-chars though. i.e. in the case above I want to match the sequence 'A','H','U' to the label AHU. I tried a few options but none work - i.e.

{"label": "AHU", "pattern": "AHU"}
{"label": "AHU", "pattern": [{"lower": "a"},{"lower": "h"},{"lower": "u"}]}

Do I need to create character tokens for this to work?

Also I have a large number of tags, I need to figure out how to manage this but the immediate issue is Prodigy isn't actually showing the text to annotate once the number of tags exceeds some value so I cannot use it at all. Is there any way of customising the annotation screen - I think I need the ability to apply some filters?

Ah, sorry if I phrased this in a confusing way! Since you mentioned that your goal is to create BIO-formatted annotations, I'm assuming your goal is to train a model later on, right? So at the end of the day, you need some kind of process that runs during annotation and at runtime, takes the raw text and splits it into tokens that can be labelled with I/O/B. That could be word pieces, characters (although that's maybe less efficient) or any other segments.

You can do character-based highlighting during annotation and highlight "A" in "AB-CD" – but if your tokenizer splits that string into ["AB", "-", "CD"], your model won't be able to learn anything useful and it won't be able to predict B-SOME_LABEL for "A".
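
One way to catch this early is to check whether each annotated span starts and ends on a token boundary. The helper below is just a sketch with a made-up name (misaligned_spans). It assumes the tokens appear in the text in order, and it returns the spans the given tokenization can't represent.

```python
def misaligned_spans(text, tokens, spans):
    """Return spans whose start/end don't fall on token boundaries."""
    starts, ends = set(), set()
    offset = 0
    for tok in tokens:
        start = text.index(tok, offset)   # locate token in the raw text
        starts.add(start)
        offset = start + len(tok)
        ends.add(offset)
    return [
        s for s in spans
        if s["start"] not in starts or s["end"] not in ends
    ]


text = "AB-CD"
spans = [{"start": 0, "end": 1, "label": "SOME_LABEL"}]  # just "A"

# With ["AB", "-", "CD"], the span for "A" can't be expressed as a tag
print(misaligned_spans(text, ["AB", "-", "CD"], spans))
# With one token per character, every span aligns
print(misaligned_spans(text, list(text), spans))
```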

Ah, that's kind of an edge case at the moment that doesn't work for characters! But you could implement your own pre-highlighting, which should be straightforward in your case since you're matching single characters and can use simple regular expressions. Under the hood, pattern matches are just "spans" that are added to the incoming examples – so for each sequence, you can add a span with "start", "end" and "label", and it will be pre-highlighted in the UI.
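
Here's a small sketch of what that pre-highlighting could look like. It wraps the stream and adds a span for every regex match. The function name add_pattern_spans, the PATTERNS mapping and the two regular expressions are assumptions for illustration (the "ID" pattern simply takes digits that follow "AHU "), and it doesn't try to resolve overlapping matches.

```python
import re

# Hypothetical label -> regex mapping; adjust to your own abbreviations.
PATTERNS = {
    "AHU": re.compile(r"AHU", re.IGNORECASE),
    "ID": re.compile(r"(?<=AHU )\d+", re.IGNORECASE),
}


def add_pattern_spans(stream):
    """Add a pre-highlighted span to each example for every regex match."""
    for eg in stream:
        spans = list(eg.get("spans", []))
        for label, regex in PATTERNS.items():
            for match in regex.finditer(eg["text"]):
                spans.append(
                    {"start": match.start(), "end": match.end(), "label": label}
                )
        # Sort spans by position in the text
        spans.sort(key=lambda s: (s["start"], s["end"]))
        yield {**eg, "spans": spans}


stream = [{"text": "NAE45-1/FC-1.AHU 1.SAF-C"}]
for eg in add_pattern_spans(stream):
    print(eg["spans"])
```

In a custom recipe, you'd apply a wrapper like this to the stream before it's sent out for annotation.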

How many labels do you have? And do you have many labels where the sequence of characters maps to the label? Like "AHU" → AHU? It seems like that could just be one single label that you apply to all of these spans?

As a rule of thumb, if you have so many labels that it becomes a pain to manage them, it's typically a sign that you can simplify and revise your label scheme to make the problem easier to train and easier to annotate. But it really depends on the specifics of what you want to train and how you're postprocessing the annotations for training. (I'm assuming the goal isn't to actually train, say, an NER model with all the labels and have it predict "AHU" → AHU, which would likely be very inefficient.)

Hi Ines,

I have several hundred labels (tags, as they're known in the application field). I'm trying to convert sensor names, where the name has been assigned by a human with a limited number of characters, to a set of tags from a standard ontology. It's basically a parser - the complexity is that I have to extract named entities out of the text as well as generate a set of tags.

So I might have "NAE45-1/FC-1.AHU 1.SAF-C", which translates to AHU ID:1 SUPPLY AIR FLOW COMMAND. Regex can help, but the abbreviations aren't consistent and there are thousands to test.

The way this has been approached in the past is to use a CRF decoder with BIO tagging, then a second MLP to map to the final representation, so I'm hoping to replicate that. It's been shown to be much more accurate than a hand-crafted parser. I think the main reason the BIO tagging is used is to extract the named entities. I'm considering an alternative approach where I use seq2seq to map the input chars to "AHU SUPPLY AIR FLOW COMMAND" and a separate NER model to extract the entity name ("1"), which could be easier to annotate. Plus I've found a couple of seq2seq papers which use a "copy" mechanism to extract literal text from the input sequence if required, so that might help.

I guess you could categorise the problem as machine translation, from one language using character tokens to another using word tokens, where the annotator must supply the translations?