I want to pre-define the spans to be annotated. I'm currently using a spans_manual view but it doesn't seem like you can highlight pre-defined spans and restrict the user from highlighting these. I'm ok with them manually still having to do highlights but I want to differentiate the spans we are interested in from the remaining text (such as giving it its own color).
Is there a way to do that? I tried something hacky: adding TODO in front of the words I want in a different color, then removing the TODO on prodigymount and setting the color and font-weight style properties. This works but causes huge issues when I accept and move on to the next annotation: the bolded/colored text carries over and is not removed from the DOM. I'm OK with this hacky approach, but I don't know how to properly remove the bolded/colored spans from the DOM when the view changes.
An alternative solution is just to find the code that replaces the annotation spans and make sure it gets rid of the ones that I've changed. Here is the change that I've made:
I am passing in
"Pt was admitted to the TODOID TODOservice" as the text and then doing this on prodigymount:
var content = document.getElementsByClassName("prodigy-content")[0];
var spans = content.getElementsByTagName("span");
for (var i = 0; i < spans.length; i++) {
    var span = spans[i];
    if (span.textContent.startsWith("TODO")) {
        span.textContent = span.textContent.replace("TODO", "");
        span.style.color = "green";
        span.setAttribute("marked", true);
    }
}
Then when the accept or reject button is clicked I have tried to do the following:
var content = document.getElementsByClassName("prodigy-content")[0];
var spans = content.getElementsByTagName("span");
for (var i = 0; i < spans.length; i++) {
    var span = spans[i];
    if (span.hasAttribute("marked")) {
        span.removeAttribute("marked");
        span.textContent = "ISENT" + span.textContent;
        span.style.removeProperty("color");
    }
}
Yet this is not working properly. I also tried span.removeElement() but that didn't work either.
You can pre-fill examples by adding spans to your example in a custom recipe, and you can customise the label color either in the config section of your custom recipe or in the prodigy.json file.
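As a rough sketch of the recipe-config route: the dictionary a custom recipe returns under "config" might carry the label colors as shown here. The `custom_theme` / `labels` layout reflects my understanding of Prodigy's theme settings, so please check it against the docs for your version. `build_config` is a hypothetical helper for illustration, not part of Prodigy.

```python
def build_config(labels, colors):
    """Build a recipe-style config dict with per-label colors.

    Assumption: Prodigy reads label colors from config["custom_theme"]["labels"].
    build_config itself is a hypothetical helper, not a Prodigy function.
    """
    # Catch typos early: every colored label should be one of the defined labels.
    unknown = sorted(set(colors) - set(labels))
    if unknown:
        raise ValueError(f"Colors given for undefined labels: {unknown}")
    return {
        "labels": list(labels),
        "custom_theme": {"labels": dict(colors)},
    }


config = build_config(["DAY"], {"DAY": "#ab00ff"})
```

The resulting dict would go under the "config" key of the recipe's return value, or the same keys could be placed in prodigy.json.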
This Prodigy short gives an example of how that works, both for images and (at the end) spans.
Thanks for the reply! Unfortunately, our use case is different. We aren't interested in assigning certain colors to labels, but rather, just indicating that we want users to label specific spans of text. Our task is basically entity classification where we define the entities.
So if the input is
Pt complains of headache and has history of migraines.
We want to tell our annotators to only label headache and history of migraines (predefined spans of text that we are labeling).
We want headache and history of migraines to stand out visually from the other text to indicate what the annotators should be labeling. Is this possible? I am OK with my hacky solution of prepending an indicator to each word in a highlighted span, but it seems to cause a lot of issues when the next example is loaded.
Hi @griff4692,
Sorry for the late reply. Maybe the following approach solves your problem:
You can add a "style" key to each token to apply custom CSS to it. I used the following recipe:
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

nlp = spacy.blank("en")


def insert_style_key(example):
    """Color every token that falls inside one of the pre-defined spans."""
    tokens = []
    for token in example["tokens"]:
        t = dict(token)
        # token_start/token_end are set on each span by add_tokens; treated as inclusive here
        if any(s["token_start"] <= t["id"] <= s["token_end"] for s in example.get("spans", [])):
            t["style"] = {"color": "#ab00ff"}
        tokens.append(t)
    example["spans"] = []  # in case you don't want to have the spans to be pre-highlighted
    example["tokens"] = tokens
    return example


@prodigy.recipe(
    "data-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    path=("Path to jsonl", "positional", None, str),
)
def data_review_recipe(dataset, path):
    def get_stream(stream):
        for s in stream:
            yield insert_style_key(s)

    stream = JSONL(path)
    stream = add_tokens(nlp, stream)  # adds "tokens" and syncs the spans to token indices
    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": get_stream(stream),
        "config": {
            "labels": ["DAY"],  # the labels for the manual NER interface
        },
    }
My data looked like this, having pre-defined spans and text.
{"text": "TLDR: Worst experience ever, never ordering from here again.", "spans": []}
{"text": "My order arrived last Tuesday.", "spans": [{ "start": 22, "end": 29, "label": "DAY" }]}
Using this recipe, this is how the second text, having a pre-defined span, looks in prodigy:
Thanks for this - I will try this out and get back to you but looks promising!! Is there a way to avoid defining a label on the spans? It looks like each span must be provided a "label" column but I don't want to bias the annotators - just want the text to appear in a different color.
For the pre-defined spans you do not necessarily need a label column to make the recipe work. For example, the recipe I posted does not require a label attribute.
Another idea could be to delete the spans from the input after the tokens have been colored. If you leave the spans in the input, your annotators would have to deselect the pre-defined span and select it again. In my recipe above, I overwrite the pre-defined spans with an empty list such that the tokens inside the spans are colored but not already marked as spans inside of prodigy.
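To make the two points above concrete (no label needed, and spans cleared after coloring), here is a minimal sketch that runs without Prodigy. It compares character offsets rather than token_start/token_end, and `simple_tokens` is a hypothetical stand-in tokenizer that produces simplified token dicts; in a real recipe the tokens would come from add_tokens.

```python
import re


def simple_tokens(text):
    """Stand-in tokenizer (assumption): words and punctuation with char offsets."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


def color_span_tokens(example, color="#ab00ff"):
    """Color tokens fully covered by a span, then drop the spans themselves."""
    styled = []
    for token in example["tokens"]:
        t = dict(token)
        inside = any(
            s["start"] <= t["start"] and t["end"] <= s["end"]
            for s in example.get("spans", [])
        )
        if inside:
            t["style"] = {"color": color}
        styled.append(t)
    # Return a copy with empty spans so nothing is pre-highlighted in the UI.
    return {**example, "tokens": styled, "spans": []}


text = "My order arrived last Tuesday."
example = {
    "text": text,
    "tokens": simple_tokens(text),
    "spans": [{"start": 22, "end": 29}],  # no "label" key needed
}
result = color_span_tokens(example)
colored = [t["text"] for t in result["tokens"] if "style" in t]
print(colored)  # ['Tuesday']
```

Because the spans list is emptied in the returned example, the annotator only sees the colored token and still has to highlight it manually.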
Thanks - This looks really nice and almost fully works for me. I'm getting an error about an invalid span
File "cython_src/prodigy/components/preprocess.pyx", line 246, in prodigy.components.preprocess.sync_spans_to_tokens
ValueError: Mismatched tokenization. Can't resolve span to token index 1335. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.
{'start': 1320, 'end': 1335, 'token_start': 271}
Any advice on how to debug this? I write spans to the .jsonl that match my pre-selected boundaries, but I don't tokenize with spaCy until recipe.py.
This error usually occurs when the boundaries of your tokens and spans don't line up, e.g. when the end of a span points to a character inside a token rather than to the end of that token. You can fix it by adjusting the span's end offset, for example, so that it falls on a token boundary.
If that doesn't solve the problem, could you share the line that throws the error?
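One way to track down the offending spans before they reach the recipe is to check every span against the token boundaries yourself. The sketch below assumes token dicts with "start" and "end" character offsets, and both helper names are hypothetical. For the check to be meaningful, the tokens need to come from the same tokenizer the recipe uses.

```python
def find_misaligned_spans(example):
    """Return spans whose start or end does not fall on a token boundary."""
    starts = {t["start"] for t in example["tokens"]}
    ends = {t["end"] for t in example["tokens"]}
    return [
        s for s in example.get("spans", [])
        if s["start"] not in starts or s["end"] not in ends
    ]


def snap_to_tokens(span, tokens):
    """Expand a span outward to the boundaries of the tokens it overlaps."""
    overlapping = [
        t for t in tokens if t["start"] < span["end"] and t["end"] > span["start"]
    ]
    if not overlapping:
        return None  # the span does not touch any token
    return {**span, "start": overlapping[0]["start"], "end": overlapping[-1]["end"]}


# Hand-written tokens for "My order arrived last Tuesday."
tokens = [
    {"text": "My", "start": 0, "end": 2, "id": 0},
    {"text": "order", "start": 3, "end": 8, "id": 1},
    {"text": "arrived", "start": 9, "end": 16, "id": 2},
    {"text": "last", "start": 17, "end": 21, "id": 3},
    {"text": "Tuesday", "start": 22, "end": 29, "id": 4},
    {"text": ".", "start": 29, "end": 30, "id": 5},
]
bad_example = {"tokens": tokens, "spans": [{"start": 22, "end": 27}]}  # ends inside "Tuesday"

misaligned = find_misaligned_spans(bad_example)
fixed = [snap_to_tokens(s, tokens) for s in misaligned]
```

Here `misaligned` contains the span ending at 27 and `fixed` moves its end to 29. If you tokenize with spaCy directly, `doc.char_span(start, end)` returning None for a span is a similar signal that its offsets don't match the tokenization.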