Hi! I would like to extract entities from a text and change the entity boundaries manually when visualizing them in Prodigy. As I don’t need a model for the moment, I am using the mark recipe. However, in order to change the boundaries manually, I would need to use the ner_manual interface and, as far as I understand, I would need to have the following format for my entities:
But I have only the spans and not the tokens so I can only use the ner interface which doesn’t allow me to change the boundaries for the entities manually. This is the format that I have:
{
  "text": "Apple updates its analytics service with new metrics",
  "spans": [
    {"start": 0, "end": 5, "label": "ORG"}
  ]
}
I would like to know how I can get the tokens for the entities so that I am able to use the ner_manual interface. Or maybe there is another option for manually changing the entity boundaries that I could consider.
Hi! What happens if you just load in your data the way it is? If no "tokens" are present, Prodigy will use the model’s tokenizer to generate them automatically for you.
The only thing that’s important is that your entity spans align with the token boundaries – for instance, “Apple” is fine, because that’ll be one token. But the span “Apple up” wouldn’t because “up” isn’t a standalone token. If the pre-annotated spans don’t align, Prodigy will raise an error and show you the mismatch, so you can fix it. It usually shouldn’t happen, though – unless your data is very noisy.
Hi! In fact, that is the problem: my entity spans don’t align with the token boundaries (sorry that I didn’t mention it before), and this is the error I get if I use ner_manual:
ERROR: Invalid task format for view ID 'ner_manual'
'tokens' is a required property
'token_start' is a required property [spans -> 0]
'token_end' is a required property [spans -> 0]
'token_start' is a required property [spans -> 1]
'token_end' is a required property [spans -> 1]
Ah, okay, that makes sense then. Are you just using mark? I think I misread your initial question and thought you were using the built-in ner.manual recipe, which does take care of the tokenization automatically.
If you need your own custom tokens that align with your entity spans, then you also need to provide them. It might be worth writing a little script to check how many of the spans do not align – maybe it’s just one or two that you can easily correct manually (or exclude from your data).
An easy way to do this is to use spaCy’s Doc.char_span method, which creates a token span from character offsets. If the character offsets don’t align to the tokens, it returns None. So you can do something like this:
import spacy

nlp = spacy.load("en_core_web_sm")  # or other model
for example in examples:  # your existing examples
    doc = nlp(example["text"])
    for span in example["spans"]:
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to tokens
            print("Misaligned tokens", example["text"], span)
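Once the spans are known to align, the "tokens", "token_start" and "token_end" keys named in the error message can be built from the character offsets. Here is a minimal sketch of that idea. It uses a simple regex as a stand-in tokenizer (an assumption, so the snippet runs without a model), and the add_tokens helper is a hypothetical name, not part of Prodigy. It also assumes that "token_end" is the index of the last token in the span (inclusive), which is how I understand the ner_manual format.

```python
import re


def add_tokens(example):
    """Return a copy of the task with "tokens" and token-indexed spans added."""
    text = example["text"]
    tokens = []
    # Stand-in tokenizer (assumption): runs of word characters, or single
    # punctuation marks. Swap in your own tokenizer's offsets here.
    for i, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append(
            {"text": match.group(), "start": match.start(), "end": match.end(), "id": i}
        )

    # Map character offsets to token indices
    starts = {token["start"]: token["id"] for token in tokens}
    ends = {token["end"]: token["id"] for token in tokens}

    spans = []
    for span in example.get("spans", []):
        if span["start"] not in starts or span["end"] not in ends:
            raise ValueError(f"Misaligned span: {span}")
        spans.append(
            {
                **span,
                "token_start": starts[span["start"]],
                "token_end": ends[span["end"]],  # inclusive index (assumption)
            }
        )
    return {**example, "tokens": tokens, "spans": spans}


example = {
    "text": "Apple updates its analytics service with new metrics",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}
task = add_tokens(example)
```

With the example above, the "Apple" span ends up with "token_start" and "token_end" both set to 0, and a span like "Apple up" (characters 0 to 8) raises an error instead of being silently passed through.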
Sorry that this is a little confusing. The --label setting in the mark recipe really only adds a top-level "label" key to the tasks. For manual annotation, you want to provide a "label set" for the entire session. This is usually done by adding a list of "labels" to the "config" returned by the recipe.
The mark recipe is pretty agnostic to what you're annotating. It just streams whatever comes in and renders it. So it currently doesn't have any special case rules that add the labels to the config if it's a manual recipe – that's really what the more task-specific recipes like ner.manual do.
(You can also take a look at the source of the recipe files included in Prodigy – or check out the slightly simplified versions with comments in our prodigy-recipes repo. There you can see what the different recipes do under the hood, and how they use the arguments you pass in on the command line.)
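To make the "labels" in the "config" idea a bit more concrete, here is a rough sketch of what a small custom recipe could return. It is written as a plain function so it runs on its own; in an actual recipe file it would be registered with the @prodigy.recipe decorator and the stream would come from one of the loaders. The function name, dataset name and labels are all made up for illustration.

```python
def mark_ner_manual(dataset, stream, labels):
    """Hypothetical recipe body: returns the components dict for a session."""
    return {
        "dataset": dataset,      # dataset to save the annotations to
        "stream": stream,        # iterable of tasks with "text", "tokens", "spans"
        "view_id": "ner_manual", # interface used to render the tasks
        "config": {"labels": list(labels)},  # label set for the whole session
    }


# Placeholder values; a real stream would hold your pre-tokenized examples
components = mark_ner_manual("my_dataset", [], ["ORG", "PERSON"])
```

The difference from --label in mark is that the labels live in the session config rather than on each task, so the interface can offer all of them for highlighting.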
I will have a look at it. I am starting with Prodigy and I am still at the beginning of the exploration of the different possibilities the tool can offer.
What is sure is that you are helping me a lot! It is much appreciated!