revising annotation by prodigy--here only one label (DATE)

solved
usage
ner
(Ya) #1

Dear all,

I annotated the data with regex and converted it into the format Prodigy expects, as you can see:

{"text":"As shown in Figure 2B, the Sun is assumed to be at the center of the planetary system.","spans":   [{"start":11,"end":20,"label":"DATE"}]}

As I understand it, in order to manually revise my annotations I need to run:

python -m prodigy dataset ner_01
python -m prodigy ner.manual ner_01 en_core_web_sm NER_01.jsonl --label "DATE"

but every time I tried this, it failed after a few sentences with this error:
"
Mismatched tokenization. Can't resolve span to token index … This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.
"
My data of course has pre-set spans … I thought that because it already has labels, I do not need to pass the label name, so I tried this:

python -m prodigy ner.manual ner_01 en_core_web_sm NER_01.jsonl

Then I got all of the model's labels, like ORG, which is not what I want… Could you let me know what is wrong?

Best

(Ines Montani) #2

The “mismatched tokenization” error means that one or more "spans" you’ve added to your data do not map to the tokens assigned by the model’s tokenizer. For example, let’s say you have a sentence like "Next Monday~" and you’ve labelled the span for Monday – but the tokenizer only splits the text into two tokens: ["Next", "Monday~"]. Then your annotations will never be “true”, because there will never be a token "Monday" in that example. If you’re training a model with those examples, that would be pretty bad, because it wouldn’t learn anything.

TL;DR: All "spans" should map to token boundaries. If you know what the tokenization should be (e.g. you already have the pre-tokenized text), you can also provide a "tokens" key in the data that tells Prodigy how the text should be split.

You can also use the method described here to find the examples with mismatched tokenization – maybe it turns out it’s only one or two texts with very specific punctuation or extra whitespace.
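A rough pre-check along these lines can be sketched with the standard library alone. This is only a heuristic under stated assumptions: it does not run spaCy's tokenizer, it just flags spans that start or end on whitespace or that cut through a run of letters or digits. The name suspicious_spans is hypothetical and not part of Prodigy or spaCy.

```python
def suspicious_spans(task):
    """Return (span, covered_text) pairs whose boundaries look misaligned."""
    text = task["text"]
    flagged = []
    for span in task.get("spans", []):
        start, end = span["start"], span["end"]
        covered = text[start:end]
        # Leading or trailing whitespace usually does not line up with a token.
        has_edge_space = covered != covered.strip()
        # A boundary between two alphanumeric characters cuts a word in half.
        cuts_start = 0 < start < len(text) and text[start - 1].isalnum() and text[start].isalnum()
        cuts_end = 0 < end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
        if not covered or has_edge_space or cuts_start or cuts_end:
            flagged.append((span, covered))
    return flagged


task = {
    "text": "Therefore, the time of opposition is 17 hours 20 minutes before "
            "March 29, at 21:43, the time when the observation was made.",
    "spans": [{"start": 63, "end": 72, "label": "DATE"}],
}
for span, covered in suspicious_spans(task):
    print(span["start"], span["end"], repr(covered))
# 63 72 ' March 29'
```

To check a whole file, each line of the JSONL could be parsed with json.loads and passed to the same function. Spans that pass this check can still disagree with the model's tokenizer, so it narrows the search rather than replacing a check against the real tokens.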

That’s expected. If you do not specify a --label on the command line, Prodigy will use the labels that are present in the model. If you’re running ner.manual, you usually want to specify one or more label options via the --label argument. For example, you could have your examples pre-labelled with DATE, and then manually add another label PERSON. In that case, you could write --label DATE,PERSON, and you’d be able to select those labels.

(Ya) #3

Dear Ines,

Does that mean that

  python -m prodigy dataset ner_01
  python -m prodigy ner.manual ner_01 en_core_web_sm NER_01.jsonl --label "DATE"

is correct for that task?

Another example from my data looks like this:

{"text":"Therefore, the time of opposition is 17 hours 20 minutes before March 29, at 21:43, the time when the observation was made.","spans":[{"start":63,"end":72,"label":"DATE"}]}

Does that mean that Prodigy automatically treats, for example, “March 29,” rather than “March 29” as a token?
Is there any way to turn this off?

Regarding providing the "tokens" key, can you elaborate?

I would rather import this version and annotate it. Is there any other way, like removing all signs such as “,” and other extra characters?

Best

(Ines Montani) #4

Yes, the commands you ran are correct :+1:

Prodigy itself doesn’t have an opinion on how things should be tokenized – but if your data doesn’t include any "tokens", it will use spaCy to add them. It’ll run the model over the text, get the individual tokens and then try to match up your span offsets with the existing tokens. (You can find an example of the "tokens" format in the “Annotation task formats” section in your PRODIGY_README.html btw.)
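For illustration, a task with a "tokens" key might be built roughly like this. It is a sketch, assuming the token format (text, start, end, id) and the inclusive token_start / token_end span keys described in the Prodigy docs. simple_tokenize and add_tokens are hypothetical helpers, not part of Prodigy or spaCy, and the regex tokenizer is deliberately naive.

```python
import re


def simple_tokenize(text):
    """Hypothetical tokenizer: runs of word characters, or single punctuation marks."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


def add_tokens(task):
    """Attach "tokens" to a task and resolve each span to token indices."""
    tokens = simple_tokenize(task["text"])
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    for span in task.get("spans", []):
        if span["start"] not in starts or span["end"] not in ends:
            raise ValueError(f"Span {span} does not match token boundaries")
        span["token_start"] = starts[span["start"]]
        span["token_end"] = ends[span["end"]]  # index of the last token in the span
    task["tokens"] = tokens
    return task


task = add_tokens({
    "text": "Next Monday~",
    "spans": [{"start": 5, "end": 11, "label": "DATE"}],
})
print([t["text"] for t in task["tokens"]])
# ['Next', 'Monday', '~']
print(task["spans"])
# [{'start': 5, 'end': 11, 'label': 'DATE', 'token_start': 1, 'token_end': 1}]
```

If a model is trained on these annotations later, its tokenizer would need to produce the same boundaries, so supplying custom tokens like this probably only pays off when the intended tokenization is already known.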

Normally, this is not a problem, because what’s considered a “word” is often quite consistent. But there are always edge cases, especially if you’re working with numbers and punctuation, or if your data isn’t annotated cleanly. Here’s how spaCy’s English model tokenizes that text:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Therefore, the time of opposition is 17 hours 20 minutes before March 29, at 21:43, the time when the observation was made.")
print([token.text for token in doc])
# ['Therefore', ',', 'the', 'time', 'of', 'opposition', 'is', '17',
# 'hours', '20', 'minutes', 'before', 'March', '29', ',', 'at',
# '21:43', ',', 'the', 'time', 'when', 'the', 'observation',
# 'was', 'made', '.']

This looks pretty reasonable and correct. However, if you check out the character span 63:72, you’ll get the following:

' March 29'

As you can see, it includes a ' ' at the beginning, which is probably not intended, right? So spaCy will be unable to create an entity span at character 63, because that’s not a valid token. However, if you change the span to 64:72, it should work as expected, because those characters map to a valid sequence of tokens:

'March', '29'

If your other examples look like that, too, and you don’t have too many misaligned tokens, it probably makes sense to fix the character offsets by hand and clean up your annotations that way. Having trailing or leading whitespace in your annotations can easily lead to other problems later on (even if you’re not using spaCy or Prodigy or are planning to train a model).

(Ya) #5

Thank you for your response. I got a bit confused. I guess the problem here is that we have “,” after “March 29”:

March 29,

Am I right?

I see two kinds of solutions. It would be nice if you could help me go further with each one, or help me decide which one is better.

I am coming to the conclusion that I should either:

- add tokens somehow with regex (it seems a bit hard for me, although I have some code that can give me my tokens)

or

- forget about regex and annotate the raw data with spaCy. If I do that and annotate the data with an annotator and your framework, is there a way to improve the annotations with spaCy (Prodigy) afterwards? Or will I have the same problem with tokens? (We have 6000 sentences.)

Which one do you recommend?

(Ines Montani) #6

Sorry, I just realised that my example above wasn’t directly referencing the example you posted. So I just updated it. And no, March 29, is correctly split into three tokens: 'March', '29', ','. So a span describing “March 29,” will also map to valid tokens.

From what you’ve posted, I think the problem is the span {"start": 63, "end": 72, "label": "DATE"}, which describes the characters ' March 29'.

text = "Therefore, the time of opposition is 17 hours 20 minutes before March 29, at 21:43, the time when the observation was made."
print(text[63:72])
# ' March 29'

The character at 63 doesn’t map to a token, because it’s whitespace. So the correct span here would be 64:72. Even outside of the context of spaCy or Prodigy, you’d probably want your data to annotate 'March 29', not ' March 29'.

I would suggest that you try my code here to find the spans that weren’t cleanly annotated and fix them by hand (e.g. by adjusting the character offsets). Hopefully, there aren’t too many of them and it’s not too much work.
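If many spans only differ by stray whitespace, the offset adjustment could also be scripted. A minimal sketch, assuming the only defect is leading or trailing whitespace inside the span; trim_span is a hypothetical helper name:

```python
def trim_span(text, span):
    """Return a copy of the span with surrounding whitespace excluded."""
    start, end = span["start"], span["end"]
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return {**span, "start": start, "end": end}


text = ("Therefore, the time of opposition is 17 hours 20 minutes before "
        "March 29, at 21:43, the time when the observation was made.")
span = {"start": 63, "end": 72, "label": "DATE"}
fixed = trim_span(text, span)
print(fixed, repr(text[fixed["start"]:fixed["end"]]))
# {'start': 64, 'end': 72, 'label': 'DATE'} 'March 29'
```

This only repairs whitespace at the edges. A span that is shifted as a whole, so that it also cuts off the last character of the date, would still need to be regenerated or fixed by hand.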

(Ya) #7

I just noticed that the problem actually seems to be different. When I got to this sentence:

I, as you see in Ch.

{'text': 'I, as you see in Ch.', 'spans': []},

I got this error:

 ValueError: Mismatched tokenization. Can't resolve span to token index 20. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 3, 'end': 20, 'label': 'DATE', 'token_start': 1}

But there is no span in this sentence! Could you let me know the reason for this error?

Best

(Ines Montani) #8

How do you know it’s that example? Because the one pointed out in the error message is this span:

{'start': 3, 'end': 20, 'label': 'DATE', 'token_start': 1}

Can you find this somewhere in your data?

(Ya) #9

You are right, that was not the case. I tried a lot of other examples with different patterns, and normally it fails after a while.
I used this pattern:

re.compile(r'\d{4} [A-Z][a-z.]+ \d{2}') 

For example, it fails here:

 On 1595 October 30 at 8h 20m, they found Mars at 17° 48’ Taurus, with a diurnal motion of 22’ 54” ^15.',

'spans': [{'start': 3, 'end': 18, 'label': 'DATE'}]},

 {'start': 3, 'end': 18, 'label': 'DATE', 'token_start': 1}

I got the error. Sometimes it works, sometimes it does not.
For instance, here it works:

       {'text': 'On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations,  by which term I wish the diurnal parallaxes and the refractions to be understood in what follows.',

'spans': [{'start': 3, 'end': 19, 'label': 'DATE'}]}
I got a bit confused, but there should be a way.

(Ya) #10

Or here, where there is a problem again:

{'text': 'But on January 24/February 3 at the same time it was at 6° 18’ Leo.',

'spans': [{'start': 7, 'end': 14, 'label': 'DATE'}

with this simple pattern:

 re.compile(r'January|February|March|April|May|June|July|August|September|October|November|December') 

ValueError: Mismatched tokenization. Can't resolve span to token index 18. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 18, 'end': 26, 'label': 'DATE'}

So the problem is the “/”.

Editing everything manually is very time consuming. Is there any suggestion?

Many thanks

(Ines Montani) #11

Is it possible that the logic you use inserts or doesn’t trim whitespace at the beginning of the strings and causes an off-by-one error? Your logic seems to consistently produce spans that start at the space character before the words you want to annotate. For example:

text = " On 1595 October 30 at 8h 20m, they found Mars at 17° 48’ Taurus, with a diurnal motion of 22’ 54” ^15." 
print(text[3:18])
# ' 1595 October 3'

Note the space at the beginning of the text. If you remove that, the character offsets are correct. I’d definitely recommend trying to fix examples like that. This has absolutely nothing to do with Prodigy or spaCy – it’s really just about having clean data so you’ll be able to train a good model later on.
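One way to keep offsets and text in sync is to clean the string first and then run the regex on exactly the string that ends up in "text". A minimal sketch, assuming the date pattern quoted earlier in the thread; make_task and DATE_PATTERN are hypothetical names, and the example sentence is shortened:

```python
import re

# Pattern taken from the thread: four-digit year, month name, two-digit day.
DATE_PATTERN = re.compile(r"\d{4} [A-Z][a-z.]+ \d{2}")


def make_task(raw_text, pattern=DATE_PATTERN, label="DATE"):
    """Build a task whose offsets refer to exactly the string stored in "text"."""
    text = raw_text.strip()  # clean once, before matching
    spans = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in pattern.finditer(text)
    ]
    return {"text": text, "spans": spans}


task = make_task(" On 1595 October 30 at 8h 20m, they found Mars.")
span = task["spans"][0]
print(span, repr(task["text"][span["start"]:span["end"]]))
# {'start': 3, 'end': 18, 'label': 'DATE'} '1595 October 30'
```

Because the match offsets come from the stripped string, 3:18 now covers the full date. The same offsets applied to the unstripped string give ' 1595 October 3', which is the mismatch shown above.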

(Ya) #12

It seems the problem is solved :slight_smile: Thank you very much. At least up to sentence 500 we do not have any problems.
I used this rule:

r'\d{4} [A-Z][a-z.]+ \d{2}|\d{2} [A-Z][a-z.]+ \d{4}|\d{4} [A-Z][a-z.]+ \d{1}'

Then I only

Is there any way to see the inconsistencies before using Prodigy?

(Ines Montani) #13

Glad to hear it worked! :+1: And yes, you can use this script here, which I linked in my previous post :slightly_smiling_face:

(Ya) #14

Many thanks for your responses and hints. It works for a simple pattern. (I manually rewrote the signs that caused problems, like:

January 24/25 ---> January 24 till 25 (which is not ideal, since I am not sure what the writer means).)

Since I will have more kinds of patterns, it is very possible that I will run into signs like “,?/” after those. Is there any other way that you can recommend?

My second question: as far as I understand, after finishing the annotation I need to export my data:

prodigy db-out your_dataset > annotations.jsonl

Could you let me know the workflow after this? I need to train a model for each label, am I right?

And then merge the annotated data together in some way?

Many, many thanks for your smart answers.
Best
Mohammad