Hi @Fangjian,
I can't see the full traceback, but I suspect the error originates in these lines in the `ner.correct` recipe, where the recipe tries to process the existing (manually annotated) spans with spaCy:

```python
for span in eg.get("spans", []):
    spans.append(doc.char_span(span["start"], span["end"], span["label"]))
```
The issue here is that `doc.char_span()` can return `None` when it fails to create a valid span (due to tokenization mismatches or invalid character positions), but the code later tries to access `.end` on these `None` objects.
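To see the mismatch in isolation, here's a minimal sketch (using a blank English pipeline rather than your `en_core_sci_scibert` model) of when `char_span` returns `None`, plus the `alignment_mode="expand"` argument, which you can use to inspect what token-aligned span a misaligned annotation would snap to:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("tokenization matters")

# Offsets that stop mid-token cannot form a valid Span with the default
# strict alignment, so char_span returns None
print(doc.char_span(0, 5, "WORD"))  # None ("token" sits inside "tokenization")

# Offsets that line up with token boundaries work fine
print(doc.char_span(0, 12, "WORD"))  # tokenization

# alignment_mode="expand" snaps misaligned offsets outward to token boundaries,
# which is handy for inspecting how far off an annotation is
print(doc.char_span(0, 5, "WORD", alignment_mode="expand"))  # tokenization
```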
This means that your `thedataset` dataset has some spans that are not aligned with the tokenization. I can see you used the `highlight-chars` option in your `ner.manual` command, and that's the likely cause of the misalignment. `highlight-chars` only modifies the spans; it doesn't modify the underlying tokenization. Please see the warning box in the docs here. The main use case for `highlight-chars` is to systematically collect examples for modifying the tokenizer's rules or to inform the data preprocessing procedures. If you are not planning to modify the tokenizer (e.g. by adding custom rules based on the cases where you needed character-level tokenization), there's not much point in annotating with `highlight-chars`. Your span annotations must be aligned with tokens, otherwise the model will never be able to predict these spans.
I recommend you filter out the misaligned examples and reannotate them using `en_core_sci_scibert` as you did before, but without the `highlight-chars` option.
You can use this simple script, which tests whether a valid spaCy span can be formed from the tokenization and the annotated span offsets:
```python
import spacy
import srsly

def clean_annotations(input_file, output_file_valid, output_file_reannot, model_name):
    nlp = spacy.load(model_name)
    to_reannotate = []
    valid_examples = []
    input_data = srsly.read_jsonl(input_file)
    for example in input_data:
        if "spans" not in example:
            valid_examples.append(example)
            continue
        # Process the text to check span validity
        doc = nlp(example["text"])
        has_invalid_span = False  # Flag to track if any span is invalid
        for span in example["spans"]:
            # Check if char_span would return None
            char_span = doc.char_span(span["start"], span["end"], span["label"])
            if char_span is None:
                print(f"Detected invalid span: {span} in text: '{example['text'][:100]}...'")
                to_reannotate.append(example)
                has_invalid_span = True
                break
        # Only add to valid if no invalid spans were found
        if not has_invalid_span:
            valid_examples.append(example)
    srsly.write_jsonl(output_file_valid, valid_examples)
    srsly.write_jsonl(output_file_reannot, to_reannotate)

# Usage
clean_annotations("thedataset.jsonl", "valid_annotations.jsonl", "examples_to_reannotate.jsonl", "en_core_sci_scibert")
```
You can export your `thedataset` dataset with the `db-out` command, which will save it as a JSONL file on disk. The script will produce two files: `valid_annotations.jsonl` and `examples_to_reannotate.jsonl`. Once you have reannotated `examples_to_reannotate.jsonl` with `ner.manual`, you can merge the reannotated dataset with `valid_annotations.jsonl` and use the result for the next stage.