I’m training a NER model with several entity types. I tagged each entity type as a separate task, so I now have several versions of my dataset, each annotated with a different entity type.
I want to merge the datasets so I have a single dataset with all the annotations. However, running `prodigy db-merge` concatenates the datasets rather than merging them.
When I train a NER model on the resulting dataset, Prodigy drops most of the samples and creates a model with only the first NER tag found.
Is this expected behaviour?
Welcome to the forum @jhandsel!
Indeed, the `db-merge` command just concatenates the datasets; it does not merge the annotated spans of examples with the same `_input_hash`, so what you're observing after running `db-merge` is correct.
Prodigy only merges the annotations before training with `train` and exporting with `data-to-spacy`. These two commands also take care of resolving conflicting annotations, e.g. overlapping spans, by selecting the longer span. So you might as well store your datasets separately and only merge them when you're ready to train.
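In fact, both `train` and `data-to-spacy` accept multiple datasets directly, so you can pass all your per-entity datasets at once, e.g. `prodigy train ./output --ner dataset_one,dataset_two,dataset_three` (using your dataset names, of course).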
Now, the reason why some annotations are being ignored could be that the annotations produced in different annotation rounds resulted in overlapping spans, which is not allowed in NER (each token can only belong to one entity). If you need custom conflict resolution logic, you'd need to merge the spans via a custom function. For this, you want to process your merged dataset by grouping examples by `_input_hash` and merging all spans into a single list:
```python
from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ["dataset_one", "dataset_two", "dataset_three"]
examples = []
for dataset in datasets:
    examples += db.get_dataset_examples(dataset)  # get examples from the database

# group examples by _input_hash and combine their spans into a single list
merged_examples = merge_spans(examples)
```
Now you could process the merged examples by trying to create spaCy entities from the span annotations; if there's a conflict, spaCy will raise an error. You can use a function similar to this one:
```python
def check_span_conflicts(example, nlp):
    doc = nlp.make_doc(example["text"])
    # Create spaCy spans from Prodigy spans
    spans = []
    for span in example.get("spans", []):
        spacy_span = doc.char_span(span["start"], span["end"], label=span["label"])
        if spacy_span:
            spans.append(spacy_span)
        else:
            raise ValueError(
                f"Span could not be created. Span offsets are misaligned with the "
                f"tokenization in example with input hash {example.get('_input_hash')}"
            )
    # Try to set entities - this will fail if there are conflicts
    try:
        doc.set_ents(spans)
    except Exception:
        raise ValueError(
            f"Conflicting spans detected in example with input hash "
            f"{example.get('_input_hash')}"
        )
```
Once you've detected the problematic examples, you can add your own logic for custom conflict resolution.
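As an illustration only (this is not a built-in Prodigy function), one simple strategy would be to mimic what `train` does and keep the longest span whenever two spans overlap:

```python
def resolve_overlaps(example):
    """Keep only the longest span among any group of overlapping spans.
    Sketch of one possible resolution strategy, not a Prodigy API."""
    # sort longest-first so preferred spans are considered before shorter rivals
    spans = sorted(
        example.get("spans", []),
        key=lambda s: s["end"] - s["start"],
        reverse=True,
    )
    kept = []
    for span in spans:
        overlaps = any(
            span["start"] < other["end"] and other["start"] < span["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(span)
    example["spans"] = sorted(kept, key=lambda s: s["start"])
    return example
```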
If what you need is overlapping spans, you might consider span categorization instead.
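For annotating overlapping spans you could use the `spans.manual` recipe and then train a span categorizer (e.g. with `prodigy train --spancat`), which doesn't have the one-entity-per-token restriction.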
Thanks for confirming the behaviour of `db-merge`.
There aren't many cases of overlapping spans in my annotations, so I'm not sure why `train` failed to auto-merge. I am working with Japanese data; could this be an issue?
I've noticed that other components don't behave as expected with Japanese. For example, the `review` interface sometimes considers identically annotated examples as distinct. And the output of `print-dataset` has sentences chopped and shifted, with some tags missing.
As a workaround, I managed to combine my datasets using a script similar to the one you provided:
```python
from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ["dataset_one", "dataset_two", "dataset_three"]
examples = []
for dataset in datasets:
    examples += db.get_dataset(dataset)
merged_examples = merge_spans(examples)
rehashed_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset("merged_dataset")
db.add_examples(rehashed_examples, datasets=["merged_dataset"])
```
I was able to train a model on the output, and it included all of the expected NER categories.
No, as long as the tokenizer used during annotation is the same as the one used by the `train` command, the language should not matter. Also, it's not only overlapping spans that would be a problem for NER: misaligned spans (span offsets not corresponding to the tokenization) would also be rejected.
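To illustrate the kind of misalignment that can happen (a hypothetical sketch; the text, offsets and label are made up, and `spacy.blank("ja")` requires SudachiPy):

```python
import spacy

nlp = spacy.blank("ja")  # Japanese tokenizer (requires SudachiPy)
doc = nlp.make_doc("東京都に住んでいます")
# A span whose character offsets fall inside a token cannot be aligned
# with the tokenization, so char_span returns None and the annotation
# would be rejected during training
span = doc.char_span(4, 5, label="X")  # may fall mid-token, depending on the tokenizer
print(span)  # None if the offsets don't match token boundaries
```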
Perhaps you were using `ner.manual` with the `--highlight-chars` option? That could definitely have introduced misaligned spans, as it doesn't retokenize the example (as per the docs here).
> And the output of `print-dataset` has sentences chopped and shifted, with some tags missing.
`print-dataset` prints exactly what's in the spans; it should even print overlapping and misaligned spans correctly. Could you provide an example of such a "chopped and shifted" example with "tags missing"? The source dataset entry and a screenshot of how it looks when printed with `print-dataset` would be great, thank you.
> For example, the `review` interface sometimes considers identically annotated examples as distinct.
I'm afraid I can't reproduce this problem either. I suspect there must be differences in the spans, such as inclusion/exclusion of whitespace, that are not easily visible. Since it seems to be a data-related issue, again, an example of an identical annotation that renders incorrectly in `review` would be great.