How will merge_spans treat Negative examples (no annotations)


Hi guys,

I followed this article: Merging annotations from different datasets to deal with the following scenario:

Trained Dataset A w/ Label A off training data TD-A
Trained Dataset B w/ Label B off training data TD-B (TD-A and TD-B don’t actually share any input_hash… they are targeted subsets of the overall data, primed to make the best use of the labellers’ time for A and B respectively)

Dataset A is only labelled w/ Label A, and same for B…

I ran the snippet of code outlined in the Support link above to generate Dataset C (A + B). I did this because I want to train a combined model for A + B (instead of having separate models, which I guess I could do… but it would be more performant at prediction time to have one, right? Especially since I plan on training a dozen or so different labels)

What happens in the following case?
If I Accepted the “absence” of a label in Dataset A… this gives the algo good info… specifically that the text block is devoid of the label… very important for the algo to learn in nuanced/borderline cases… I’ll call this a “negative” example for the rest of this question.

When Dataset C is trained on for both labels, e.g. --label A,B: how will the algo (working from Dataset C, synthesized from A and B, but no longer keeping any information about which source each example originated from) know, when training for the B labels, that a Dataset A “negative” example (i.e. Accepted, devoid of label) shouldn’t also count against the B label as a negative example? How will it know that this “absence” information is meant to steer the A labelling, and not the B labelling? In other words… won’t the algo interpret this absence of spans as a negative example for B as well? What if that text (originating from A) really did have a B label present? Won’t it be learning incorrectly?

Any ideas how to remedy this? Thanks

There might be a limitation in the intermediate dataset processing code, especially in the example snippets I provided, but I’m fairly sure the part which does the training will be able to handle this. So, if there’s a problem here, it should be fairly easy to solve when you’re merging the datasets.

As a quick recap, the NER training code parses each sentence twice: once to find the most likely parses, and again to find the most likely parses subject to some annotation constraints. It then updates the weights so that the constrained parses are more likely.

The function which does the constrained parsing is the prodigy.models.ner.EntityRecognizer.predict_best() method. This lets you provide spans marked with "accept" or "reject". The method is called within the prodigy.models.ner.EntityRecognizer.batch_train() method.

You should be able to check that the data getting passed into the batch train method has all the constraints the model should know about. If some information is missing, it should be easy to add those constraints back in, as spans that have been rejected.
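As a rough sketch of what keeping that information at merge time could look like: the snippet below is plain Python over task dictionaries, not Prodigy's own merge code. The `annotated_labels` key and both helper functions are hypothetical names introduced for illustration. The idea is only that each merged example remembers which label its annotator actually reviewed, so an accepted example with no spans is read as a negative for that label alone.

```python
def merge_with_provenance(datasets):
    """Merge per-label datasets, recording which label each example covers.

    `datasets` maps a label name to the list of task dicts annotated for
    that label. `annotated_labels` is a hypothetical key, not a Prodigy field.
    """
    merged = []
    for label, examples in datasets.items():
        for eg in examples:
            eg = dict(eg)  # shallow copy so the source data is not mutated
            eg["annotated_labels"] = [label]
            merged.append(eg)
    return merged


def is_negative_for(eg, label):
    """True only if the annotator reviewed `label` and found no such span."""
    if label not in eg.get("annotated_labels", []):
        return False  # never reviewed for this label: unknown, not negative
    if eg.get("answer") != "accept":
        return False
    return all(span.get("label") != label for span in eg.get("spans", []))


# Made-up examples standing in for Dataset A and Dataset B
dataset_a = [{"text": "No entities here.", "spans": [], "answer": "accept"}]
dataset_b = [
    {
        "text": "Some text.",
        "spans": [{"start": 0, "end": 4, "label": "B"}],
        "answer": "accept",
    }
]
merged = merge_with_provenance({"A": dataset_a, "B": dataset_b})
```

Here `is_negative_for(merged[0], "A")` is true, while `is_negative_for(merged[0], "B")` is false, because nobody checked that text for B.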

Hi…Sorry I am a little confused… High level, I have two separate models for two separate labels, trained with totally different data (no rows in common)…

When I perform a “merge” of the datasets (to try to build a single model), an Accept with no spans from one of the datasets (what I call a “negative” example) is inaccurately counted against both labels in the model, even though I never actually verified whether the second label did or did not exist in the negative examples from the “other” dataset.

Is there any way to do what I am trying to do, other than:

  1. Continue to have two separate models

  2. Go through all the “negative” samples and re-annotate against the “other” datasets, so I put the annotations in if they truly exist.


It might be helpful to think about how the data’s getting passed into spaCy, to actually update the model.

Let’s say we have three labels: POLITICS, SCIENCE and ECONOMICS. Since we have three labels, spaCy’s TextCategorizer model will be outputting a vector of three numbers. The logistic function 1. / (1 + math.exp(-x)) is used to squash each dimension of this output vector into the range [0, 1], so that each dimension can be read as a probability of that class. For each label, we’ll have an annotation that has three possible values: True, False, or None.
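For illustration, here is the logistic squashing on its own, as a standalone sketch rather than spaCy's internal code. The raw scores are made-up values for the three labels.

```python
import math


def logistic(x):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw scores for POLITICS, SCIENCE, ECONOMICS
raw_scores = [-2.0, 0.0, 3.0]
probabilities = [logistic(score) for score in raw_scores]
```

Each dimension is squashed independently, so the three probabilities are not forced to sum to 1. That is what allows more than one label to be true for the same text.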

Here are some examples of gold-standard labellings, along with the vector of ternary values they’d translate to:

'''Example valid gold-standards for one text, to train spaCy's textcat'''

# We accept 'science' as True, and reject 'politics' and 'economics'
cats = {"POLITICS": False, "SCIENCE": True, "ECONOMICS": False}
target = [0, 1, 0]
# Alternatively, we might have rejected all three labels --- maybe none apply
cats = {"POLITICS": False, "SCIENCE": False, "ECONOMICS": False}
target = [0, 0, 0]
# Here we have two true labels
cats = {"POLITICS": False, "SCIENCE": True, "ECONOMICS": True}
target = [0, 1, 1]
# Here we know that 'politics' is True, but don't know whether science or economics are true.
cats = {"POLITICS": True}
target = [1, None, None]
# Here we know that 'politics' is False, but don't know whether science or economics are true.
cats = {"POLITICS": False}
target = [0, None, None]

So. Let’s say you answered accept/reject/ignore questions on your two datasets. You have data that’s like this:

# Dataset 1
{"text": "...", "label": "A", "answer": "accept"}
{"text": "....", "label": "A", "answer": "reject"}
# Dataset 2
{"text": "...", "label": "B", "answer": "accept"}
{"text": "....", "label": "B", "answer": "reject"}

If the labels aren’t mutually exclusive — that is, if a text might be labelled so that both A and B are true — we would make the cats dictionaries like this:

{"A": True}
{"A": False}
{"B": True}
{"B": False}
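A minimal sketch of that conversion, assuming each annotation is a dictionary with `label` and `answer` keys as in the records above. `to_cats` is a hypothetical helper, not part of Prodigy or spaCy. Ignored answers produce an empty dictionary, so nothing is asserted about any label.

```python
def to_cats(annotation):
    """Turn one accept/reject answer into a partial `cats` dictionary.

    Only the label that was actually asked about gets a value. Every other
    label is left out, which marks it as unknown rather than False.
    """
    answer = annotation["answer"]
    if answer == "accept":
        return {annotation["label"]: True}
    if answer == "reject":
        return {annotation["label"]: False}
    return {}  # "ignore" or anything else: no information


annotations = [
    {"text": "...", "label": "A", "answer": "accept"},
    {"text": "....", "label": "A", "answer": "reject"},
    {"text": "...", "label": "B", "answer": "accept"},
    {"text": "....", "label": "B", "answer": "reject"},
]
all_cats = [to_cats(annotation) for annotation in annotations]
```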

The information about the other label is missing, so the update will include zeros for part of the gradient. If the labels are mutually exclusive, and every text has exactly one of them, we could make the gold-standard like this:

{"A": True, "B": False}
{"A": False, "B": True}
{"B": True, "A": False}
{"B": False, "A": True}

This makes the model learn more from the data, since we’re using what we know about the structure of the problem to fill in the missing information.
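To make the "zeros for part of the gradient" point concrete, here is a simplified sketch of a masked update. It assumes a per-label error of prediction minus target, which is a common choice for a logistic output. It is not spaCy's actual loss code, and `masked_gradient` is a hypothetical name.

```python
def masked_gradient(predictions, cats, labels):
    """Per-label error signal, with 0.0 wherever the annotation is missing."""
    gradient = []
    for label, predicted in zip(labels, predictions):
        truth = cats.get(label)  # None when the label was never annotated
        if truth is None:
            gradient.append(0.0)  # unknown: do not push the model either way
        else:
            gradient.append(predicted - float(truth))
    return gradient


labels = ["A", "B"]
predictions = [0.75, 0.25]  # made-up model outputs for one text

# Only A was annotated: B contributes nothing to the update
partial = masked_gradient(predictions, {"A": False}, labels)

# B filled in from the structure of the problem: both outputs are updated
filled = masked_gradient(predictions, {"A": False, "B": True}, labels)
```

With only `{"A": False}`, the B output receives no update at all. Filling in B gives the model more signal, but that is only safe when the structure of the problem really justifies the filled-in value.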

Is that clearer?

Yes, thank you… so it seems you are saying that merging the datasets would yield results like you show above:
{"A": True}
{"A": False}
{"B": True}
{"B": False}

(only, I am using NER labelling, so it would also include the character span indexes, right?)…
and then the model should still be able to come up with quality results against all the labels because you are not telling the model a decision on the unlabelled content, right?