ner.batch-train ERROR: Trying to set conflicting doc.ents

I am getting the error below when running ner.batch-train for a new entity. It happens both when training a blank model and when training an existing spaCy model (like en_core_web_sm).

The error is strange because it is raised even though the two "conflicting" entities are identical.

File "gold.pyx", line 715, in spacy.gold.GoldParse.__init__
File "gold.pyx", line 925, in spacy.gold.biluo_tags_from_offsets
ValueError: [E103] Trying to set conflicting doc.ents: '(2, 43, 'adminagent')' and '(2, 43, 'adminagent')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

I understand the intent behind the error, but I am puzzled because there is only one entity in the annotation. Here is the annotation that triggers the error:

{'text': ' CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, as Administrative Agent',
'_input_hash': -1415453208,
'_task_hash': 354718826,
'tokens': [{'text': ' ', 'start': 0, 'end': 2, 'id': 0},
{'text': 'CREDIT', 'start': 2, 'end': 8, 'id': 1},
{'text': 'SUISSE', 'start': 9, 'end': 15, 'id': 2},
{'text': 'AG', 'start': 16, 'end': 18, 'id': 3},
{'text': ',', 'start': 18, 'end': 19, 'id': 4},
{'text': 'CAYMAN', 'start': 20, 'end': 26, 'id': 5},
{'text': 'ISLANDS', 'start': 27, 'end': 34, 'id': 6},
{'text': ' ', 'start': 35, 'end': 37, 'id': 7},
{'text': 'BRANCH', 'start': 37, 'end': 43, 'id': 8},
{'text': ',', 'start': 43, 'end': 44, 'id': 9},
{'text': 'as', 'start': 45, 'end': 47, 'id': 10},
{'text': 'Administrative', 'start': 48, 'end': 62, 'id': 11},
{'text': 'Agent', 'start': 63, 'end': 68, 'id': 12}],
'_session_id': 'adminagent-default',
'_view_id': 'ner_manual',
'spans': [{'start': 2,
'end': 43,
'token_start': 1,
'token_end': 8,
'label': 'adminagent'}],
'answer': 'accept'}

I have a large corpus and a large number of entities to annotate, so I took the following steps:

  1. Annotated a few samples using ner.manual
  2. Used ner.batch-train to create a seed model based on the annotations from (1)
  3. Used the seed model from (2) to create another set of annotations for the entity with the ner.teach binary annotation recipe

I used the same dataset for the annotations in steps 1 and 3. The conflict error appears when I try to train a model on the annotations from step 3 using ner.batch-train.

Please advise.

Hi! The problem here likely happens because you combined the two datasets, which is not ideal: your manual annotations are typically more or less complete gold-standard annotations, while the ner.teach annotations are binary accept/reject examples. So going forward, you probably want to use separate datasets here.

It looks like Prodigy currently doesn't filter out duplicates when it merges annotated spans, so if you've labelled an example once in manual mode and once in binary mode, the training example will end up with two duplicate spans, and spaCy will complain, because overlaps usually indicate a problem with the data.

The easiest workaround would probably be to export the dataset or load it in your script, merge all spans and then loop over the examples and filter out the duplicates. Untested, but something like this should work:

from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()
examples = db.get_dataset("your_dataset")
# Combine the spans of all examples that share the same input text
examples = merge_spans(examples)
for eg in examples:
    seen_spans = set()  # (start, end) offsets already kept for this example
    filtered_spans = []
    for span in eg.get("spans", []):
        start_end = (span["start"], span["end"])
        # Keep only the first span for each pair of character offsets
        if start_end not in seen_spans:
            filtered_spans.append(span)
            seen_spans.add(start_end)
    eg["spans"] = filtered_spans

# Save the de-duplicated examples to a new dataset
db.add_dataset("your_filtered_dataset")
db.add_examples(examples, datasets=["your_filtered_dataset"])

Thanks @ines for the quick response. Your suggestion worked and resolved the issue. I am able to batch-train after fixing the annotations.

Spacy/Prodigy team rocks!

@ines: I am getting another error, which is different but similar in nature. While I was annotating with ner.teach, Prodigy gave me a couple of overlapping suggestions. Both were wrong, so I rejected them. Prodigy then threw the error below, which shows the overlapping spans with an exclamation mark ("!") in front of the entity label. I am not sure what that means.

Prodigy has not saved any annotations from that point onward. This time I started ner.teach with a fresh dataset, so there is no possibility of any mix-up.

Error:

File "cython_src\prodigy\core.pyx", line 137, in prodigy.core.Controller.receive_answers
File "cython_src\prodigy\models\ner.pyx", line 344, in prodigy.models.ner.EntityRecognizer.update
File "cython_src\prodigy\models\ner.pyx", line 441, in prodigy.models.ner.EntityRecognizer._update
File "gold.pyx", line 715, in spacy.gold.GoldParse.__init__
File "gold.pyx", line 925, in spacy.gold.biluo_tags_from_offsets
ValueError: [E103] Trying to set conflicting doc.ents: '(852, 870, '!adminagent')' and '(852, 882, '!adminagent')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

Text:

'The occurrence of any of the following events shall be an Event of Default under this Note: (i) the Borrower shall fail to pay any principal of any Loan (including scheduled installments, mandatory prepayments or the payment due at maturity) on the date on which such principal becomes due in accordance with the terms hereof and the Loan Agreement; (ii) the Borrower or other applicable Loan Party shall fail to pay any interest on any Loan or any other amount owing hereunder or under the other Loan Documents on the date on which such interest or other amount becomes due in accordance with the terms hereof or thereof and such failure shall continue unremedied for a period of three (3) business days thereafter; (iii) an Event of Default as defined herein, in any of the other Loan Documents or in any other agreement between any Loan Party and PNC Bank, National Association or any of its subsidiaries or affiliates; (iv) a default or event of default under or as defined in any other instrument or document between any Loan Party and PNC Bank, National Association or any of its subsidiaries or affiliates which continues beyond any applicable grace, notice or cure period therein provided or if none is provided, beyond thirty (30) days thereafter; (v) the filing by or against any Loan Party of any proceeding in bankruptcy, receivership, insolvency, reorganization, liquidation, conservatorship or similar proceeding (and, in the case of any such proceeding instituted against any Loan Party, such proceeding is not dismissed or stayed within 45 days of the commencement thereof, provided that the Bank shall not be obligated to advance additional funds hereunder during such period); (vi) any assignment by any Loan Party for the benefit of creditors; (vii) any levy, garnishment, attachment or similar proceeding is instituted against any material property of any Loan Party held by or deposited with the Bank which has not been vacated, discharged, stayed or bonded pending appeal 
within 30 days from the entry thereof; (viii) a default with respect to any other indebtedness of any Loan Party for borrowed money in excess of $50,000 that continues beyond any applicable grace, notice or cure period therein provided, if the effect of such default is to cause or permit the acceleration of such debt; (ix) the commencement of any foreclosure or forfeiture proceeding, execution or attachment against any material portion of the Collateral which has not been vacated, discharged, stayed or bonded pending appeal within 30 days from the entry thereof; (x) the entry of a final judgment against any Loan Party and the failure of such Loan Party to discharge the judgment within thirty (30) days of the entry thereof; (xi) any change in any Loan Party s business, assets, operations, financial condition or results of operations that could reasonably be expected to result in a Material Adverse Change; (xii) any Loan Party ceases doing business as a going concern; (xiii) any representation or warranty made by any Loan Party to the Bank in any Loan Document or any other documents now or in the future evidencing or securing the obligations of any Loan Party to the Bank, is false, erroneous or misleading in any material respect as of the time it was made or furnished; or (xiv) the revocation or attempted revocation, in whole or in part, of any guaranty by any Loan Party.'

Thanks, this sounds like it might be a bug with how spaCy is updated in the loop (and that only seems to affect some cases). We're investigating!

Which versions of Prodigy and spaCy are you running btw?

Edit: Okay, I think I have an explanation. The problem seems to occur if your data contains inconsistent annotations – for example, if you get two overlapping suggestions and you accept both. Prodigy previously ignored this, but a recent update to spaCy meant that spaCy is now stricter about this (which is usually good) and raises an error here.
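
To check your own data for this kind of inconsistency, a small helper along these lines may be useful. It is only a sketch in plain Python, not part of Prodigy or spaCy: the function name find_overlaps is made up for illustration, and it assumes each span is a dict with character offsets under "start" and "end", as in the task shown earlier in this thread.

```python
def find_overlaps(spans):
    """Return pairs of spans whose character ranges overlap.

    Hypothetical helper: assumes each span is a dict with "start" and "end"
    character offsets, where "end" is exclusive.
    """
    ordered = sorted(spans, key=lambda s: (s["start"], s["end"]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                # Spans are sorted by start, so no later span can overlap `first`
                break
            overlaps.append((first, second))
    return overlaps


# The two spans reported in the E103 message above
spans = [
    {"start": 852, "end": 870, "label": "adminagent"},
    {"start": 852, "end": 882, "label": "adminagent"},
]
for first, second in find_overlaps(spans):
    print("overlap:", first, second)
```

Running this over the "spans" of each example in a dataset should point to the tasks that spaCy would reject, so they can be inspected before training.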

Actually, I rejected both Prodigy suggestions. The suggestions were made by the ner.teach recipe with a model in the loop. I also started with a fresh dataset, so there is no possibility of contamination from the manual annotations I made to build the seed model.

spaCy v 2.2.2
Prodigy v 1.8.4

The latest version of Prodigy isn't officially compatible with spaCy v2.2 – so this is likely the problem here. spaCy v2.2 introduces stricter and backwards-incompatible behaviour around overlapping entities.

We will be releasing an officially compatible version of Prodigy soon – we typically wait until the latest version is confirmed stable and any remaining bugs (library and pre-trained models) are resolved. (Otherwise it'd be irresponsible to tell Prodigy users to upgrade and retrain all their models.)

I spent quite a bit of time getting spaCy and Prodigy to work together. Prodigy downgraded spaCy on installation, and then spaCy stopped working completely. I could not load any of the spaCy models (en_core_web_*). I downloaded them again with the lower spaCy version, but no joy. After trying several combinations of uninstalling and reinstalling, the only sequence that worked for me was as follows:

  1. first install prodigy,
  2. upgrade spaCy, and
  3. download all spacy models (en_core_web_*) again

That is how I ended up with this combination of spaCy and Prodigy versions. I know you have not yet released a Prodigy version for spaCy 2.2. If this issue is due to version incompatibility, I look forward to a fix or other suggestions in the new version.

Many thanks for the quick responses and awesome support that you are providing to the community.

Yes, this is likely what happened then! Installing one version on top of another pre-installed version (spaCy in your environment and then spaCy via Prodigy) can sometimes cause stale leftover artifacts (especially if the packages have C extensions) or dependency incompatibilities (depending on pip version etc). This is something we can't really influence, unfortunately. So the best way to prevent problems is usually to work with virtual environments everywhere.
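
A quick way to see which environment and package versions are actually in use is a check like the one below. This is a minimal sketch using only the Python standard library (Python 3.8+ for importlib.metadata); the helper names installed_version and in_virtualenv are made up for illustration, and the package names "spacy" and "prodigy" are assumed to match how they were installed with pip.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or "not installed"."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def in_virtualenv():
    """Return True if the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base interpreter.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment:", in_virtualenv())
for name in ("spacy", "prodigy"):  # assumed pip package names
    print(name, installed_version(name))
```

Running it with the same interpreter that launches Prodigy shows whether that interpreter sees the versions you expect.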