Examples from stream are shown twice

Hi!

We are having trouble running an annotation task with a custom recipe. Unfortunately, some examples from the stream are shown multiple times to the annotator. Therefore, we are ending up with lots of duplicates in the data.

To demonstrate the problem, we created a minimal example.

We are using the current Prodigy version, 1.11.4, and the following recipe:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe('ner_ate',
                dataset=("The dataset to save results", "positional", None, str),
                file_path=("The input data to use", "positional", None, str))
def ner_ate(dataset, file_path):
    def get_stream(fp):
        records = JSONL(fp)
        # use dummy image
        src = 'https://prodi.gy/static/social_dark-73aae237522610d930c61b32422092ef.jpg'
        for record_num, record in enumerate(records):
            # examples with text for ner.manual and html code with image for html block
            # record number in meta data to keep track of order
            yield {"text": record["text"],  # text for ner.manual block
                   "html": f'<img src="{src}" alt="" height="200" />',
                   "meta": {"record_num": record_num}}

    nlp = spacy.blank("en")  # blank spaCy pipeline for tokenization
    stream = get_stream(fp=file_path)  # set up the stream
    stream = add_tokens(nlp, stream)  # tokenize the stream for ner.manual

    return {
        "dataset": dataset,  # the dataset to save annotations to
        "view_id": "blocks",  # set the view_id to "blocks"
        "stream": stream,  # the stream of incoming examples
        "config": {
            "host": "0.0.0.0",
            "port": 8182,
            "labels": ["Aspect"],
            "blocks": [{"view_id": "ner_manual"},
                       {"view_id": "html"}]
        }
    }

For testing purposes, we were using the news_headlines dataset
and ran Prodigy in a Docker container with the following simple Dockerfile:

FROM python:3.7

ENV PRODIGY_HOME ./app
ENV PRODIGY_LOGGING basic

RUN pip3 install --upgrade pip && \
    pip3 install spacy==3.1.2

COPY wheel/prodigy-1.11.4-cp37-cp37m-linux_x86_64.whl ./wheel/
RUN pip3 install wheel/prodigy-1.11.4-cp37-cp37m-linux_x86_64.whl \
    && rm -rf wheel/prodigy-1.11.4-cp37-cp37m-linux_x86_64.whl
RUN python3 -m spacy download en_core_web_sm

COPY data ./data/
COPY recipes ./recipes/

CMD python3 -m prodigy ner_ate ate_highinv ./data/news_headlines.jsonl.txt -F ./recipes/ner_ate.py

EXPOSE 8182

There is no extra config.json.

Using the record_num in the meta data, we were able to keep track of the order in which the examples were shown. At multiple points, Prodigy jumped back, e.g. from record_num 35 to record_num 11, repeated some examples, and then continued at record_num 36. The result was a dataset of 240 annotations (see screenshot below), even though there are only 200 examples in the dataset. In other words, 40 examples (20% of the data) were repeated and shown twice to the user.
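To confirm this, we counted the repetitions in the exported data. A minimal sketch of that check, assuming the annotations were exported to JSONL with `prodigy db-out`:

```python
import json
from collections import Counter

def count_repeats(lines):
    """Count how often each record_num appears in db-out JSONL lines."""
    counts = Counter(json.loads(line)["meta"]["record_num"] for line in lines)
    # keep only record numbers that were annotated more than once
    return {num: n for num, n in counts.items() if n > 1}

# tiny inline example instead of a real export file
lines = [
    '{"meta": {"record_num": 10}}',
    '{"meta": {"record_num": 11}}',
    '{"meta": {"record_num": 11}}',
]
print(count_repeats(lines))  # {11: 2}
```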

Do you have any idea what could be going wrong here?

Thanks a lot in advance!
Niclas

Thanks for the detailed report and the example!

One quick question about your annotation process: Did this all happen within the same annotation session or did you ever restart the server? And if you look at the duplicate examples in the data, are the _input_hash and/or _task_hash values identical?

Hi @ines, thanks for the fast reply.

Yes, the annotations were all done in one session. I clicked through the 200 examples in one go without interruption. But I also didn't hold down the "accept" key or anything like that.

Here is an example of a duplicated record from the dataset. The two rows are identical except for the timestamp: record_num 127 was correctly displayed at position 127 and then repeated at position 151.

                           text                 meta  _input_hash  _task_hash  \
127  Digital Muse for Beat Poet  {'record_num': 127}   -805557754   893792064   
151  Digital Muse for Beat Poet  {'record_num': 127}   -805557754   893792064   

     answer  _timestamp  
127  accept  1633723793  
151  accept  1633723803

Hi,
thank you for the amazing tool :pray: Unfortunately, I have the same problem of duplicates in an NER annotation task and would appreciate any advice on solving it.

Hi! Are you using the same version and observing the same problem, i.e. examples with identical task hashes and input hashes being repeated in the same session?

We have a new version coming up and I'll update this thread once it's out. It might have an impact, so definitely try it and see if you still encounter duplication.

Hi! I have the same issue: multiple repeated examples are shown, but in our case we use the multi-session mode. Our team noticed it happens when more than one user is annotating at the same time; when just one person is annotating, it stops repeating after a few examples. We are running version 1.11.4 with a ner.manual recipe, and here is the config:

{
    "theme": "basic",
    "buttons": ["accept", "reject", "ignore", "undo"],
    "batch_size": 20,
    "history_size": 20,
    "port": 8000,
    "host": "0.0.0.0",
    "cors": true,
    "db": "sqlite",
    "db_settings": {
        "sqlite": {
            "name": "prodigy.db",
            "path": "/app/database"
        }
    },
    "api_keys": {},
    "validate": true,
    "auto_exclude_current": true,
    "instant_submit": false,
    "feed_overlap": false,
    "ui_lang": "pt",
    "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
    "show_stats": false,
    "hide_meta": false,
    "show_flag": false,
    "instructions": false,
    "swipe": false,
    "swipe_gestures": { "left": "accept", "right": "reject" },
    "split_sents_threshold": false,
    "html_template": false,
    "global_css": null,
    "javascript": null,
    "writing_dir": "ltr",
    "show_whitespace": false
}

some examples from db-out:

{'text': 'As pessoas reclamam,...', '_input_hash': 1683003174, '_task_hash': -82705531, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'As pessoas reclamam,...', '_input_hash': 1683003174, '_task_hash': -82705531, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Destarte, é inegável...', '_input_hash': 916516496, '_task_hash': -354755047, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Destarte, é inegável...', '_input_hash': 916516496, '_task_hash': -354755047, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Esse cenário antagôn...', '_input_hash': 1442234240, '_task_hash': -1947921422, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Esse cenário antagôn...', '_input_hash': 1442234240, '_task_hash': -1947921422, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Logo, para o sociólo...', '_input_hash': 989722103, '_task_hash': 1152576164, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Logo, para o sociólo...', '_input_hash': 989722103, '_task_hash': 1152576164, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'No início de 2021 oc...', '_input_hash': -555826576, '_task_hash': 1621368279, '_annotator_id': 'tagger_v2-marinastri'}
{'text': 'No início de 2021 oc...', '_input_hash': -555826576, '_task_hash': 1621368279, '_annotator_id': 'tagger_v2-marinastri'}
{'text': 'O acesso ao conteúdo...', '_input_hash': -718039749, '_task_hash': -672062954, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'O acesso ao conteúdo...', '_input_hash': -718039749, '_task_hash': -672062954, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Primeiramente, como ...', '_input_hash': 1883889543, '_task_hash': 27718084, '_annotator_id': 'tagger_v2-guiij'}
{'text': 'Primeiramente, como ...', '_input_hash': 1883889543, '_task_hash': 27718084, '_annotator_id': 'tagger_v2-guiij'}
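As a stopgap until the underlying issue is resolved, the export can be deduplicated after the fact. A minimal sketch, assuming a `db-out` export loaded as a list of dicts, that keeps only the first answer per annotator and task:

```python
def dedupe(examples):
    """Keep the first occurrence of each (_annotator_id, _task_hash) pair."""
    seen = set()
    unique = []
    for eg in examples:
        key = (eg["_annotator_id"], eg["_task_hash"])
        if key not in seen:
            seen.add(key)
            unique.append(eg)
    return unique

examples = [
    {"_annotator_id": "tagger_v2-guiij", "_task_hash": -82705531},
    {"_annotator_id": "tagger_v2-guiij", "_task_hash": -82705531},
    {"_annotator_id": "tagger_v2-guiij", "_task_hash": -354755047},
]
print(len(dedupe(examples)))  # 2
```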

We just released v1.11.5, which includes a fix that's likely relevant. Could you re-run your process with the new version and see if it resolves the problem?