Annotation pipeline - chaining multiple annotation task types

I would like to set up a chain of annotation tasks where the annotations resulting from the first stage feed into and are visible to the annotator at the next stage.

Specifically, I'd like to do entity linking with NER spans that have already gone through a correction step. What I want is similar to what they show in the video tutorial for entity linking here. But instead of having the NER handled by a model, I would like to have the NER labels imported from a previous ner.correct stage.

To make this question more general, what if after that I wanted to do coreference resolution, but I only wanted to target those entities that were linked in the linking stage and ignore the others? Can I take those annotations from the linking stage and make them visible in the coreference resolution stage?

I am not sure how to accomplish this kind of annotation task chaining. Are there recent, up-to-date examples that I can look at somewhere?

Welcome to the forum @alan-hogue :wave:

It is definitely possible to "propagate" annotations from one task to another.

The return statement of a Prodigy recipe specifies which annotation interface should be used via the view_id attribute. This view_id also determines the required data structure of each example in the annotation stream. If the examples match that required structure, Prodigy will render them.
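For illustration, the return dict of a minimal (hypothetical) recipe could look like this - the view_id picks the interface, and each example in the stream has to carry the fields that interface expects:

return {
    "dataset": dataset,        # where the annotations get saved
    "stream": stream,          # iterable of task dicts matching the view_id's format
    "view_id": "ner_manual",   # e.g. expects "text" (and "spans" for pre-highlighted entities)
}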

In the case of ner.manual and choice used in the entity linking tutorial, this process is automatic because the choice annotation interface will render NER spans if they are available in the example.
So the only modification you'd have to make is to provide the pre-annotated dataset for the recipe to use.
Assuming your curated NER dataset is called "ner-curated", you'd first have to store it on disk by running:

prodigy db-out ner-curated ner-curated.jsonl

and then modify the tutorial's Prodigy recipe so that it uses this dataset rather than the model's annotations.
Here's the updated version of the tutorial's recipe that does that (I added #UPDATED on the modified/new lines):

import spacy
from spacy.kb import KnowledgeBase

import prodigy
from prodigy.models.ner import EntityRecognizer
from prodigy.components.stream import get_stream #UPDATED
from prodigy.components.filters import filter_duplicates

import csv
from pathlib import Path


@prodigy.recipe(
    "entity_linker.manual",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a .jsonl file", "positional", None, Path), #UPDATED
    nlp_dir=("Path to the NLP model with a pretrained NER component", "positional", None, Path),
    kb_loc=("Path to the KB", "positional", None, Path),
    entity_loc=("Path to the file with additional information about the entities", "positional", None, Path),
)
def entity_linker_manual(dataset, source, nlp_dir, kb_loc, entity_loc):
    # Load the NLP and KB objects from file
    nlp = spacy.load(nlp_dir)
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=1)
    kb.load_bulk(kb_loc)

    # Read the pre-defined CSV file into dictionaries mapping QIDs to the full names and descriptions
    id_dict = dict()
    with entity_loc.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            id_dict[row[0]] = (row[1], row[2])

    # Initialize the Prodigy stream by loading the preannotated dataset
    stream = get_stream(source) #UPDATED

    # For each NER mention, add the candidates from the KB to the annotation task
    stream.apply(_add_options, stream=stream, kb=kb, id_dict=id_dict) #UPDATED to use the newer API
    stream.apply(filter_duplicates, stream=stream, by_input=False, by_task=True) #UPDATED to use the newer API

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice", # choice view if will render NER spans if present in the input
        "config": {"choice_auto_accept": True},
    }


def _add_options(stream, kb, id_dict):
    """ Define the options the annotator will be given, by consulting the candidates from the KB for each NER span. """
    for task in stream:
        text = task["text"]
        for span in task["spans"]:
            start_char = int(span["start"])
            end_char = int(span["end"])
            mention = text[start_char:end_char]

            candidates = kb.get_candidates(mention)
            if candidates:
                options = [{"id": c.entity_, "html": _print_url(c.entity_, id_dict)} for c in candidates]

                # we sort the options by ID
                options = sorted(options, key=lambda r: int(r["id"][1:]))

                # we add in a few additional options in case a correct ID can not be picked
                options.append({"id": "NIL_otherLink", "text": "Link not in options"})
                options.append({"id": "NIL_ambiguous", "text": "Need more context"})

                task["options"] = options
                yield task


def _print_url(entity_id, id_dict):
    """ For each candidate QID, create a link to the corresponding Wikidata page and print the description """
    url_prefix = "https://www.wikidata.org/wiki/"
    name, descr = id_dict.get(entity_id)
    option = "<a href='" + url_prefix + entity_id + "'>" + entity_id + "</a>: " + descr
    return option

That's it! You just need to read in the pre-annotated dataset, add KB options to it and Prodigy will render the NER spans.
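To make that input concrete, a single line of ner-curated.jsonl would look roughly like this (text, offsets and labels invented for illustration; the "tokens" list and db-out metadata fields like _input_hash are trimmed for brevity):

{"text": "Ada Lovelace was born in London.", "spans": [{"start": 0, "end": 12, "token_start": 0, "token_end": 1, "label": "PERSON"}, {"start": 25, "end": 31, "token_start": 5, "token_end": 5, "label": "GPE"}]}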

To mix other types of annotation interfaces, including fully custom ones, you can use blocks. The core of the solution is to match the view_id's expected data structure with the example's data structure. The stream of examples can be modified outside the recipe (e.g. if it's another recipe's output) or within the recipe function via the Stream's apply method.
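As a rough sketch (the block overrides follow the blocks documentation; this isn't part of the tutorial), a recipe that shows the NER spans read-only above the choice options could return:

return {
    "dataset": dataset,
    "stream": stream,
    "view_id": "blocks",
    "config": {
        "blocks": [
            {"view_id": "ner"},                   # read-only rendering of the spans
            {"view_id": "choice", "text": None},  # options, without repeating the text
        ]
    },
}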


Wow, thanks so much for the extensive reply. This is really helpful.

Just one follow up question since I have found the docs slightly confusing in parts. When you say:

The core of the solution is to match the view_id's expected data structure with the example's data structure.

I take it that the view_id expected data structure is what is shown here, where it is called "JSON Task Format". Is that correct?

In other words, if I want to have something visible in a "spans" interface, I would need them to appear in here:

"spans": [
    {"start": 7, "end": 16, "token_start": 2, "token_end": 3, "label": "REF"},
    {"start": 25, "end": 37, "token_start": 5, "token_end": 7, "label": "REASON"},
    {"start": 33, "end": 37, "token_start": 7, "token_end": 7, "label": "ATTR"}
  ]

Is that right? Thanks!

Glad I could help :slight_smile:
And yes, the JSON Task Format is what we use in the docs to show the expected data structure for each annotation interface.
You're also right about the location of spans. Just to leave things crystal clear, you'll also need tokens, as the spans reference tokens via token_start and token_end. So the full representation would be like in the docs:

{
  "text": "I like baby cats because they're cute",
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0, "ws": true},
    {"text": "like", "start": 2, "end": 6, "id": 1, "ws": true},
    {"text": "baby", "start": 7, "end": 11, "id": 2, "ws": true},
    {"text": "cats", "start": 12, "end": 16, "id": 3, "ws": true},
    {"text": "because", "start": 17, "end": 24, "id": 4, "ws": true},
    {"text": "they", "start": 25, "end": 29, "id": 5, "ws": false},
    {"text": "'re", "start": 29, "end": 32, "id": 6, "ws": true},
    {"text": "cute", "start": 33, "end": 37, "id": 7, "ws": false}
  ],
  "spans": [
    {"start": 7, "end": 16, "token_start": 2, "token_end": 3, "label": "REF"},
    {"start": 25, "end": 37, "token_start": 5, "token_end": 7, "label": "REASON"},
    {"start": 33, "end": 37, "token_start": 7, "token_end": 7, "label": "ATTR"}
  ]
}

Hi there, sorry but I am encountering a problem running this.

After running the scripts to create the NLP model and so on, this happens:

prodigy entity_linker.manual ner_linked ner_test.jsonl my_output/my_nlp my_output/my_kb assets/entities.csv -F scripts/el_recipe.py

/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'en_core_web_lg' (3.5.0) was trained with spaCy v3.5.0 and may not be 100% compatible with the current version (3.7.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/alan/repos/agolo/prodigy-projects/tutorials/nel_emerson/scripts/el_recipe.py", line 24, in entity_linker_manual
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "spacy/kb/kb.pyx", line 27, in spacy.kb.kb.KnowledgeBase.__init__
TypeError: [E1046] KnowledgeBase is an abstract class and cannot be instantiated. If you are looking for spaCy's default knowledge base, use `InMemoryLookupKB`.

If I replace 'KnowledgeBase' with 'InMemoryLookupKB', I get this error:

prodigy entity_linker.manual ner_linked ner_test.jsonl my_output/my_nlp my_output/my_kb assets/entities.csv -F scripts/el_recipe.py

/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'en_core_web_lg' (3.5.0) was trained with spaCy v3.5.0 and may not be 100% compatible with the current version (3.7.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/alan/repos/agolo/prodigy-projects/tutorials/nel_emerson/scripts/el_recipe.py", line 25, in entity_linker_manual
    kb.load_bulk(kb_loc)
    ^^^^^^^^^^^^
AttributeError: 'spacy.kb.kb_in_memory.InMemoryLookupKB' object has no attribute 'load_bulk'

Is this a version compatibility issue perhaps? I am using Python 3.11.6. From 'pip freeze':

prodigy==1.15.0
...
spacy==3.7.4

Hm, ok, so changing 'kb.load_bulk' to 'kb.from_disk' seems to have worked.
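For reference, the relevant lines in my copy now read:

kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=1)
kb.from_disk(kb_loc)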

Unfortunately now it is having trouble with the path to the entities file. Please see below:

prodigy entity_linker.manual ner_linked ner_test.jsonl my_output/my_nlp my_output/my_kb ./assets/entities.csv -F scripts/el_recipe.py

/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'en_core_web_lg' (3.5.0) was trained with spaCy v3.5.0 and may not be 100% compatible with the current version (3.7.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/alan/repos/agolo/prodigy-projects/tutorials/nel_emerson/scripts/el_recipe.py", line 29, in entity_linker_manual
    with entity_loc.open("r", encoding="utf8") as csvfile:
         ^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'open'

The argument is supposed to be a string according to the argument help text.

Wow this is very strange.

I changed that line to this:

"""
with Path(entity_loc).open("r", encoding="utf8") as csvfile:
"""

Now running it gives this:

prodigy entity_linker.manual ner_linked ner_test.jsonl my_output/my_nlp my_output/my_kb /Users/alan/repos/agolo/prodigy-projects/tutorials/nel_emerson/assets/entities.csv -F scripts/el_recipe.py

/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'en_core_web_lg' (3.5.0) was trained with spaCy v3.5.0 and may not be 100% compatible with the current version (3.7.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/alan/.pyenv/versions/3.11.6/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 135, in prodigy.cli.run_recipe
  File "cython_src/prodigy/core.pyx", line 155, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 307, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/stream.pyx", line 191, in prodigy.components.stream.Stream.is_empty
  File "cython_src/prodigy/components/stream.pyx", line 230, in prodigy.components.stream.Stream.peek
  File "cython_src/prodigy/components/stream.pyx", line 343, in prodigy.components.stream.Stream._get_from_iterator
  File "cython_src/prodigy/components/filters.pyx", line 54, in filter_duplicates
  File "/Users/alan/repos/agolo/prodigy-projects/tutorials/nel_emerson/scripts/el_recipe.py", line 58, in _add_options
    candidates = kb.get_candidates(mention)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "spacy/kb/kb_in_memory.pyx", line 259, in spacy.kb.kb_in_memory.InMemoryLookupKB.get_candidates
AttributeError: 'str' object has no attribute 'text'

Hi @alan-hogue ,

Are you running the original emerson demo code or the modification I posted before? In any case, you're missing the nlp_dir in this call

prodigy entity_linker.manual ner_linked ner_test.jsonl my_output/my_nlp my_output/my_kb ./assets/entities.csv -F scripts/el_recipe.py

which is why the objects are not what they're expected to be, causing the different errors you see when trying to read them.

Also, are you sure you're running the updated version of the tutorial for spaCy v3? That is this one here: projects/tutorials/nel_emerson at v3 · explosion/projects · GitHub

The same thing happens whether I run the script you posted above or the one in the repo.

I am using the correct repo now, which doesn't seem to make any difference:

https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson

I ran the commands to download the model and create the KB, and I have confirmed that everything is where it is supposed to be, following the README at the link above.

But I am getting the same error:

prodigy entity_linker.manual ner_linked ner_test.jsonl temp/my_nlp temp/my_kb ./assets/entities.csv -F scripts/el_recipe.py
/Users/alan/repos/explosion/projects/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/alan/repos/explosion/projects/.venv/lib/python3.9/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/alan/repos/explosion/projects/.venv/lib/python3.9/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "scripts/el_recipe.py", line 24, in entity_linker_manual
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=1)
  File "spacy/kb/kb.pyx", line 25, in spacy.kb.kb.KnowledgeBase.__init__
TypeError: [E1046] KnowledgeBase is an abstract class and cannot be instantiated. If you are looking for spaCy's default knowledge base, use `InMemoryLookupKB`.

Note that I have tried running it exactly as it appears in the repo, as well as your modified version. I get the same results regardless.

Are you able to run this without any errors? I am pretty sure this is all up to date.

Alright, I confirm that the recipe code el_recipe.py is slightly outdated (I hadn't noticed before - sorry!) in the way it uses the KnowledgeBase and how it reads the CSV file.

I have updated and tested the version of the script I posted before. Note that this script assumes a .jsonl file as input (as per your use case discussed before) and not the .txt file the demo uses.

"""
Custom Prodigy recipe to perform manual annotation of entity links,
given an existing NER model and a knowledge base performing candidate generation.
You can run this project without having Prodigy or using this recipe:
sample results are stored in assets/emerson_annotated_text.jsonl
"""

import spacy
from spacy.kb import InMemoryLookupKB, get_candidates

import prodigy
from prodigy.models.ner import EntityRecognizer
from prodigy.components.stream import get_stream #UPDATED
from prodigy.components.filters import filter_duplicates
from prodigy.components.preprocess import split_spans

import csv
from pathlib import Path


@prodigy.recipe(
    "entity_linker.manual",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a .jsonl file", "positional", None, Path), #UPDATED
    nlp_dir=("Path to the NLP model with a pretrained NER component", "positional", None, Path),
    kb_loc=("Path to the KB", "positional", None, Path),
    entity_loc=("Path to the file with additional information about the entities", "positional", None, Path),
)
def entity_linker_manual(dataset, source, nlp_dir, kb_loc, entity_loc):
    # Load the NLP and KB objects from file
    nlp = spacy.load(nlp_dir)
    kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=1) #UPDATED
    kb.from_disk(kb_loc)

    # Read the pre-defined CSV file into dictionaries mapping QIDs to the full names and descriptions
    id_dict = dict()
    with Path(entity_loc).open("r", encoding="utf8") as csvfile: #UPDATED
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            id_dict[row[0]] = (row[1], row[2])

    # Initialize the Prodigy stream by loading the preannotated dataset
    stream = get_stream(source) #UPDATED

    # For each NER mention, add the candidates from the KB to the annotation task
    stream.apply(_add_options, stream=stream, kb=kb, nlp=nlp, id_dict=id_dict) #UPDATED to use the newer API
    stream.apply(split_spans, stream=stream) #NEW we want one entity per task
    stream.apply(filter_duplicates, stream=stream, by_input=False, by_task=True) #UPDATED to use the newer API

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice", # choice view if will render NER spans if present in the input
        "config": {"choice_auto_accept": True},
    }


def _add_options(stream, kb, nlp, id_dict):
    """Define the options the annotator will be given, by consulting the candidates from the KB for each NER span."""
    for task in stream:
        text = task["text"]
        doc = nlp(text)  # parse once per task instead of once per span
        for mention in task["spans"]:
            start_char = int(mention["start"])
            end_char = int(mention["end"])
            span = doc.char_span(start_char, end_char, mention["label"])
            if span is None:  # char_span returns None for spans that don't align with token boundaries
                continue

            candidates = get_candidates(kb, span)
            if candidates:
                options = [
                    {"id": c.entity_, "html": _print_url(c.entity_, id_dict)}
                    for c in candidates
                ]

                # we sort the options by ID
                options = sorted(options, key=lambda r: int(r["id"][1:]))

                # we add in a few additional options in case a correct ID can not be picked
                options.append({"id": "NIL_otherLink", "text": "Link not in options"})
                options.append({"id": "NIL_ambiguous", "text": "Need more context"})

                task["options"] = options
                yield task


def _print_url(entity_id, id_dict):
    """ For each candidate QID, create a link to the corresponding Wikidata page and print the description """
    url_prefix = "https://www.wikidata.org/wiki/"
    name, descr = id_dict.get(entity_id)
    option = "<a href='" + url_prefix + entity_id + "'>" + entity_id + "</a>: " + descr
    return option

I also added the splitting of spans we discussed in another thread, so that the annotator deals with one entity at a time.
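Schematically (not the exact output, just the idea), split_spans turns one multi-span example into one task per span:

# before: one task with two spans
{"text": "...", "spans": [{"label": "REF", ...}, {"label": "ATTR", ...}]}
# after: two tasks with one span each
{"text": "...", "spans": [{"label": "REF", ...}]}
{"text": "...", "spans": [{"label": "ATTR", ...}]}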
I will for sure update the demo next week to make sure it's fully compatible with the latest spaCy and Prodigy APIs, but hopefully this unblocks you for now.