Welcome to the forum @alan-hogue
It is definitely possible to "propagate" annotations from one task to another.
The return statement of a Prodigy recipe specifies which annotation interface should be used via the view_id attribute. This view_id also determines the required data structure of each example in the annotation stream. If an example matches this requirement, Prodigy will render it.
In the case of ner.manual and choice used in the entity linking tutorial, this process is automatic because the choice annotation interface will render NER spans if they are available in the example.
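For reference, a pre-annotated example in such a stream could look roughly like this (the text, offsets and label here are made up):

{"text": "Emerson was born in Lisbon.", "spans": [{"start": 0, "end": 7, "label": "PERSON"}]}

As long as each example has a "text" and (optionally) "spans" with character offsets and labels, the choice interface will highlight the spans alongside the options the recipe adds.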
So the only modification you'd need to make is to provide the pre-annotated dataset for the recipe to use.
Assuming your curated NER dataset is called "ner-curated", you'd first have to store it on disk by running:
prodigy db-out ner-curated ner-curated.jsonl
and then modify the tutorial's Prodigy recipe so that it uses this dataset rather than the model's annotations.
Here's the updated version of the tutorial's recipe that does that (I added #UPDATED on the modified/new lines):
import spacy
from spacy.kb import KnowledgeBase
import prodigy
from prodigy.models.ner import EntityRecognizer
from prodigy.components.stream import get_stream  #UPDATED
from prodigy.components.filters import filter_duplicates
import csv
from pathlib import Path


@prodigy.recipe(
    "entity_linker.manual",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a .jsonl file", "positional", None, Path),  #UPDATED
    nlp_dir=("Path to the NLP model with a pretrained NER component", "positional", None, Path),
    kb_loc=("Path to the KB", "positional", None, Path),
    entity_loc=("Path to the file with additional information about the entities", "positional", None, Path),
)
def entity_linker_manual(dataset, source, nlp_dir, kb_loc, entity_loc):
    # Load the NLP and KB objects from file
    nlp = spacy.load(nlp_dir)
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=1)
    kb.load_bulk(kb_loc)

    # Read the pre-defined CSV file into dictionaries mapping QIDs to the full names and descriptions
    id_dict = dict()
    with entity_loc.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            id_dict[row[0]] = (row[1], row[2])

    # Initialize the Prodigy stream by loading the pre-annotated dataset
    stream = get_stream(source)  #UPDATED

    # For each NER mention, add the candidates from the KB to the annotation task
    stream.apply(_add_options, stream=stream, kb=kb, id_dict=id_dict)  #UPDATED to use the newer API
    stream.apply(filter_duplicates, stream=stream, by_input=False, by_task=True)  #UPDATED to use the newer API

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",  # the choice view will render NER spans if present in the input
        "config": {"choice_auto_accept": True},
    }


def _add_options(stream, kb, id_dict):
    """Define the options the annotator will be given, by consulting the candidates from the KB for each NER span."""
    for task in stream:
        text = task["text"]
        for span in task["spans"]:
            start_char = int(span["start"])
            end_char = int(span["end"])
            mention = text[start_char:end_char]
            candidates = kb.get_candidates(mention)
            if candidates:
                options = [{"id": c.entity_, "html": _print_url(c.entity_, id_dict)} for c in candidates]
                # we sort the options by ID
                options = sorted(options, key=lambda r: int(r["id"][1:]))
                # we add in a few additional options in case a correct ID cannot be picked
                options.append({"id": "NIL_otherLink", "text": "Link not in options"})
                options.append({"id": "NIL_ambiguous", "text": "Need more context"})
                task["options"] = options
                yield task


def _print_url(entity_id, id_dict):
    """For each candidate QID, create a link to the corresponding Wikidata page and print the description."""
    url_prefix = "https://www.wikidata.org/wiki/"
    name, descr = id_dict.get(entity_id)
    option = "<a href='" + url_prefix + entity_id + "'>" + entity_id + "</a>: " + descr
    return option
That's it! You just need to read in the pre-annotated dataset, add KB options to it and Prodigy will render the NER spans.
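Assuming you've saved the updated recipe as el_recipe.py (the dataset name, model, KB and entities file paths below are just placeholders), you'd start the server with:

prodigy entity_linker.manual el-linked ner-curated.jsonl ./my_nlp ./my_kb ./entities.csv -F el_recipe.py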
To mix other types of annotation interfaces, including fully custom ones, you can use blocks. The core of the solution is to match the data structure expected by the view_id with the data structure of the examples. The stream of examples can be modified outside the recipe (e.g. if it's the output of another recipe) or within the recipe function via the Stream's apply method.
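For example, here's a minimal sketch of a blocks-based return value that renders the choice options together with a free-text field (the field_id and label are just illustrative):

return {
    "dataset": dataset,
    "stream": stream,
    "view_id": "blocks",
    "config": {
        "blocks": [
            {"view_id": "choice"},
            {"view_id": "text_input", "field_id": "user_comment", "field_label": "Comments"},
        ]
    },
}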