Thanks @magdaaniol.
I'll share my custom recipe at the bottom. I have an input JSONL file that contains only texts. Here's an example:
{"text": "This is sentence number 1", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
{"text": "This is sentence number 2", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
{"text": "This is sentence number 3", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
Initially, I run the following Prodigy command:
PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy rel.manual DATABASE blank:en input.jsonl --label LABEL --span-label SLABEL --wrap
I annotate the first sentence and save.
Then, I run my custom recipe with the command:
PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy ner-re-custom DATABASE input.jsonl -F custom_recipe.py
I have also run the same command with PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true, "exclude_by": "input"}'. In both cases, I am asked to annotate Sentence 1 again. I want to continue where I left off with the previous recipe, so that the interface shows me Sentence 2.
By contrast, if I run ner.manual, I can continue where I left off with rel.manual, so Prodigy shows me Sentence 2:
PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy ner.manual DATABASE blank:en input.jsonl --label LABEL
Of course, running rel.manual first and then ner.manual may not make sense; I only do this to show that Prodigy doesn't show examples previously annotated with another recipe, whereas my custom recipe does show those examples again.
I can also confirm that my custom recipe generates different input and task hashes than Prodigy's ner.manual and rel.manual. What can I do in my recipe so that the input and task hashes match those from Prodigy? The interface elements I'm adding in my custom recipe (choice, text input) are for annotator feedback only; the actual input is the text, so I want Prodigy to treat these as the same tasks we annotated with rel.manual. I also want to save to the same database we have been using, or perhaps save to a new one and use the --exclude flag, but I still need the hashes to match for that to work.
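To make the goal concrete, here is a minimal sketch of the behavior I'm after. It does not use Prodigy: hash_by_keys is a hypothetical stand-in for key-based hashing, and the real set_hashes may compute its hashes differently. The only point is that a hash computed from the "text" key alone should be unaffected by the extra "options" and "html" fields my recipe adds, while a hash that includes "options" would change.

```python
import hashlib
import json


def hash_by_keys(task, keys):
    """Hypothetical helper (not Prodigy's API): hash only the selected keys."""
    selected = {key: task[key] for key in keys if key in task}
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


# The task as a built-in recipe would see it: text only
plain_task = {"text": "This is sentence number 1"}

# The same task after my recipe adds its UI-only fields
custom_task = {
    "text": "This is sentence number 1",
    "options": [{"id": "option_1", "text": "Option 1"}],
    "html": "<h3>Edge case category</h3>",
}

# Hashing on "text" alone ignores the extra fields
same_input = hash_by_keys(plain_task, ("text",)) == hash_by_keys(custom_task, ("text",))

# Including "options" in the hashed keys makes the two tasks differ
differs_with_options = (
    hash_by_keys(plain_task, ("text", "options"))
    != hash_by_keys(custom_task, ("text", "options"))
)

print(same_input, differs_with_options)
```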
Here is my custom recipe:
import prodigy
from prodigy.core import Arg, recipe
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL
import spacy
from pathlib import Path
from prodigy import set_hashes

@prodigy.recipe(
    "ner-re-custom",
    dataset=Arg(help="Dataset to save answers to."),
    file_path=Arg(help="Path to texts"),
)
def ner_re_custom(dataset: str, file_path):
    stream = get_stream(file_path)  # load the JSONL file
    nlp = spacy.blank("en")  # blank spaCy pipeline for tokenization
    stream.apply(create_hashes, stream)  # set input and task hashes on each example
    stream.apply(add_tokens, nlp, stream)  # tokenize the stream for the relations interface
    stream.apply(add_options, stream)  # add options to each example
    stream.apply(add_html, stream)  # add html to each example
    blocks = [
        {"view_id": "relations"},
        {"view_id": "choice", "text": None},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"},
    ]
    # read the js code
    custom_js_path = Path(__file__).resolve().parent / "custom.js"
    custom_js = custom_js_path.read_text()
    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "relations_span_labels": ["SLABEL"],
            "blocks": blocks,
            "choice_style": "multiple",
            "wrap_relations": True,
            "javascript": custom_js,
            "custom_theme": {
                "cardMaxWidth": 1500,
                "smallText": 16,
                "relationHeightWrap": 40,
            },
        },
    }

def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"},
        {"id": "option_3", "text": "Option 3"},
    ]
    for eg in stream:
        eg["options"] = options
        yield eg

def add_html(stream):
    """Adds html field to the stream examples"""
    html_string = '<h3 style="text-align: left; margin-bottom: 0;">Edge case category</h3>'
    for eg in stream:
        eg["html"] = html_string
        yield eg

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text"), task_keys=("spans", "arcs"))
        yield eg
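
One detail in create_hashes that may or may not matter: in plain Python, ("text") is just the string "text", not a one-element tuple; a trailing comma is needed to make it a tuple. Whether set_hashes tolerates a plain string for input_keys is not something this snippet shows, so treat it only as a possible factor. A quick check, independent of Prodigy:

```python
as_written = ("text")  # parentheses only: this is the str "text"
as_tuple = ("text",)   # trailing comma: a one-element tuple

print(type(as_written).__name__, type(as_tuple).__name__)

# Iterating over the string yields single characters, not the key name
print(list(as_written), list(as_tuple))
```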
Thanks!