hi @rosamond!
Yep, there was a problem with the script. Essentially, I needed to run set_hashes
to get unique input hashes (see deduplication/hashing for why/how hashes are used).
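To give a rough feel for why the choice of input keys matters, here is a small stdlib-only sketch. It is not Prodigy's actual hashing algorithm: toy_input_hash and the sample sentences are made up for illustration. The assumption is simply that tasks with the same input hash are treated as the same input.

```python
import hashlib
import json


def toy_input_hash(task, input_keys):
    """Hash only the selected keys of a task (illustrative, not Prodigy's algorithm)."""
    selected = {key: task[key] for key in input_keys if key in task}
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Two sentence-level tasks that come from the same (made-up) paragraph.
paragraph = "The court ruled. The appeal failed."
tasks = [
    {"paragraph": paragraph, "text": "The court ruled."},
    {"paragraph": paragraph, "text": "The appeal failed."},
]

# Hashing on the paragraph alone: both sentence tasks collide.
by_paragraph = {toy_input_hash(t, ("paragraph",)) for t in tasks}
# Hashing on sentence text plus paragraph: each task gets its own hash.
by_sentence = {toy_input_hash(t, ("text", "paragraph")) for t in tasks}

print(len(by_paragraph), len(by_sentence))
```

This is the idea behind passing input_keys=("text", "paragraph") to set_hashes in the recipe: each sentence task ends up with its own input hash even though the paragraph is shared.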
For example, if you run:
PRODIGY_CONFIG_OVERRIDES='{"feed_overlap":true}' python3 -m prodigy textcat_sent_sequence sq en_core_web_sm paragraphs.jsonl --label LEGAL1 -F textcat_sent_sequence.py
Open up a browser and use session=ryan. It starts with the first example, and I can label all of the examples.
At any time, if you open a new session, which we'll call session=ryan2, it will start at the first example again.
Here's the updated recipe:
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from typing import List, Optional
from prodigy import set_hashes
from prodigy.util import split_string


# Helper functions for adding user provided labels to annotation tasks.
def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        yield task


def add_labels_to_stream(stream, labels):
    for task in stream:
        task["label"] = labels[0]
        yield task


@prodigy.recipe(
    "textcat-sent-sequence",
    dataset=("Dataset to save answers to", "positional", None, str),
    spacy_model=("spaCy model to load", "positional", None, str),
    source=("Examples to load from disk", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclusive=("Treat classes as mutually exclusive", "flag", "E", bool),
)
def textcat_sent_sequence(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    exclusive: bool = False
):
    # Render highlight of each sentence
    def add_html(examples):
        for ex in examples:
            doc = nlp(ex["paragraph"])
            for sent in doc.sents:
                summary_highlight = ex["paragraph"]
                summary_highlight = summary_highlight.replace(
                    sent.text, f"<u style='background-color: yellow;'>{sent.text}</u>"
                )
                ex["text"] = sent.text
                ex["html"] = f"{summary_highlight}"
                yield ex

    def set_hash(examples):
        stream = (set_hashes(eg, input_keys=("text", "paragraph")) for eg in examples)
        return stream

    # import spaCy
    nlp = spacy.load(spacy_model)

    # set up stream; may want get_stream() instead to hash/avoid dedup
    stream = JSONL(source)
    stream = add_html(stream)

    # Add labels to each task in stream
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)

    # delete html key in output data
    def before_db(examples):
        for ex in examples:
            del ex["html"]
        return examples

    return {
        "before_db": before_db,
        "dataset": dataset,
        "stream": set_hash(stream),
        "view_id": "choice" if has_options else "classification",
        "config": {  # Additional config settings, mostly for app UI
            "choice_style": "single" if exclusive else "multiple",  # Style of choice interface
            "exclude_by": "input" if has_options else "task",  # Hash value used to filter out already seen examples
        },
    }
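If it helps to see the per-sentence logic in isolation, here is a minimal sketch of what add_html does, without Prodigy or spaCy. The regex splitter is only a naive stand-in for doc.sents, and sentence_tasks and the sample paragraph are made-up names for illustration. One deliberate difference from the recipe: the sketch yields a shallow copy per sentence so the tasks don't share one dict.

```python
import re


def sentence_tasks(example):
    """Yield one independent task per sentence of example["paragraph"]."""
    paragraph = example["paragraph"]
    # Naive stand-in for spaCy's doc.sents: split after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
    for sentence in sentences:
        # Underline and highlight the current sentence within the full paragraph.
        highlighted = paragraph.replace(
            sentence, f"<u style='background-color: yellow;'>{sentence}</u>"
        )
        task = dict(example)  # shallow copy so tasks don't share state
        task["text"] = sentence
        task["html"] = highlighted
        yield task


tasks = list(sentence_tasks({"paragraph": "First point. Second point."}))
for task in tasks:
    print(task["text"], "->", task["html"])
```

Note that each input record needs a paragraph key, since that is what the recipe reads from every line of the source file.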
I went back and cleaned up some other aspects of the recipe:
- Consistent with Prodigy's default recipes, I changed the name of the input file argument from examples to source.
- Also consistent with Prodigy's default recipes, I changed the order of the inputs, putting spacy_model (renamed from model) before source.
- I noticed you previously hard-coded the label (see ex["label"] = "LEGAL INTERPRETATION"). I removed this and kept the generalization to multiple labels.
- I also added a new input, --exclusive. This is a flag for when you are using multiple labels: if you set --exclusive, it forces users to choose only one label (i.e., mutually exclusive categories). If it's not set (the default), users can select multiple labels (i.e., non-mutually exclusive categories). This too is consistent with Prodigy's textcat recipes.
- Also, I changed the recipe function name from textcat_topic to textcat_sent_sequence to be consistent with the recipe name in the decorator, textcat-sent-sequence, which typically uses "-".
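To make the interaction between the number of labels and --exclusive concrete, here is a small sketch that mirrors the conditionals in the recipe's return value. ui_settings is a hypothetical helper (it isn't part of the recipe or of Prodigy), and LEGAL2 is a made-up second label.

```python
def ui_settings(labels, exclusive=False):
    """Mirror how the recipe picks the interface from the labels and the flag."""
    has_options = len(labels) > 1
    return {
        # Several labels: multiple-choice UI; one label: accept/reject UI.
        "view_id": "choice" if has_options else "classification",
        # --exclusive limits the choice UI to a single selection.
        "choice_style": "single" if exclusive else "multiple",
        # Which hash is used to filter out already seen examples.
        "exclude_by": "input" if has_options else "task",
    }


print(ui_settings(["LEGAL1"]))
print(ui_settings(["LEGAL1", "LEGAL2"], exclusive=True))
```

With a single label, choice_style is still set but shouldn't matter, since the interface isn't the choice view.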
Let me know if this doesn't solve your problem.