Hi @magdaaniol,
Here is the output of the Prodigy run with the PRODIGY_LOGGING=verbose setting. I have trimmed some of the text because it is too long (note: I cannot attach a text file in this reply).
12:01:55: CLI: Importing file …/prodigy/recipe.py
Using 6 label(s): Appreciation, Problem, Suggestion, Neutral, Personal Comment,
Off Topic
Using 6 label(s): Appreciation, Problem, Suggestion, Neutral, Personal Comment,
Off Topic
12:01:56: RECIPE: Calling recipe 'ner.withtextcat'
12:01:58: .prodigy/prodigy.json
12:01:58: VALIDATE: Validating components returned by recipe
12:01:58: CONTROLLER: Initialising from recipe
12:01:58: CONTROLLER: Recipe Config
12:01:58: {'lang': 'id', 'labels': ['Appreciation', 'Problem', 'Suggestion', 'Neutral', 'Personal Comment', 'Off Topic'], 'choice_style': 'multiple', 'blocks': [{'view_id': 'ner_manual'}, {'view_id': 'choice', 'text': None}], 'dataset': 'peer-review-masdig-v2', 'recipe_name': 'ner.withtextcat', 'theme': 'spacy', 'custom_theme': {'labels': {'Appreciation': '#7fffd4', 'Problem': '#9932cc', 'Suggestion': '#ff00ff', 'Neutral': '#00ff7f', 'Personal Comment': '#ff6347', 'Off Topic': '#00bfff'}}, 'buttons': ['accept', 'ignore', 'undo'], 'batch_size': 5, 'history_size': 5, 'port': 8080, 'host': '192.168.15.18', 'cors': True, 'db': 'sqlite', 'db_settings': {}, 'validate': True, 'auto_exclude_current': True, 'instant_submit': False, 'feed_overlap': True, 'annotations_per_task': None, 'allow_work_stealing': False, 'total_examples_target': 0, 'ui_lang': 'en', 'project_info': ['dataset', 'session', 'lang', 'recipe_name', 'view_id', 'label'], 'show_stats': True, 'hide_meta': False, 'show_flag': True, 'instructions': False, 'swipe': False, 'swipe_gestures': {'left': 'accept', 'right': 'reject'}, 'split_sents_threshold': False, 'html_template': False, 'global_css': None, 'global_css_dir': None, 'javascript': None, 'javascript_dir': None, 'writing_dir': 'ltr', 'show_whitespace': False}
12:01:58: VALIDATE: Creating validator for view ID 'blocks'
12:01:58: CONTROLLER: Using `full_overlap` router.
12:01:58: VALIDATE: Validating Prodigy and recipe config
12:01:58: PREPROCESS: Tokenizing examples (running tokenizer only)
12:01:58: .prodigy/prodigy.json
12:01:58: DB: Creating unstructured dataset '2024-11-26_12-01-58'
12:01:58: {'created': datetime.datetime(2024, 11, 11, 6, 17, 18)}
12:01:58: CORS: initialized with wildcard "*" CORS origins
Starting the web server at http://192.168.15.18:8080 ...
Open the app in your browser and start annotating!
INFO: Started server process [53466]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://192.168.15.18:8080 (Press CTRL+C to quit)
INFO: 192.168.15.59:26340 - "GET /?session=ysp HTTP/1.1" 200 OK
INFO: 192.168.15.59:26340 - "GET /bundle.js HTTP/1.1" 200 OK
12:03:54: .prodigy/prodigy.json
INFO: 192.168.15.59:26350 - "GET /userinfo HTTP/1.1" 404 Not Found
INFO: 192.168.15.59:26340 - "GET /project/ysp HTTP/1.1" 200 OK
INFO: 192.168.15.59:26340 - "GET /fonts/robotocondensed-bold.woff2 HTTP/1.1" 200 OK
INFO: 192.168.15.59:26350 - "GET /fonts/lato-regular.woff2 HTTP/1.1" 200 OK
INFO: 192.168.15.59:26360 - "GET /fonts/lato-bold.woff2 HTTP/1.1" 200 OK
INFO: 192.168.15.59:26340 - "GET /favicon.ico HTTP/1.1" 200 OK
12:03:54: .prodigy/prodigy.json
12:03:54: POST: /get_session_questions
12:03:54: CONTROLLER: Getting batch of questions for session: peer-review-masdig-v2-ysp
12:03:54: STREAM: Created queue for peer-review-masdig-v2-ysp.
12:03:54: ROUTER: Routing item with _task_hash=1422345350 -> ['peer-review-masdig-v2-ysp']
...
12:03:55: ROUTER: Routing item with _task_hash=-1709185225 -> ['peer-review-masdig-v2-ysp']
12:03:55: RESPONSE: /get_session_questions (0 examples)
12:03:55: {'tasks': [], 'total': 727, 'progress': None, 'session_id': 'peer-review-masdig-v2-ysp'}
INFO: 192.168.15.59:26340 - "POST /get_session_questions HTTP/1.1" 200 OK
INFO: 192.168.15.59:5628 - "GET /?session=ysp HTTP/1.1" 200 OK
INFO: 192.168.15.59:5628 - "GET /bundle.js HTTP/1.1" 200 OK
12:04:09: .prodigy/prodigy.json
INFO: 192.168.15.59:5628 - "GET /project/ysp HTTP/1.1" 200 OK
INFO: 192.168.15.59:5628 - "GET /userinfo HTTP/1.1" 404 Not Found
INFO: 192.168.15.59:5628 - "GET /fonts/robotocondensed-bold.woff2 HTTP/1.1" 200 OK
INFO: 192.168.15.59:5642 - "GET /fonts/lato-regular.woff2 HTTP/1.1" 200 OK
INFO: 192.168.15.59:5658 - "GET /fonts/lato-bold.woff2 HTTP/1.1" 200 OK
12:04:09: .prodigy/prodigy.json
12:04:09: POST: /get_session_questions
12:04:09: CONTROLLER: Getting batch of questions for session: peer-review-masdig-v2-ysp
12:04:09: RESPONSE: /get_session_questions (0 examples)
12:04:09: {'tasks': [], 'total': 727, 'progress': None, 'session_id': 'peer-review-masdig-v2-ysp'}
INFO: 192.168.15.59:5628 - "POST /get_session_questions HTTP/1.1" 200 OK
The "No tasks available" screen immediately appears when I access the server. I use the Prodigy 1.16.0 version. Here is the custom recipe that I used:
import copy
from typing import List, Optional

import spacy

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string, set_hashes, get_labels


def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    # Process the stream using spaCy's nlp.pipe, which yields Doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            # Skip the predicted entity if it's not in the selected labels
            if labels and ent.label_ not in labels:
                continue
            # Create a span dict for the predicted entity
            spans.append(
                {
                    "token_start": ent.start,
                    "token_end": ent.end - 1,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "text": ent.text,
                    "label": ent.label_,
                }
            )
        task["spans"] = spans
        # Rehash the newly created task so that the hashes reflect the added data
        task = set_hashes(task)
        yield task


def add_options(stream, options):
    """Helper function to add options to every task in a stream."""
    options = [{"id": option, "text": option} for option in options]
    for task in stream:
        task["options"] = options
        yield task


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on the value before it's passed
# to the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "ner.withtextcat",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    labelner=("One or more comma-separated labels", "option", "l", get_labels),
    labeltextcat=("One or more comma-separated labels for text classification", "option", "ltextcat", get_labels),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_withtextcat(
    dataset: str,
    spacy_model: str,
    source: str,
    labelner: Optional[List[str]] = None,
    labeltextcat: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
):
    """
    Create gold-standard data by correcting a model's predictions manually.
    """
    # Load the spaCy model
    nlp = spacy.load(spacy_model)
    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)
    # Tokenize the incoming examples and add a "tokens" property to each
    # example. Also handles pre-defined selected spans. Tokenization allows
    # faster highlighting, because the selection can "snap" to token boundaries.
    stream = add_tokens(nlp, stream)
    # Add the entities predicted by the model to the tasks in the stream
    stream = make_tasks(nlp, stream, labelner)
    stream = add_options(stream, labeltextcat)
    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,   # Name of dataset to save annotations
        "stream": stream,     # Incoming stream of examples
        "exclude": exclude,   # List of dataset names to exclude
        "config": {           # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": labelner,
            "choice_style": "multiple",
            "blocks": [
                {"view_id": "ner_manual"},
                {"view_id": "choice", "text": None},
            ],
        },
    }
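For what it's worth, a quick way to sanity-check the stream outside the web server would be something like the sketch below. The model name, file path, and option list are placeholders, and it assumes the recipe file is importable as recipe.py:

from collections import Counter

import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

from recipe import make_tasks, add_options  # the helpers from the recipe above

nlp = spacy.load("my_ner_model")  # placeholder: the model passed as spacy_model
stream = JSONL("peer_review_dataset.jsonl")  # placeholder path
stream = add_tokens(nlp, stream)
stream = make_tasks(nlp, stream, None)  # None keeps all predicted labels
stream = add_options(stream, ["Appreciation", "Problem"])  # example options

tasks = list(stream)
counts = Counter(task["_task_hash"] for task in tasks)
dupes = {h: n for h, n in counts.items() if n > 1}
print(f"{len(tasks)} tasks, {len(counts)} unique task hashes, {len(dupes)} duplicated")

If the unique count comes out lower than the total, that would match the 740 vs. 727 difference in the statistics below, since Prodigy filters out questions whose hashes it has already seen.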
Lastly, here is the output from your script:
=== Prodigy Annotation Analysis ===
Input file: …/peer-review/dataset/peer_review_dataset.jsonl
Dataset 1: peer-review-masdig-v2-ysp
Dataset 2: peer-review-masdig-v2-dev
⚠ Prodigy automatically assigned an input/task hash because it was
missing. This automatic hashing will be deprecated as of Prodigy v2 because it
can lead to unwanted duplicates in custom recipes if the examples deviate from
the default assumptions. More information can be found on the docs:
https://prodi.gy/docs/api-components#set_hashes
=== Input Stream Statistics ===
Total input hashes: 740
Unique input hashes: 727
Note: Input stream contains 7 duplicate hashes
Hash: 628546903 appears 2 times
Hash: -927966096 appears 2 times
Hash: -2039616366 appears 8 times
Hash: -1373306777 appears 2 times
Hash: 703580025 appears 2 times
Hash: -1794876750 appears 2 times
Hash: -1821170728 appears 2 times
=== Analysis for peer-review-masdig-v2-ysp ===
Input hashes: 727
Unique input hashes: 727
Task hashes: 727
Unique task hashes: 727
=== Analysis for peer-review-masdig-v2-dev ===
Input hashes: 734
Unique input hashes: 734
Task hashes: 734
Unique task hashes: 734
=== Comparison Analysis ===
Comparing peer-review-masdig-v2-ysp with input stream:
Missing from dataset: 0 hashes
Extra in dataset: 0 hashes
Comparing peer-review-masdig-v2-dev with input stream:
Missing from dataset: 0 hashes
Extra in dataset: 7 hashes
Comparing peer-review-masdig-v2-ysp vs peer-review-masdig-v2-dev:
In dataset 1 but not in 2: 0 hashes
In dataset 2 but not in 1: 7 hashes
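Based on that, one way I could try to isolate the examples the ysp annotator has not seen yet is to diff the source file against the dataset by input hash, roughly like this (the hashing here assumes the default set_hashes behaviour, and the file names are just examples):

import srsly
from prodigy.components.db import connect
from prodigy.util import set_hashes

db = connect()
# Input hashes already stored for the ysp session's dataset
done = set(db.get_input_hashes("peer-review-masdig-v2-ysp"))

remaining = []
for eg in srsly.read_jsonl("peer_review_dataset.jsonl"):  # placeholder path
    eg = set_hashes(eg)  # assumes the default hashing keys
    if eg["_input_hash"] not in done:
        remaining.append(eg)

print(f"{len(remaining)} source examples without annotations in ysp")
srsly.write_jsonl("peer_review_remaining.jsonl", remaining)  # example output name

I realise the comparison above reports 0 input hashes missing from ysp, so if the 7 leftover tasks are duplicates by input hash, this diff would come back empty and the filtering would have to happen on task hashes instead.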
I hope this information helps with investigating the problem, and that I can reconfigure things so the annotator can annotate the 7 remaining tasks.
Thank you for your help and assistance. I really appreciate it.