I would like to convert the text in the stream data from full-width to half-width characters and from uppercase to lowercase.
Can you give me some advice?
Ultimately I want to use a Japanese model for the replacement process, but for now I am using an English model for verification.
Is it possible to process the stream data inside the recipe file to achieve this text replacement?
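For reference, the kind of conversion I have in mind is roughly the following (just a sketch; I assume Python's built-in unicodedata NFKC normalization covers the full-width to half-width conversion, and str.lower() handles the case):

import unicodedata

def normalize_text(text: str) -> str:
    # NFKC folds full-width characters such as "ＡＢＣ１２３" into their
    # half-width equivalents ("ABC123"); lower() converts uppercase to lowercase.
    return unicodedata.normalize("NFKC", text).lower()

print(normalize_text("ＡＢＣ　１２３ Ｔｏｋｙｏ"))  # -> "abc 123 tokyo"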
When I print out the stream data as shown in the program below, an error occurs.
I want to replace the text in the stream data, but I can't embed the replacement process because of this error.
If I comment out "stream = load_json(source)", the error does not occur.
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy
from typing import List, Optional


@prodigy.recipe(
    "ner.manual",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_manual(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
):
    nlp = spacy.load(spacy_model)
    stream = JSONL(source)
    stream = load_json(source)  # commenting out this line makes the error go away
    stream = add_tokens(nlp, stream)
    print("after_add_tokens")
    return {
        "view_id": "ner_manual",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": label,  # Selectable label options
            "validate": False,
        },
    }


def load_json(source: str):
    # Print each example's text to check the stream contents.
    stream = JSONL(source)
    for data in stream:
        print(data["text"])
    print("Loop_after")
    return stream
$ prodigy ner.manual example_dataset en_core_web_md news_headlines_bk.jsonl --label Organization,Person,Location,ID,EMAIL,PHONE, -F ner_manual.py
Uber’s Lesson: Silicon Valley’s Start-Up Machine Needs Fixing
Pearl Automation, Founded by Apple Veterans, Shuts Down
How Silicon Valley Pushed Coding Into American Classrooms
Women in Tech Speak Frankly on Culture of Harassment
Loop_after
after_add_tokens
✘ Error while validating stream: no first example
This likely means that your stream is empty. This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.
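My guess is that the for loop inside load_json consumes the generator returned by JSONL(source), so the stream is already empty by the time the recipe returns it, but I am not sure. What I would actually like to end up with is something like the wrapper below (a sketch only; normalize_text is the hypothetical helper sketched above), where the replacement happens lazily while Prodigy reads from the stream:

def preprocess_stream(stream):
    # Yield modified copies of the examples instead of looping over the
    # stream up front, so the generator is not exhausted before Prodigy uses it.
    for eg in stream:
        eg["text"] = normalize_text(eg["text"])
        yield eg

# Inside ner_manual, instead of "stream = load_json(source)":
#     stream = JSONL(source)
#     stream = preprocess_stream(stream)
#     stream = add_tokens(nlp, stream)

Is this the right way to do the replacement inside a recipe, or is there a recommended approach?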