How to replace text in stream data?

I would like to convert text in the stream data from full-width to half-width characters and from uppercase to lowercase.
Can you give me some advice?
Ultimately I want to use a Japanese model for the replacement process, but I am using an English model for verification.
Is it possible to process the stream data inside the recipe file to achieve this text replacement?
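
For reference, the conversion I have in mind is something like the following sketch, assuming Python's built-in unicodedata.normalize with the NFKC form handles the full-width to half-width conversion:

import unicodedata

def normalize_text(text: str) -> str:
    # NFKC normalization converts full-width ASCII letters and digits to half-width
    text = unicodedata.normalize("NFKC", text)
    # lower() converts uppercase to lowercase
    return text.lower()

print(normalize_text("ＨＥＬＬＯ Ｗｏｒｌｄ １２３"))  # -> "hello world 123"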

If I print out the stream data as shown in the following program, an error occurs.
I want to replace the text in the stream data, but I can't embed the replacement process because of the error.
If I comment out "stream = load_json(source)", the error does not occur.

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy
from typing import List, Optional


@prodigy.recipe(
    "ner.manual",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
    
)
def ner_manual(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
):
    nlp = spacy.load(spacy_model)

    stream = JSONL(source)
    stream = load_json(source)  # print out the stream data here (this line causes the error)

    stream = add_tokens(nlp, stream)
    print("after_add_tokens")
    return {
        "view_id": "ner_manual",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": label,  # Selectable label options
            "validate": False            
        },
        
    }

def load_json(source: str):
    stream = JSONL(source)  
    for data in stream:
        print(data["text"])
    print("Loop_after")
    return stream

$  prodigy ner.manual example_dataset en_core_web_md news_headlines_bk.jsonl --label Organization,Person,Location,ID,EMAIL,PHONE, -F ner_manual.py
Uber’s Lesson: Silicon Valley’s Start-Up Machine Needs Fixing
Pearl Automation, Founded by Apple Veterans, Shuts Down
How Silicon Valley Pushed Coding Into American Classrooms
Women in Tech Speak Frankly on Culture of Harassment
Loop_after
after_add_tokens

✘ Error while validating stream: no first example
This likely means that your stream is empty. This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

Hi! The problem here is that stream is a Python generator, so by iterating over it, you're consuming it, and what's left is an empty generator. If you want to modify the stream, the best way to do it is to wrap it in a generator function. For example:

def load_json(source: str):
    stream = JSONL(source)
    for eg in stream:
        print(eg["text"])
        # do something to the example here...
        yield eg
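
Applied to your original goal (full-width to half-width and uppercase to lowercase), a minimal sketch of the same wrapper could look like this, assuming unicodedata.normalize with the NFKC form for the width conversion:

import unicodedata
from prodigy.components.loaders import JSONL

def load_json(source: str):
    stream = JSONL(source)
    for eg in stream:
        # NFKC converts full-width ASCII characters to half-width,
        # lower() converts uppercase to lowercase
        eg["text"] = unicodedata.normalize("NFKC", eg["text"]).lower()
        yield eg

Because the function yields each example instead of looping over the stream up front, the recipe still receives a lazy generator and add_tokens runs on the already modified text.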

Hi Ines,

Thanks for the quick reply!
I've resolved this issue and will close it.
Thank you very much for your help.
