ner-food-ingredients (tutorial) still trying to get it to run

deewuok · December 22, 2023, 7:25pm

My current problem is that I get:
on output of the 01_Preprocess_Reddit.ipynb, code at the bottom.

12/22/2023 01:12 PM 0 reddit.jsonl

This is straight from the examples I think (perhaps I dont understand how to enter the correct gz file

I also tried futzing with the iterator (with things like

" # .gz archive or directory of archives
OUTPUT_FILE = "C:\Users\dwu\Documents\ner_Prodigy\ner-food-ingredients\reddit.jsonl" # path to output JSONL
#%%
!pip install srsly
#%%
import re
from pathlib import Path
import gzip
import srsly
#%%
class Reddit(object):
"""Stream cleaned comments from Reddit."""

pre_format_re = re.compile(r"^[\`\*\~]")
post_format_re = re.compile(r"[\`\*\~]$")
url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

def __init__(
    self, file_path, meta_keys={"subreddit": "section", "created_utc": "utc"}
):
    """
    file_path (unicode / Path): Path to archive or directory of archives.
    meta_keys (dict): Meta data key included in the Reddit corpus, mapped
        to display name in Prodigy meta.
    RETURNS (Reddit): The Reddit loader.
    """
    self.meta = meta_keys
    self.file_path = Path(file_path)
    if not self.file_path.exists():
        raise IOError(f"Can't find file path: {self.file_path}")

def __iter__(self):
    for file_path in self.iter_files():
        with gzip.open(str(file_path), "rb") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    comment = srsly.json_loads(line)
                except:
                    print(line)
                if self.is_valid(comment):
                    text = self.strip_tags(comment["body"])
                    yield {"text": text, "meta": self.get_meta(comment)}

def get_meta(self, item):
    return {name: item.get(key, "n/a") for key, name in self.meta.items()}

def iter_files(self):
    if not self.file_path.is_dir():
        return [self.file_path]
    yield from self.file_path.glob("**/*.gz")

def strip_tags(self, text):
    text = self.link_re.sub(r"\1", text)
    text = text.replace("&gt;", ">").replace("&lt;", "<")
    text = self.pre_format_re.sub("", text)
    text = self.post_format_re.sub("", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def is_valid(self, comment):
    return (
        comment["body"] is not None
        and comment["body"] != "[deleted]"
        and comment["body"] != ""
    )

#%%
stream = Reddit(INPUT_DATA)
srsly.write_jsonl(OUTPUT_FILE, stream)
#%%
stream = Reddit(INPUT_DATA)
for x in stream.iter_files():
print(x)
#%%
stream = Reddit(INPUT_DATA)
i=0
for x in stream:
if i == 5:
break
print(x)

All of these produce null output

If anyone has ideas that would be great!
Dee

INPUT_DATA = "C:\Users\dwu\Documents\ner_Prodigy\ner-food-ingredients\s2v_reddit_2015_md.tar.gz" # .gz archive or directory of archives
OUTPUT_FILE = "C:\Users\dwu\Documents\ner_Prodigy\ner-food-ingredients\reddit.jsonl" # path to output JSONL
#%%
!pip install srsly
#%%
import re
from pathlib import Path
import gzip
import srsly
#%%
class Reddit(object):
"""Stream cleaned comments from Reddit."""

pre_format_re = re.compile(r"^[\`\*\~]")
post_format_re = re.compile(r"[\`\*\~]$")
url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

def __init__(
    self, file_path, meta_keys={"subreddit": "section", "created_utc": "utc"}
):
    """
    file_path (unicode / Path): Path to archive or directory of archives.
    meta_keys (dict): Meta data key included in the Reddit corpus, mapped
        to display name in Prodigy meta.
    RETURNS (Reddit): The Reddit loader.
    """
    self.meta = meta_keys
    self.file_path = Path(file_path)
    if not self.file_path.exists():
        raise IOError(f"Can't find file path: {self.file_path}")

def __iter__(self):
    for file_path in self.iter_files():
        with gzip.open(str(file_path), "rb") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                comment = srsly.json_loads(line)
                if self.is_valid(comment):
                    text = self.strip_tags(comment["body"])
                    yield {"text": text, "meta": self.get_meta(comment)}

def get_meta(self, item):
    return {name: item.get(key, "n/a") for key, name in self.meta.items()}

def iter_files(self):
    if not self.file_path.is_dir():
        return [self.file_path]
    yield from self.file_path.glob("**/*.gz")

def strip_tags(self, text):
    text = self.link_re.sub(r"\1", text)
    text = text.replace("&gt;", ">").replace("&lt;", "<")
    text = self.pre_format_re.sub("", text)
    text = self.post_format_re.sub("", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def is_valid(self, comment):
    return (
        comment["body"] is not None
        and comment["body"] != "[deleted]"
        and comment["body"] != ""
    )

#%%
stream = Reddit(INPUT_DATA)
srsly.write_jsonl(OUTPUT_FILE, stream)
#%%

PS. I was able to get the [tok2vec_cd8_model289.bin] here

https://github.com/explosion/projects/releases/download/tok2vec/tok2vec_cd8_model289.bin): GitHub - Dwonczykj/ner_food

NOTE: I also tried to un-gz and untar the files: and point to the INPUT_DATA = "C:\Users\dwu\Documents\ner_Prodigy\ner-food-ingredients\s2v_old"
but that didnt work either

deewuok · December 28, 2023, 10:34pm

Hi, Still not able to get the code working (trying to rewrite my question a little more clearer!)

based on the code that is in this:

(projects/tutorials/ner_food_ingredients at v3 · explosion/projects · GitHub)

INPUT_DATA = "C:\\Users\\dwu\\Documents\\ner_Prodigy\\ner-food-ingredients\\s2v_reddit_2015_md.tar.gz"       # .gz archive or directory of archives
OUTPUT_FILE = "C:\\Users\\dwu\\Documents\\ner_Prodigy\\ner-food-ingredients\\reddit.jsonl" 

# I also tried to untar/uncompress  s2v_reddit_2015_md.tar.gz and point to s2v_old directory (that's didnt work either) 

!pip install srsly

import re
from pathlib import Path
import gzip
import srsly

class Reddit(object):
    """Stream cleaned comments from Reddit."""

    pre_format_re = re.compile(r"^[\`\*\~]")
    post_format_re = re.compile(r"[\`\*\~]$")
    url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
    link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

    def __init__(
        self, file_path, meta_keys={"subreddit": "section", "created_utc": "utc"}
    ):
        """
        file_path (unicode / Path): Path to archive or directory of archives.
        meta_keys (dict): Meta data key included in the Reddit corpus, mapped
            to display name in Prodigy meta.
        RETURNS (Reddit): The Reddit loader.
        """
        self.meta = meta_keys
        self.file_path = Path(file_path)
        if not self.file_path.exists():
            raise IOError(f"Can't find file path: {self.file_path}")

    def __iter__(self):
        for file_path in self.iter_files():
            with gzip.open(str(file_path), "rb") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        comment = srsly.json_loads(line)
                    except:
                        print(line)
                    if self.is_valid(comment):
                        text = self.strip_tags(comment["body"])
                        yield {"text": text, "meta": self.get_meta(comment)}

    def get_meta(self, item):
        return {name: item.get(key, "n/a") for key, name in self.meta.items()}

    def iter_files(self):
        if not self.file_path.is_dir():
            return [self.file_path]
        yield from self.file_path.glob("**/*.gz")

    def strip_tags(self, text):
        text = self.link_re.sub(r"\1", text)
        text = text.replace("&gt;", ">").replace("&lt;", "<")
        text = self.pre_format_re.sub("", text)
        text = self.post_format_re.sub("", text)
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    def is_valid(self, comment):
        return (
            comment["body"] is not None
            and comment["body"] != "[deleted]"
            and comment["body"] != ""
        )
#%%
stream = Reddit(INPUT_DATA)
srsly.write_jsonl(OUTPUT_FILE, stream)

stream = Reddit(INPUT_DATA)
srsly.write_jsonl(OUTPUT_FILE, stream)

I also tried to iterate through the stream etc with no result.

stream = Reddit(INPUT_DATA)
for x in stream.iter_files():
print(x)

This code came from:
[01_Preprocess_Reddit.ipynb] (projects/tutorials/ner_food_ingredients/notebooks/01_Preprocess_Reddit.ipynb at v3 · explosion/projects · GitHub)

The original file is:
12/22/2023 10:39 AM 600,444,501 s2v_reddit_2015_md.tar.gz (has a size that makes sense)

**THE ISSUE IS that OUTPUTFILE provides a file with 0 size i.e. 12/28/2023 04:01 PM 0 reddit.jsonl. **
I'm not sure what is going on. if someone can let me know

magdaaniol · January 3, 2024, 11:09am

Hi @deewuok,

The reason the tutorial script doesn't work for you is that you're passing the sense2vec vectors as input file where raw comments data is expected.
s2v_reddit_2015_md.tar.gz (the INPUT_DATA in your snippet) contains the trained word vectors not the raw comments data.

It used to be possible to download the raw data here: https://files.pushshift.io/reddit/comments/. I believe that now you need to be a registered user of Reddit API to do that.
You could also skip the preprocessing step and continue the tutorial with the Prodigy ready data that is versioned in tutorial's github.

Topic		Replies	Views
Having problems with file during ner.manual (Error while validating stream: no first example) usage , solved , streams	3	874	November 16, 2021
Strange OSError Using the Reddit Loader ner , solved	3	728	July 23, 2018
Can't find recipe usage	1	273	June 22, 2023
First Project Data won't load to prodigy ner , project	5	330	August 16, 2023
Error while running a variant of the old ner_manual recipe in Python3.12	2	115	May 2, 2024

ner-food-ingredients (tutorial) still trying to get it to run

Related topics