Need Help Please: "ValueError: Can't read file: s2v_reddit_2015_md\cfg"

I just purchased Prodigy. I really need it to work. I have been following this video (Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning - YouTube) to the letter. In fact, it is the main reason I bought the product, as I am trying to develop an app that can read recipes. I read the installation guide and I thought I was doing everything right. But I continue to receive this cryptic message "ValueError: Can't read file: s2v_reddit_2015_md\cfg" every time I try to run it. I am a beginner with a decent background in Python, so I basically know what I am doing. Can you please help me get up and running? I have absolutely no idea what I am doing wrong. Here is my code below, straight from the video:

python -m prodigy sense2vec.teach food_terms s2v_reddit_2015_md --seeds "garlic, avocado, cottage cheese, olive oil, cumin, chicken breast, beef, iceberg lettuce"

Please help.

I purchased this product this week and I am having all kinds of problems. I spent the weekend trying to track down solutions, but it only leads to more problems. Because I could not follow the tutorial above (still getting the "ValueError: Can't read file: s2v_reddit_2015_md\cfg" no matter what I do), I decided instead to try the ner.manual recipe. Yesterday, I compiled my own JSONL file formatted exactly like the file food_patterns.jsonl.

But it does not work with the recipe. I tried to find out why and ran it through the JSON linter program as recommended on another post. I received an error message that reads as follows:

"Error: Parse error on line 1: { 'label': 'FOOD', 'p
--^Expecting 'STRING', '}', got 'undefined."

That makes no sense as, visually at least, my file is the same as presented in your tutorials. So the YouTube tutorial, for whatever reason, does not work because it cannot read the "s2v_reddit_2015_md\cfg" and there is no documentation for that error message or how to handle it. Secondly, I am finding out the JSONL file I spent many hours creating - not knowing any short cuts - is useless.

And I have not heard from anyone in response to my original issue. I don't know what to say at this point except I hope to hear back soon or I will have to file a dispute through PayPal.

I spent the money because I need a program that will enable me to identify food items, quantities and units of measurement. I thought this would make it a lot easier than writing my own code. But I have now invested too many hours looking for solutions that just don't seem to exist. I knew it would be a challenge, but it's too much effort for little reward without adequate technical support.

I will give it a few more days hoping to hear back from someone. I understand that a new piece of software requires a learning curve and more patience, but it also requires strong technical support.

Please let me know how to move forward.

(In addition, I purchased the book Mastering spaCy on Amazon, which specifically purports to tutor the reader in spaCy 3.0 in the preface. But it does not. The lessons and sample code are all written to support 2.0. The only current resource is the book's website, which is free of charge.)

Here is my JSONL file if it helps.
ing_patterns.jsonl (1.6 MB)

As a final note, I will add that I submitted the JSONL code from food_patterns.jsonl, available through one of your tutorials, to the same JSON linter recommended in another post and received a similar EOF error message as when I submitted my own.

Hi and sorry you were having problems! We're trying our best to help everyone on this forum but we're not able to be available 24/7 – you posted your question on Friday night and it's now Monday for me and I'm just going through people's questions.

We're very easy to talk to and if you find that Prodigy is not a good fit for what you're trying to do, we're happy to issue you a refund. You shouldn't have to file a dispute for that purpose. Just email us at contact@explosion.ai.

This sounds like sense2vec can't find the file on disk. Are you sure you downloaded the correct pretrained sense2vec vectors and you're providing the correct path to them? You can download the vectors from here: GitHub - explosion/sense2vec: 🦆 Contextually-keyed word vectors. The path on the CLI should be the path to the downloaded directory, e.g. /path/to/s2v_reddit_2015_md.

I just had a look at your patterns file and the problem is that it contains single quotes instead of double quotes, which JSON requires. The linter you were using just wasn't very helpful here, but I think that's what it's trying to tell you. So if you replace the single quotes with double quotes, the lines should be valid JSON.

When using a linter for JSONL files, keep in mind that it's newline-delimited JSON, so you want to be validating the individual lines (which are JSON objects). So if you put the whole file in there, it's expected that a standard JSON linter will be confused by the newlines.
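If it helps, here's a small sketch of what "validating the individual lines" means in practice — a hypothetical helper (not part of Prodigy) that parses each line of a JSONL file with Python's standard json module and reports which lines fail:

```python
import json

def validate_jsonl(path):
    """Parse each line of a newline-delimited JSON file.

    Returns a list of (line_number, error_message) pairs for
    every line that isn't valid JSON; an empty list means the
    whole file parsed cleanly.
    """
    errors = []
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines between records are fine
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, str(e)))
    return errors
```

A line with single quotes (like `{'label': 'FOOD'}`) will be reported with a parse error, while the same line with double quotes passes — which is exactly the fix described above.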

I have the book as well and from what I can tell, it does make a good effort to show examples for spaCy v2 and v3 where the versions differ. The reality is that most of the user-facing API didn't change very much between the two versions so most inference code you write is identical for v2 and v3. The main difference is around training and there are a couple of new features that are only available in v3. We also provide very in-depth documentation on the website: https://spacy.io/usage, as well as an active discussion forum: Discussions · explosion/spaCy · GitHub

Hi,

I appreciate your response. I apologize for airing my frustrations. As a disabled man, I don't have "weekends" really and I often don't realize it's the weekend for other folks.

With that said, I tried everything I could, navigating error message after error message. I am still struggling, but I am not quite ready to give up.

I looked at the s2v_reddit_2015_md\cfg file and here are the contents:

{
  "senses": [
    "NOUN",
    "VERB"
  ]
}

That's it. There's nothing else in there.

Is that what I should find?

The error message I continue to receive is "can't find the factory for sense2vec." I installed transformers and added the decorators as suggested. Still, no luck. It is possible I am not adding the decorators in the correct manner. I know what they are, and I know they are supposed to sit on top of the function, but that's about it. Here is the template I am working with right now, which comes from the instructions on your website. I placed s2v_reddit_2015_md in the main directory instead of a "data" subfolder to simplify matters.

import spacy
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)

If I can just get over this obstacle, I feel like I can handle the rest.

I look forward to hearing back from you.

Thank you again for your patience with me.

Robert Pfaff

P.S. Here is the error message in its entirety if that helps:

File "C:\Users\rober\prod\venv\lib\site-packages\srsly\_json_api.py", line 51, in read_json
file_path = force_path(path)
File "C:\Users\rober\prod\venv\lib\site-packages\srsly\util.py", line 24, in force_path
raise ValueError(f"Can't read file: {location}")
ValueError: Can't read file: s2v_reddit_2015_md\cfg

I am working within a venv virtual environment if that helps.

Thanks! Including the full error message and traceback is always helpful because it often contains relevant clues. It still sounds like the main problem is that the sense2vec recipe can't find the vectors you've downloaded. Are you 100% sure that when you run this command:

... the path to s2v_reddit_2015_md is correct and in your working directory? You can also specify an absolute path just to be sure, e.g. C:\wherever\your\files\are\s2v_reddit_2015_md.

The second argument of the sense2vec.teach recipe is a path to the sense2vec vectors you downloaded. The directory should contain the cfg and then a bunch of large binary files (the vectors). If not, maybe double-check that you've downloaded and extracted the files correctly.
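If it's easier than eyeballing the folder, here's a quick sanity-check sketch (assuming the standard layout of the extracted s2v_reddit_2015_md directory: a cfg file next to several large binary vector files; the helper name is made up):

```python
from pathlib import Path

def check_s2v_dir(path):
    """Return a list of problems with a sense2vec vectors directory.

    An empty list means the directory looks plausible: it exists,
    contains a cfg file, and has at least one large binary file
    (the vectors themselves).
    """
    problems = []
    p = Path(path)
    if not p.is_dir():
        problems.append(f"{p} is not a directory")
        return problems
    if not (p / "cfg").is_file():
        problems.append("missing 'cfg' - maybe the archive wasn't fully extracted?")
    # The vectors are large binary files sitting next to cfg
    if not any(f.is_file() and f.stat().st_size > 1_000_000 for f in p.iterdir()):
        problems.append("no large binary files found - vectors may be missing")
    return problems
```

Running `check_s2v_dir("s2v_reddit_2015_md")` from the same working directory you launch Prodigy from will tell you whether the recipe can see what it needs at that path.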

I am 99.9% sure, but I am also doubting my basic competency levels at this point.

I am uploading two images - one showing the file downloaded to the main directory and the other showing it downloaded to the data subfolder, which I knew was overkill, but I wanted to be extra certain.

In addition, when I try to use this command from the instructions ("cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz"), as instructed in the recipe, I receive an error message that cat cannot be found, and I was unable to identify and install it.

Could that be the issue?

The one recipe that works, if this helps, is this one:

prodigy ner.manual ner_news en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG,PRODUCT

But, if I try to change or tweak it to suit my purposes, I run into problems. For example, I did figure out over the weekend that I needed to change the quotes in my JSONL file to double quotes, and it passed the linter test when I tested it line by line. So that file should work.

But, if I change news_headlines.jsonl to food_patterns.jsonl, for example, I receive the "error when validating stream: no first example" error message.

From reading a related post, I understand that means there is something wrong with the file or the way the data is formatted. But, as mentioned above, I fixed the quotes problem. I even tried to move that data into the news_headlines file to see if something was just corrupt about my file.

But that did not work either. At one point, the system just kinda froze as if it was trying to process the data in my JSONL file, but it never completed the task.

So there was a lot of effort and hours over the weekend trying to find anything that would work on my system with only a modicum of success.

I still don't want to give up! If you have any thoughts, please let me know. Thanks for your continued attention to this matter.

Robert Pfaff

I am also uploading my pip freeze if that helps.

Thanks for sharing the screenshots, that's definitely helpful. I think you're right and the problem here is that the .tar.gz archive wasn't extracted correctly. One easy solution on Windows would be to just use something like WinZip or 7-zip, which should be able to handle .tar.gz out of the box: How to open a .tar.gz file in Windows? - Super User
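Alternatively, since `cat` isn't available on Windows, the Python standard library can do both steps — join the multi-part download (the job `cat` does on Linux) and extract the archive. A sketch, assuming the part files follow the naming from the sense2vec README:

```python
import glob
import shutil
import tarfile

def join_and_extract(parts_pattern, archive, dest="."):
    """Concatenate multi-part archive files and extract the result.

    parts_pattern: glob matching the numbered parts, in order
    archive: filename for the joined .tar.gz
    dest: directory to extract into
    """
    # Join the parts into one archive (replaces `cat part.* > whole`)
    with open(archive, "wb") as out:
        for part in sorted(glob.glob(parts_pattern)):
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)
    # Extract the .tar.gz in one step (no separate "unzip twice")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)

# Hypothetical usage, matching the filenames from the README:
# join_and_extract("s2v_reddit_2015_md.tar.gz.*", "s2v_reddit_2015_md.tar.gz")
```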

If you've extracted the archive correctly, this should give you a directory with the following contents that you can then pass to the recipe as the sense2vec vectors to use:

What are you trying to achieve by passing in your patterns file as the input text? The problem here is that the patterns file describes patterns to highlight in text, but it's not a suitable text source to annotate because... there's kinda no text. So you usually want the argument that's news_headlines.jsonl in the example to be a source of raw text to annotate – raw text of recipes, comments scraped from Reddit, and so on. You can then optionally provide the patterns via the --patterns argument to help you pre-highlight spans in the incoming text so you have to do less manual work.
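To make the difference concrete, here's a sketch of one line from each kind of file (the example recipe text is made up; the field names follow the formats described in the docs):

```python
import json

# A line from a *source* file: raw text you want to annotate
source_line = {"text": "Saute the garlic in two tablespoons of olive oil."}

# A line from a *patterns* file: a rule describing spans to pre-highlight
pattern_line = {"label": "FOOD", "pattern": [{"lower": "olive"}, {"lower": "oil"}]}

# Each JSONL line is one of these objects serialized on its own line
print(json.dumps(source_line))
print(json.dumps(pattern_line))
```

The source file goes in the positional argument (where news_headlines.jsonl is in the example), and the patterns file goes in --patterns.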

Yeah!

This seems to be working now, though it's a little different than expected or than shown in the video. I took screenshots throughout the process if you want to see them all and see what I mean.

The main differences were that 1) I had to unzip the files twice using 7-zip, and 2) the cfg file sits inside an s2v_old folder inside the s2v_reddit_2015_md folder, as shown in the path/image below.

But I am just glad it's working. I even ran this line to make doubly sure, and it works.

python -m prodigy sense2vec.teach food_terms C:/Users/rober/prod2/s2v_reddit_2015_md/s2v_old --seeds "garlic, avocado, cottage cheese, olive oil, cumin, chicken breast, beef, iceberg lettuce"

Thanks for your help.

As to your second question about substituting the food_patterns.jsonl file for the news_headlines file, I was just at a point where I wanted to see if changing JSONL files would provoke a reaction that gave me some clues to work with. An act of desperation.

But I think we're cooking with gas now, as they say.

Thanks again.