Error - strings.json

python -m prodigy sense2vec.teach he_people smalltext.txt --seeds "Kevin Cambridge,Charles Johns,Cloudesley Shovel,Roy Graham,Terry Hiron,Rex Cowan,Bob Rogers"

The contents of smalltext.txt are:

Tearing Ledge is a rock pinnacle, south-east of the Bishop Rock lighthouse in the Western Rocks, Isles of Scilly. The wreck lies within some of the most spectacular submarine topography in the British Isles. Originally identified as the Romney, the wreck is now believed to be the Eagle. Both of these were lost in the same night on the 22nd October 1707 when Sir Cloudesley Shovell’s fleet, returning from the Mediterranean, foundered on the Western Rocks. The Eagle was a 70-gun third-rate, initially built at Portsmouth in 1679 and then rebuilt at Chatham in 1699. The Protected Wreck site (List entry 1000063) lies within the Isles of Scilly Special Area of Conservation and the Bishop to Crim Marine Conservation Zone, part of the Isles of Scilly Marine Conservation Zone. This Conservation Statement and Management Plan was produced to enable local and regional stakeholder involvement in our aspirations for the conservation management of the Tearing Ledge Protected Wreck site, so as to balance protection with economic and social needs. The principle aim of the Plan is to identify a shared vision of how the values and features of Tearing Ledge can be conserved, maintained and enhanced.

Fails with:

Traceback (most recent call last):
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\prodigy\__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\sense2vec\prodigy_recipes.py", line 54, in teach
    s2v = Sense2Vec().from_disk(vectors_path)
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\sense2vec\sense2vec.py", line 342, in from_disk
    self.vectors = Vectors().from_disk(path)
  File "spacy\vectors.pyx", line 616, in spacy.vectors.Vectors.from_disk
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 1299, in from_disk
    reader(path / key)
  File "spacy\vectors.pyx", line 609, in spacy.vectors.Vectors.from_disk.lambda8
  File "spacy\strings.pyx", line 238, in spacy.strings.StringStore.from_disk
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\srsly\_json_api.py", line 51, in read_json
    file_path = force_path(path)
  File "C:\Users\Matt\AppData\Local\Programs\Python\Python39\lib\site-packages\srsly\util.py", line 24, in force_path
    raise ValueError(f"Can't read file: {location}")
ValueError: Can't read file: smalltext.txt\strings.json

Hi! The problem here is that the sense2vec.teach recipe takes the path to sense2vec vectors as its first argument, not a text file. In the video, I'm using the medium Reddit vectors that you can download from the sense2vec repo.
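
This also explains the `smalltext.txt\strings.json` at the end of the traceback: the first argument is treated as a directory, and a `strings.json` file is looked up inside it. A minimal sketch of a check you could run before starting the recipe, assuming only what the traceback shows (that `strings.json` is expected in the directory). `looks_like_vectors_dir` is a hypothetical helper, not part of sense2vec or Prodigy:

```python
from pathlib import Path


def looks_like_vectors_dir(path):
    """Return (ok, reason) for a path passed as the vectors argument.

    Hypothetical helper: it only checks for strings.json, the one file
    the traceback shows being read. Real vector packages contain more.
    """
    p = Path(path)
    if not p.exists():
        return False, f"{p} does not exist"
    if not p.is_dir():
        return False, f"{p} is a file, but a vectors directory is expected"
    if not (p / "strings.json").is_file():
        return False, f"{p} has no strings.json, so it is probably not a vectors directory"
    return True, "ok"
```

Calling `looks_like_vectors_dir("smalltext.txt")` on the file from the question would report that a file was passed where a directory is expected.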

The goal of this workflow is to use the vectors to suggest similar words and phrases, which you can accept or reject to build up a terminology list and patterns that semi-automate annotation in the next step. So there's not really any use for input text in this workflow: the suggestions all come from the vectors.
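
To make "the suggestions all come from the vectors" a bit more concrete, here is a toy sketch of the idea: rank every entry in a vector table by its similarity to the seed terms and offer the top ones as candidates. This is only an illustration, not how sense2vec or the recipe is implemented. The keys, the 2-d vectors, and the `suggest` function are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest(vectors, seeds, n=3):
    """Rank non-seed entries by their best similarity to any seed."""
    # Seeds that are missing from the table contribute nothing.
    seed_vecs = [vectors[s] for s in seeds if s in vectors]
    if not seed_vecs:
        return []
    scored = [
        (max(cosine(vec, sv) for sv in seed_vecs), key)
        for key, vec in vectors.items()
        if key not in seeds
    ]
    return [key for score, key in sorted(scored, reverse=True)[:n]]


# Made-up 2-d vectors purely for illustration; real vectors come from
# a trained vectors package and have many more dimensions.
toy_vectors = {
    "diver|NOUN": [0.9, 0.1],
    "salvor|NOUN": [0.8, 0.2],
    "admiral|NOUN": [0.7, 0.4],
    "lighthouse|NOUN": [0.1, 0.9],
}

print(suggest(toy_vectors, seeds=["diver|NOUN"], n=2))
```

Note that nothing in this loop reads a text file. The only inputs are the vector table and the seeds, which is why a seed that isn't in the table can't produce any suggestions here.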

If you just want to annotate the text in your .txt file with a given list of labels, you can use a recipe like ner.manual instead, which I also show in the next step of the video.
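
If the raw text comes in long blocks, it can help to split it into shorter pieces before annotating. A rough sketch, assuming paragraphs are separated by blank lines: it produces JSONL with one `{"text": ...}` object per line, which is a format Prodigy can load. `text_to_jsonl` is a hypothetical helper, not something that ships with Prodigy.

```python
import json
import re


def text_to_jsonl(raw_text):
    """Split raw text on blank lines and return one JSON task per line."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw_text)]
    lines = [
        # Collapse internal line breaks so each task is a single line of text.
        json.dumps({"text": " ".join(p.split())}, ensure_ascii=False)
        for p in paragraphs
        if p
    ]
    return "\n".join(lines)
```

The returned string can be written to a `.jsonl` file and passed to the recipe as the source in place of the `.txt` file.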

Thanks. Ultimately I want to create a model from scratch and use that, as my text will have so many domain-specific terms. I have next week off work to dedicate to this and just wanted to get a handle on the basics before Monday.

By the end of the week, I just want a model that I can load into spaCy to try to auto-annotate/tag a few hundred PDFs.

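import pandas as pd
import spacy

# Raw string: in a normal string, the "\n" in this Windows path is read as a newline.
df = pd.read_pickle(r"c:\nhle_desc.pck")

nlp = spacy.load("en_core_web_sm")

subjectList = {}

# Note: `doc` and `subjectEntry` are not defined anywhere in this snippet as posted.
for entity in doc.ents:
    subjectList.update({subjectEntry: entity.text})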