sense2vec 02_preprocessing.py script problem

Hi,
I am trying to train a new sense2vec model following the instructions, but I seem to have a problem:
when running the 02 script, if I use the whole folder of .spacy files created by script 01 as input, I get an error.
If, on the other hand, I specify a single file -- for example "corpus-2.spacy" -- it does work.
Can I not somehow specify the directory so that it will run the 02 script for all the .spacy files in the directory?

thanks

Giulia

Hi!

For future reference - a better place for this type of question would be the spaCy discussion forum, as this question is not directly related to Prodigy.

It does look like this 02 script was written to process one single .spacy file at a time. This might not be 100% clear from the documentation. The readme gives this description:

Load a collection of parsed Doc objects produced in the previous step (...)

which is technically correct because one .spacy file can hold multiple Doc objects.

Also note that the readme mentions

Processing scripts are designed to operate on single files, making it easy to parallellize the work.

So in conclusion, the best solution is to parallelize the processing of the different files outside of this 02 script, if that makes sense to you :slight_smile:
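
In case it helps, here is a minimal sketch of what "outside of this 02 script" could look like: a small Python wrapper that finds every .spacy file in a folder and launches the 02 script once per file, a few at a time. The helper names (`build_commands`, `run_all`) are made up for this example. The argument order passed to the script (input file, then output directory, then any extra arguments such as a model name) is an assumption, so please check it against the script's own `--help` output before relying on it.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_commands(input_dir, output_dir, script, extra_args=()):
    """Build one command line per .spacy file found in input_dir.

    Assumption: the script takes the input file first and the output
    directory second, followed by any extra arguments.
    """
    files = sorted(Path(input_dir).glob("*.spacy"))
    return [
        [sys.executable, str(script), str(path), str(output_dir), *extra_args]
        for path in files
    ]


def run_all(commands, max_workers=4):
    """Run the commands concurrently and return their exit codes in order."""

    def run(cmd):
        # Each command is a separate Python process, so a thread pool is
        # enough here: the threads only wait for the subprocesses to finish.
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

You would call `build_commands` with your own folder of .spacy files, an output folder, and the path to the 02 script, then pass the result to `run_all`. A non-zero exit code in the returned list tells you which file failed, so you can rerun just that one.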