I tried to reproduce your error with a whole custom-vectors setup locally, but couldn't. For completeness, I'll share all of the steps.
## Start
I started by annotating some data. This is my `examples.jsonl`:

```json
{"text": "hi my name is vincent"}
{"text": "hi my name is john"}
{"text": "hi my name is jenny"}
{"text": "hi my name is noa"}
```
I've annotated these via:

```
python -m prodigy ner.manual issue-6020 blank:en examples.jsonl --label name
```
It's a very basic dataset, but it'll do.
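If you'd rather generate such a file from Python than write it by hand, here's a minimal sketch using `srsly` (which ships with spaCy); the file name and texts match the ones above:

```python
import srsly

# Hypothetical generator for the annotation tasks shown above; each
# dict becomes one line in the .jsonl file, i.e. one task for ner.manual.
names = ["vincent", "john", "jenny", "noa"]
srsly.write_jsonl(
    "examples.jsonl",
    ({"text": f"hi my name is {name}"} for name in names),
)
```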
## Custom Vectors
I created a `text.txt` file with the following content to train embeddings.

```
this file contains some text
not a whole lot
just enough to provide a demo
```
To train some embeddings I figured I'd use floret. So I install it first:

```
python -m pip install --upgrade pip
python -m pip install floret
```
And then I train it. I used the script found here. This is `train_floret.py`:
```python
import typer
from pathlib import Path

import floret


def main(
    input_file: Path,
    output_stem: str,
    mode: str = "floret",
    model: str = "cbow",
    dim: int = 300,
    mincount: int = 10,
    minn: int = 5,
    maxn: int = 6,
    neg: int = 10,
    hashcount: int = 2,
    bucket: int = 20000,
    thread: int = 8,
):
    # Train fastText-style embeddings; mode="floret" stores subword
    # hashes in a compact table that spaCy can load directly.
    floret_model = floret.train_unsupervised(
        str(input_file.absolute()),
        model=model,
        mode=mode,
        dim=dim,
        minCount=mincount,
        minn=minn,
        maxn=maxn,
        neg=neg,
        hashCount=hashcount,
        bucket=bucket,
        thread=thread,
    )
    # Save the binary model, the plain-text vectors, and (in floret
    # mode) the .floret table that `spacy init vectors` consumes.
    floret_model.save_model(output_stem + ".bin")
    floret_model.save_vectors(output_stem + ".vec")
    if mode == "floret":
        floret_model.save_floret_vectors(output_stem + ".floret")


if __name__ == "__main__":
    typer.run(main)
```
And I trained my embeddings via:

```
python train_floret.py text.txt vectors --mincount 1
```
This generates a `vectors.floret` file locally, which I can use to bootstrap a spaCy pipeline with vectors.
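As a quick sanity check before wiring things into spaCy, you can load the binary model back in. This is a sketch that assumes floret mirrors the fastText loading API; `vectors.bin` is the file written by `save_model()` above:

```python
import floret

# Load the binary model written by train_floret.py.
model = floret.load_model("vectors.bin")
print(model.get_dimension())              # 300, matching --dim
# Subword hashing means even unseen words get a vector.
print(model.get_word_vector("demo")[:5])
```

The bootstrap step itself: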
```
python -m spacy init vectors en vectors.floret --mode floret custom_model
```

This creates a folder called `custom_model`.
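To double-check that the vectors actually ended up in the pipeline, a short sketch (the `mode` attribute on the vectors table should be available on spaCy v3.2+):

```python
import spacy

nlp = spacy.load("custom_model")
vectors = nlp.vocab.vectors
print(vectors.shape)  # rows of the floret hash table x vector dim
print(vectors.mode)   # "floret"
# With floret vectors, every token gets a (subword-based) vector.
print(nlp("hi my name is vincent")[0].has_vector)
```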
## Training Models
I will now train two models. One based off the `custom_model`, via:

```
python -m prodigy train --ner issue-6020 --base-model custom_model --training.max_steps=50 --training.eval_frequency=10
```
This is the epoch table I see at the end:
```
  E       #  LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R   SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     12.67    0.00    0.00    0.00    0.00
 10      10          0.13    105.36    0.00    0.00    0.00    0.00
 20      20          0.03      6.47    0.00    0.00    0.00    0.00
 30      30          0.00      0.00    0.00    0.00    0.00    0.00
 40      40          0.00      0.00    0.00    0.00    0.00    0.00
 50      50          0.00      0.00    0.00    0.00    0.00    0.00
```
And another one based on `en_core_web_sm`, via:

```
python -m prodigy train --ner issue-6020 --base-model en_core_web_sm --training.max_steps=50 --training.eval_frequency=10
```
This gives a different table.
```
  E       #  LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R   SPEED   SCORE
---  ------  ------------  --------  ------  ------  ------  ------  ------
  0       0          0.00      7.72    0.00    0.00    0.00    0.00    0.00
 10      10          0.00     55.98    0.00    0.00    0.00    0.00    0.00
 20      20          0.00     35.33    0.00    0.00    0.00    0.00    0.00
 30      30          0.00      1.54    0.00    0.00    0.00    0.00    0.00
 40      40          0.00      0.00    0.00    0.00    0.00    0.00    0.00
 50      50          0.00      0.00    0.00    0.00    0.00    0.00    0.00
```
## Back to your problem
Could you repeat the same exercise on your machine? You don't have to train your own floret vectors, but you'll notice that I ran both training commands with `--training.max_steps=50` and `--training.eval_frequency=10`. That allows me to see a difference. If I had only looked at steps of 100, the results might indeed have looked exactly the same.
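If you want to convince yourself that the two base models really do start from different static vectors (which is why their loss curves differ), here's a small sketch; it assumes `custom_model` sits in the working directory and `en_core_web_sm` is installed:

```python
import spacy

for name in ["custom_model", "en_core_web_sm"]:
    nlp = spacy.load(name)
    vectors = nlp.vocab.vectors
    # en_core_web_sm ships without static vectors, so its table is empty,
    # while custom_model carries the floret table trained above.
    print(f"{name}: shape={vectors.shape}, mode={vectors.mode}")
```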