Is there a way to force recipes like `ner.correct` to use the GPU when I load a model trained with a GPU? I looked in the ner.py file and didn't see a way to set this parameter (e.g. `spacy.require_gpu(0)`).
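For context, what I have in mind is roughly this plain-spaCy pattern, just applied inside the recipe (the model name here is only a placeholder):

```python
import spacy

spacy.require_gpu(0)  # errors if no GPU is available
nlp = spacy.load("my_gpu_trained_model")  # placeholder model name
```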
Hi Kyle!
I don't think we currently allow for this, so I'd like to understand the use-case better. Is the inference too slow? If not, what's the reason you'd like to have a GPU here? If it is, could you describe the documents that you're annotating? I haven't heard of people experiencing a serious lag during `ner.correct` before.
I have a JSONL file with 350K+ lines. I meant to use a GPU for `ner.teach`. Sometimes loading the file takes a few minutes. I load this large JSONL in full so I can try to find the outliers that wouldn't be detected if I chunked the file up and only loaded one chunk of the dataset.
I figured I'd give this a spin locally. First, I generate some data.
```python
import srsly

def make_many(n=1_000_000):
    for i in range(n):
        yield {"text": f"I am Vincent and this is example #{i}."}

srsly.write_jsonl("examples.jsonl", make_many())
```
This generates a file with 1M examples on disk that's about 50 MB. The documents themselves aren't huge, but there are a lot of them. Next, I'm able to run the `ner.teach` recipe just fine without any lag.
```
> python -m prodigy ner.teach issue-6423 en_core_web_sm examples.jsonl --label PERSON
```
Within a few seconds I see the server message.
```
Using 1 label(s): PERSON
Added dataset issue-6423 to database SQLite.
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
```
And I'm able to annotate just fine. So this behavior isn't something I'm able to reproduce just yet.
It could be that you're experiencing a lag because you're dealing with much bigger documents, but I'm a bit surprised the startup takes so long on your end. The reason this recipe starts quickly is that `ner.teach` doesn't loop over all the examples immediately on startup. It merely checks the current batch under consideration.
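To make that concrete: the stream behaves roughly like a lazy generator, so only the lines needed for the next batch are parsed up front. A minimal sketch of the idea (this isn't Prodigy's actual internals):

```python
import srsly
from itertools import islice

# srsly.read_jsonl is lazy: it parses one line at a time as you iterate
stream = srsly.read_jsonl("examples.jsonl")

# Pulling the first batch only reads the first 10 lines of the file,
# no matter how many million lines follow
first_batch = list(islice(stream, 10))
```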
I'd like to understand the lag a bit better, though. Is there anything else you can share about your setup? How long are your documents? Is there anything specific about the model that you're using in `ner.teach`? Does the issue go away when you use a smaller file? You can quickly create one via:
```
head -100 examples.jsonl > subset.jsonl
```
Thanks @koaning. I had `batch_size` in the `prodigy.json` file set to 1000 and noticed an improvement after lowering it to 50. My documents are job postings that can range from a paragraph or two to several paragraphs. I also have a very long `custom_theme` and `labels` configuration that may be adding some overhead.
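For reference, the relevant part of my `prodigy.json` now looks roughly like this (other keys omitted):

```json
{
  "batch_size": 50
}
```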
Ah, yeah that would explain it.
Out of curiosity, is there a reason why you've set up a larger batch size?