Force GPU when annotating models trained with GPU

Is there a way to force the recipes like ner.correct to use GPU when I load a model trained with a GPU? I looked in the ner.py file and did not see a way to set this parameter (e.g. spacy.require_gpu(0)).

Hi Kyle!

I don't think we currently allow for this, so I'd like to understand the use case better. Is the inference too slow? If so, could you describe the documents that you're annotating? If not, what's the reason you'd like to have a GPU here? I haven't heard of people experiencing serious lag with ner.correct before.

I have a JSONL file with 350K+ lines, and I meant to use the GPU for ner.teach. Sometimes loading the file takes a few minutes. I load the large JSONL in one go so I can try to find outliers that wouldn't be detected if I chunked the file up and only loaded one chunk of the dataset.

I figured I'd give this a spin locally. First, I generate some data.

import srsly


def make_many(n=1_000_000):
    # Yield n small dummy examples in the JSONL format Prodigy expects.
    for i in range(n):
        yield {"text": f"I am Vincent and this is example #{i}."}


srsly.write_jsonl("examples.jsonl", make_many())

This generates a file with 1M examples on disk that's about 50 MB. The documents themselves aren't huge, but there are a lot of them. Next, I am able to run the ner.teach recipe just fine without any lag.

> python -m prodigy ner.teach issue-6423 en_core_web_sm examples.jsonl --label PERSON

Within a few seconds I see the server message.

Using 1 label(s): PERSON
Added dataset issue-6423 to database SQLite.

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

And I'm able to annotate just fine. So it doesn't seem like this is behavior I can reproduce just yet.

It could be that you're experiencing a lag because you're dealing with much bigger documents, but I'm a bit surprised the startup takes so long on your end. The reason this recipe starts quickly is that ner.teach doesn't loop over all the examples on startup; it only reads the current batch under consideration.
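Roughly speaking, the streaming behaves like this little sketch. I'm using srsly directly here just to illustrate the lazy reading, not the recipe's exact internals:

import itertools

import srsly

# srsly.read_jsonl returns a generator, so the file isn't parsed into
# memory up front; lines are only read as they're consumed.
stream = srsly.read_jsonl("examples.jsonl")

# Pulling one batch only touches the first few lines of the file,
# no matter how many million lines come after them.
first_batch = list(itertools.islice(stream, 10))
print(first_batch[0])

That's why, on my end, the startup cost stays small even for a 1M-line file.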

I'd like to understand the lag a bit better, though. Is there anything else you can share about your setup? How long are your documents? Is there anything specific about the model that you're using in ner.teach? Does the issue go away when you use a smaller file? You can quickly create one via:

head -100 examples.jsonl > subset.jsonl

Thanks @koaning. I had batch_size in the prodigy.json file set to 1000 and noticed an improvement when setting it to 50. My documents are job postings that can range from a paragraph or two to several paragraphs. I also have a very long custom_theme and a long list of labels that may be adding some overhead.
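For reference, the change was just this one setting in prodigy.json (shown as a minimal snippet; the rest of my config, including the custom_theme and labels, is omitted):

{
  "batch_size": 50
}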


Ah, yeah that would explain it.

Out of curiosity, is there a reason why you've set up a larger batch size?