ner.teach very slow

Hi,

I am facing an issue with Prodigy. I am training on a dataset containing 871,725 sentences, each containing 20-30 words on average.

I am using Python 3.6.5 (Anaconda) on a Mac. Prodigy is taking more than an hour to learn about one custom entity type when I run the ner.teach command.

Could you please help me with this?

Thank you.

Hi! Could you elaborate on this? What exactly is taking an hour? Collecting the annotations using ner.teach? Or training a model on the dataset using ner.batch-train? Or just starting the Prodigy server?

Hi, it is taking an hour when I use ner.teach to collect the annotations. When I run the command, it takes about an hour before I finally see the following at my prompt:

✨ Starting the web server at http://localhost:8080 ...

Ohhh, okay – thanks for the clarification. I think I know what the problem is.

What format does your dataset have and are the sentences all one string? When you load in the data from a text file, Prodigy will try to stream it in line by line and split each text into individual sentences. This is usually quite fast, because it can be processed as a stream – but if the first (and only) line consists of 800k sentences, Prodigy has to read it all in, process it all with spaCy and split it into sentences before you can get started. This needs a lot of memory, which your machine likely doesn’t have.

So if that’s what’s going on, try to provide your texts in a format that can be read in line-by-line, e.g. .txt or JSONL (newline-delimited JSON) with one sentence or paragraph per line.
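For reference, a stream Prodigy can read line-by-line would look something like this: one JSON object per line, with the text under the "text" key (the sentences here are just placeholders):

```
{"text": "This is the first example sentence."}
{"text": "This is the second example sentence."}
{"text": "One sentence or paragraph per line, and so on."}
```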

Hi,

Thanks for the reply. But my input is already in JSONL format, where each JSON object is one sentence.

Thank you.

Okay, that’s definitely strange! :thinking:

Could you share the exact command you ran? And if you set the environment variable PRODIGY_LOGGING=basic when you run ner.teach, is there anything in the log that looks suspicious? Does it get stuck, does it keep looping over something?

The command I am using is:

```
PRODIGY_LOGGING=basic prodigy ner.teach outboard_labels en_core_web_lg utterance.jsonl --label outboard --patterns item_type_entity/outboard/outboard.jsonl
```

and the log is:

```
Using 1 labels: outboard
19:44:42 - RECIPE: Starting recipe ner.teach
19:44:42 - LOADER: Using file extension 'jsonl' to find loader
19:44:42 - LOADER: Loading stream from jsonl
19:44:42 - LOADER: Rehashing stream
19:44:55 - RECIPE: Creating EntityRecognizer using model en_core_web_lg
19:45:19 - MODEL: Added sentence boundary detector to model pipeline
19:45:19 - MODEL: Loading match patterns from disk
19:45:19 - MODEL: Adding 3 patterns
19:45:19 - MODEL: Ensure pattern labels are added to EntityRecognizer
19:45:19 - RECIPE: Created PatternMatcher and loaded in patterns
19:45:19 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
19:45:19 - CONTROLLER: Initialising from recipe
19:45:19 - VALIDATE: Creating validator for view ID 'ner'
19:45:19 - DB: Initialising database SQLite
19:45:19 - DB: Connecting to database SQLite
19:45:19 - DB: Loading dataset 'outboard_labels' (0 examples)
19:45:19 - DB: Creating dataset '2018-06-26_19-45-19'
19:45:19 - CONTROLLER: Validating the first batch
19:45:19 - CONTROLLER: Iterating over stream
19:45:19 - PREPROCESS: Splitting sentences
19:45:19 - FILTER: Filtering duplicates from stream
19:45:19 - FILTER: Filtering out empty examples for key 'text'
19:45:19 - MODEL: Predicting spans for batch (batch size 64)
19:45:20 - MODEL: Sorting batch by entity type (batch size 32)
19:45:20 - MODEL: Predicting spans for batch (batch size 64)
19:45:20 - MODEL: Sorting batch by entity type (batch size 32)
19:45:20 - MODEL: Predicting spans for batch (batch size 64)
19:45:20 - MODEL: Sorting batch by entity type (batch size 32)
19:45:20 - MODEL: Predicting spans for batch (batch size 64)
19:45:21 - MODEL: Sorting batch by entity type (batch size 32)
19:45:21 - MODEL: Predicting spans for batch (batch size 64)
19:45:21 - MODEL: Sorting batch by entity type (batch size 32)
19:45:21 - MODEL: Predicting spans for batch (batch size 64)
19:45:21 - MODEL: Sorting batch by entity type (batch size 32)
```

Thanks for sharing!

One potential problem I see here: you’re only annotating one label, outboard, which is a new label the existing model knows nothing about yet. So in the beginning, all suggestions you see will be based on the patterns. If Prodigy can only find very few matches in your data, it might take a while to generate the first batch of 10 suggestions.

So as a solution, you could either use a more specific set of texts that will produce more matches, or adjust the patterns to make sure you can start off with enough examples. If you already have some annotations for your outboard label, you could also pre-train the model so it at least predicts something for that category, which you can then accept or reject to get over the cold start.
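If you go the patterns route, the patterns file is also just JSONL with one entry per line, and both exact string matches and token patterns should work. A quick sketch with made-up placeholder terms (swap in words that actually occur in your data):

```
{"label": "outboard", "pattern": "outboard motor"}
{"label": "outboard", "pattern": [{"lower": "outboard"}, {"lower": "engine"}]}
{"label": "outboard", "pattern": [{"lower": "trolling"}, {"lower": "motor"}]}
```

The more matches Prodigy can find in your stream, the faster it can put together the first batches of suggestions.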