Prediction model using prodigy trained model runs very slow

I am trying to build a label prediction model that reads text and extracts the labels I annotated in Prodigy. To do that, I created an annotated dataset in Prodigy, trained a model, and used the best model from Prodigy together with a spaCy model and "roberta-large" to predict the labels from the texts in my df_text DataFrame. However, the model runs really slowly, and I am not sure whether the problem is with the trained Prodigy model. I would appreciate it if you could let me know whether I am using the trained Prodigy model correctly. Also, the accuracy of the best model trained in Prodigy was 0.79. Please see the code below. Thanks in advance.

def load_jsonl(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data.append(json.loads(line))
    return data

def annotate_text(row, patterns):
    text = row['text'].lower()
    annotations = {
        'sedan': [],
        'truck': [],
        'suv': [],
        'crossover': []
    }

    for entry in patterns:
        try:
            pattern_str = ' '.join([re.escape(p['lower']) for p in entry['pattern']])
            if re.search(r'\b' + pattern_str + r'\b', text):
                annotations[entry['label']].append(pattern_str)
        except re.error as e:
            print(f"Regex error with pattern: {pattern_str}")
            print(f"Error: {e}")

    
    row['sedan_label'] = ', '.join(annotations['sedan'])
    row['truck_label'] = ', '.join(annotations['truck'])
    row['suv_label'] = ', '.join(annotations['suv'])
    row['crossover_label'] = ', '.join(annotations['crossover'])
    return row


prodigy_model_path = './model-best' 
ner_model = spacy.load(prodigy_model_path)


spacy_model = spacy.load('en_core_web_lg')


def extract_gpe(text):
    doc = spacy_model(text)
    return ", ".join([ent.text for ent in doc.ents if ent.label_ == "GPE"])



texts_input_df = df_text.apply(lambda row: annotate_text(row, load_jsonl('./Total.jsonl')), axis=1)


print("Columns in DataFrame:", texts_input_df.columns)


required_columns = ['sedan_label', 'truck_label', 'suv_label', 'crossover_label']
missing_columns = [col for col in required_columns if col not in texts_input_df.columns]
if missing_columns:
    print("Missing columns:", missing_columns)
    
    for col in missing_columns:
        texts_input_df[col] = ""    
    

texts_input_df['geopolitical_label'] = texts_input_df['text'].apply(extract_gpe)


def prepare_dataset(df):
    def label_encoder(label_dict):
        return [
            int(bool(label_dict['sedan'])),
            int(bool(label_dict['truck'])),
            int(bool(label_dict['suv'])),
            int(bool(label_dict['crossover']))
        ]
    df['encoded_labels'] = df.apply(lambda row: label_encoder({
        'sedan': row.get('sedan_label', ''),
        'truck': row.get('truck_label', ''),
        'suv': row.get('suv_label', ''),
        'crossover': row.get('crossover_label', '')
    }), axis=1)
    return Dataset.from_pandas(df[['text', 'encoded_labels']])




train_df, eval_df = train_test_split(texts_input_df, test_size=0.1)
train_dataset = prepare_dataset(train_df)
eval_dataset = prepare_dataset(eval_df)

model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9, label2id={
    'O': 0, 'B-sedan': 1, 'I-sedan': 2, 'B-truck': 3, 'I-truck': 4, 
    'B-suv': 5, 'I-suv': 6, 'B-crossover': 7, 
    'I-crossover': 8
})

id2label = {id: label for label, id in model.config.label2id.items()}

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['text'], truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    labels = []
    for i, doc_labels in enumerate(examples['encoded_labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = [-100 if word_id is None else doc_labels[word_id] if word_id < len(doc_labels) else -100 for word_id in word_ids]
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs

train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
eval_dataset = eval_dataset.map(tokenize_and_align_labels, batched=True)

metric = load_metric("seqeval")
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_predictions = [[id2label[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
    true_labels = [[id2label[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
    return metric.compute(predictions=true_predictions, references=true_labels)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    logging_dir="./logs",
    logging_steps=50,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
model.save_pretrained('./saved_model')

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

def update_df_with_predictions(text):
    predictions = ner_pipeline(text)
    geopolitical_label = extract_gpe(text)
    result = {
        'sedan': ", ".join([pred['word'] for pred in predictions if pred['entity_group'] == 'sedan']),
        'truck': ", ".join([pred['word'] for pred in predictions if pred['entity_group'] == 'truck']),
        'suv': ", ".join([pred['word'] for pred in predictions if pred['entity_group'] == 'suv']),
        'crossover': ", ".join([pred['word'] for pred in predictions if pred['entity_group'] == 'crossover']),
        'geopolitical': geopolitical_label
    }
    return result

texts_input_df.update(texts_input_df['text'].apply(update_df_with_predictions))

Hi @AmirNickkar ,

It's unclear to me, based on the provided script, how you actually use the Prodigy model for predictions. It looks like you use roberta-large via the Hugging Face transformers pipeline (although it is not imported or defined anywhere in the script) to generate predictions, and then you use the pre-trained spaCy pipeline to get GPE labels.
I don't think your current script even uses your Prodigy ner_model at all!

When you say that the model "runs really slow", I think you are actually referring to roberta-large and spaCy's en_core_web_lg, which are used in sequence. That means two consecutive forward passes across your entire dataset, which can get slow for sure.
Also, your prediction function (update_df_with_predictions) calls the NER pipeline once per row. For faster inference you should process texts in batches instead of one by one.
If you implemented your pipeline as a spaCy pipeline, you could use spaCy's pipe batch processing.
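As a rough illustration of the batching point, here is a minimal sketch of streaming texts through nlp.pipe and collecting the entities in a plain list. To keep it runnable without a trained model, it assumes a blank English pipeline with a rule-based entity ruler as a stand-in for spacy.load('./model-best'); the patterns, texts, and the extract_entities helper are hypothetical.

```python
import spacy

# Assumption: a blank pipeline with an entity ruler stands in for the
# trained pipeline you would normally get from spacy.load('./model-best').
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "sedan", "pattern": [{"LOWER": "civic"}]},
    {"label": "truck", "pattern": [{"LOWER": "tacoma"}]},
])

# Hypothetical input texts.
texts = ["I drive a Civic.", "The Tacoma is a truck.", "No vehicles here."]


def extract_entities(texts, batch_size=64):
    """Run the pipeline over all texts in batches and return flat rows.

    Each row is (text_index, label, entity_text). Rows are collected in a
    plain list so that a DataFrame, if needed, is built only once at the end.
    """
    rows = []
    for i, doc in enumerate(nlp.pipe(texts, batch_size=batch_size)):
        for ent in doc.ents:
            rows.append((i, ent.label_, ent.text))
    return rows


rows = extract_entities(texts)
print(rows)
```

The same loop should work unchanged with a trained pipeline; the best batch_size depends on text length and hardware, so it is worth trying a few values.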

Unrelated to prediction, you might also address some inefficiencies in the data processing, e.g.:

  • you could precompile the regex patterns outside the loop to avoid recompiling them for each row.
  • apply may get slow if the dataset is really big - in that case you might consider using multiprocessing

As for the training, you might try using mixed precision and larger batch sizes for training efficiency.
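To make the precompilation point concrete, here is a minimal sketch that builds every regex once, up front, and then reuses the compiled objects for each text. The two patterns are hypothetical examples in the same token-pattern style as the script above; note that the original script also re-read Total.jsonl inside the apply call for every row, which the same "do it once" approach avoids.

```python
import re

# Hypothetical patterns in the token-pattern style used in the script above.
patterns = [
    {"label": "sedan", "pattern": [{"lower": "honda"}, {"lower": "civic"}]},
    {"label": "truck", "pattern": [{"lower": "tacoma"}]},
]


def compile_patterns(patterns):
    """Compile each token pattern into a single regex, exactly once."""
    compiled = []
    for entry in patterns:
        # Escape each token and allow any whitespace between tokens.
        phrase = r"\s+".join(re.escape(tok["lower"]) for tok in entry["pattern"])
        compiled.append((entry["label"], re.compile(r"\b" + phrase + r"\b")))
    return compiled


# Built once at import time, not once per row.
COMPILED = compile_patterns(patterns)


def annotate(text, compiled=COMPILED):
    """Return {label: [matched strings]} for one text."""
    found = {}
    lowered = text.lower()
    for label, regex in compiled:
        for match in regex.finditer(lowered):
            found.setdefault(label, []).append(match.group(0))
    return found


result = annotate("She traded her Honda Civic for a Tacoma.")
print(result)
```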

It's still unclear to me what's the role of Prodigy trained model in your workflow, but I think you might want to:

  1. bootstrap your NER annotation with patterns and create a manual NER dataset in Prodigy
  2. use that dataset to train the NER component of a roberta-large based pipeline
  3. compile a pipeline with two NER components one for GPE and the other for your custom labels.

If that's the case, Prodigy can save a lot of scripting for you! The recommended workflow would be:

  1. Create NER dataset using patterns with ner.manual: docs
  2. Use this dataset to train a roberta-large based pipeline (Prodigy will take care of aligning the tokenization) - see the docs on efficient annotation for transformers
  3. Compile a spaCy pipeline with two NER components: one for GPE, sourced from en_core_web_lg, and the custom one for the other labels - see the spaCy tutorial on the topic.

Let me know, of course, if I somehow misread your intentions and we'll take it from there :slight_smile:

@magdaaniol, thank you so much for your answer. It was very helpful and gave me a better understanding of how to use a Prodigy model when building a label prediction model. Let me clarify what I want to do: I want to create a label prediction model that can pick up the labels from a text based on semantic analysis. For example, if I annotated "BMW company" with the label auto in a text with Prodigy, the model should be able to pick up "BMW co." as well. It should also learn that not every "golf" in a text should be picked up as a labeled entity for vehicle models. To do so, I first annotated sample texts using Prodigy:

python -m prodigy ner.manual auto_trained_text blank:en .\prodigy_auto_training_text.jsonl --label sedan_label,suv_label,truck_label,crossover_label --patterns .\auto_annotations.jsonl

After completing that, I trained the model using the command below, and the reported accuracy was very good:

python -m prodigy train --ner auto_trained_text en_core_web_lg --eval-split 0.2

Then I picked the best model to use in the label prediction model. Based on your previous comments, I fundamentally modified the code and also read the docs and instructions. However, I still have difficulties in two areas:
1- What do you mean by "bootstrap your NER annotation with patterns and create a manual NER dataset in prodigy"? Currently, the code does bootstrap NER annotation, using precompiled regex patterns to identify and label text data automatically. Are you suggesting integrating with Prodigy to manually refine or expand this dataset?
2- The code uses a spaCy model for extracting geopolitical entities and a separate NER model trained with the roberta-large transformer for the other four labels. You asked me to put them into a single pipeline, while in Prodigy I did not annotate geopolitical entities. How is that possible? I am a little confused.

Here is the revised code. Also, please let me know if there are videos or documents about Prodigy on this topic other than the ones you mentioned. Thank you once again.


def load_jsonl(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data.append(json.loads(line))
    return data


def compile_patterns(patterns):
    compiled_patterns = []
    for entry in patterns:
        try:
            pattern_str = ' '.join([re.escape(p['lower']) for p in entry['pattern']])
            regex = re.compile(r'\b' + pattern_str + r'\b')
            compiled_patterns.append((entry['label'], regex))
        except re.error as e:
            print(f"Regex error with pattern: {pattern_str}")
            print(f"Error: {e}")
    return compiled_patterns


def annotate_text(row, compiled_patterns):
    text = row['text'].lower()
    annotations = {
        'sedan': [],
        'truck': [],
        'suv': [],
        'crossover': []
    }
    for label, regex in compiled_patterns:
        matches = regex.findall(text)
        if matches:
            annotations[label].extend(matches)
    for label in annotations:
        row[label + '_label'] = ', '.join(annotations[label])
    return row


def parallel_apply(df_texts, func, args):
    workers = multiprocessing.cpu_count()
    with multiprocessing.Pool(workers) as pool:
        result = pool.starmap(func, [(row, args) for index, row in df_texts.iterrows()])
    return pd.DataFrame(result)


prodigy_model_path = '.../model-best'
ner_model = spacy.load(prodigy_model_path)
spacy_model = spacy.load('en_core_web_lg')


def process_texts_in_batches(texts):
    ner_results = list(ner_model.pipe(texts))
    gpe_results = list(spacy_model.pipe(texts))
    batch_results = []
    for ner_doc, gpe_doc in zip(ner_results, gpe_results):
        entities = {
            'sedan': [],
            'truck': [],
            'suv': [],
            'crossover': [],
            'geopolitical': []
        }
        for ent in ner_doc.ents:
            if ent.label_ in entities:
                entities[ent.label_].append(ent.text)
        entities['geopolitical'] = [ent.text for ent in gpe_doc.ents if ent.label_ == "GPE"]
        batch_results.append(entities)
    return batch_results


patterns = load_jsonl('.../auto_annotations.jsonl')
compiled_patterns = compile_patterns(patterns)

df_texts = parallel_apply(df_texts, annotate_text, compiled_patterns)
batched_entities = process_texts_in_batches(df_texts['text'].tolist())
for i, entities in enumerate(batched_entities):
    for key in entities:
        df_texts.loc[i, key + '_label'] = ', '.join(entities[key])


def prepare_dataset(df_texts):
    def label_encoder(label_dict):
        return [int(bool(label_dict[label])) for label in ['sedan', 'truck', 'suv', 'crossover']]
    df_texts['encoded_labels'] = df_texts.apply(lambda row: label_encoder(row), axis=1)
    return Dataset.from_pandas(df_texts[['text', 'encoded_labels']])


train_df, eval_df = train_test_split(df_texts, test_size=0.1)
train_dataset = prepare_dataset(train_df)
eval_dataset = prepare_dataset(eval_df)


model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9, label2id={
    'O': 0, 'B-sedan': 1, 'I-sedan': 2, 'B-truck': 3, 'I-truck': 4,
    'B-suv': 5, 'I-suv': 6, 'B-crossover': 7,
    'I-crossover': 8
})


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['text'], truncation=True, padding="max_length", max_length=512)
    labels = []
    for i, doc_labels in enumerate(examples['encoded_labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = [-100 if word_id is None else doc_labels[word_id] if word_id < len(doc_labels) else -100 for word_id in word_ids]
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs


train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
eval_dataset = eval_dataset.map(tokenize_and_align_labels, batched=True)


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    logging_dir="./logs",
    logging_steps=50,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=lambda p: load_metric("seqeval").compute(predictions=np.argmax(p.predictions, axis=2), references=p.label_ids)
)

trainer.train()
model.save_pretrained('.../saved_model')

Hi @AmirNickkar ,

Thanks for providing more context and the improved script. I think it will be easier to maintain/test now and hopefully you can see some performance improvements as well.
In general your solution is definitely aligned with your goals. My recommended workflow was meant to point you to the relevant Prodigy and spaCy functionalities so that you don't have to implement them yourself (like taking care of aligning Prodigy labels to transformer labels) in case it's more convenient.

With regard to your additional questions:

what do you mean by "bootstrap your NER annotation with patterns and create a manual NER dataset in prodigy".

I just meant implementing your regexes as Prodigy patterns and using them directly with the ner.manual recipe, like you did with auto_annotations.jsonl, so that you have a chance to correct the matches while you do the manual annotation. It of course depends on the patterns; sometimes it might be more practical to stay with regex.

You asked me to put them into a single pipeline while in prodigy I did not annotate geopolitical entities, how is it possible? I am confused a little.

It's actually a very common pattern to have different NER components that specialize in different subsets of labels. This is why spaCy allows multiple NER components in one pipeline. My comment about two separate NERs was meant for production, not for development in Prodigy. So once you are happy with the performance of the GPE NER component and of your custom NER model, which can be developed and evaluated independently, you can put them together in your production spaCy pipeline as described in the tutorial I shared previously. This is just a suggestion of course.

Hi @magdaaniol

Thank you for your answer. I removed the GPE label from the model and also switched from the transformer model (roberta-large) to a spaCy model for tokenizing to speed up the process. I am still struggling with the speed of the model. I thought that combining spaCy with the Prodigy-trained model (as Prodigy is basically spaCy-based) would speed up the process; however, it did not help that much. I am now convinced that I am doing something wrong somewhere, so I tried to step back and make everything as simple as possible. First, I created another NER dataset using patterns with ner.manual, only for the custom labels and on smaller texts, and got a higher-quality trained model. Then I built a model using the code below on a DataFrame with 2000 texts, and I am trying to apply the trained model to a DataFrame with the same structure but 500k rows. The problem is that both processes are still slow, and I am still confused. I have three questions now.

1- Can I use only prodigy-trained model in pipeline without using en_core_web_lg spacy model to create a label prediction model?

def load_jsonl(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data.append(json.loads(line))
    return data

start_time = time.time()
patterns = load_jsonl('./labels.jsonl')
compiled_patterns = [{'pattern': re.compile(r'\b' + ' '.join([p['lower'] for p in entry['pattern']]) + r'\b'), 'label': entry['label']} for entry in tqdm(patterns, desc="Compiling patterns")]
end_time = time.time()
print(f"Load and compile patterns: {end_time - start_time:.2f} seconds")

def annotate_text(row):
    text = row['text'].lower()
    annotations = {'sedan': [], 'suv': [], 'truck': [], 'crossover': []}
    for entry in compiled_patterns:
        if entry['pattern'].search(text):
            annotations[entry['label']].append(entry['pattern'].pattern.strip(r'\b'))
    row['sedan_label'] = ', '.join(annotations['sedan'])
    row['suv_label'] = ', '.join(annotations['suv'])
    row['truck_label'] = ', '.join(annotations['truck'])
    row['crossover_label'] = ', '.join(annotations['crossover'])
    return row

def parallel_annotate_text(df):
    with ThreadPoolExecutor() as executor:
        return pd.DataFrame(list(tqdm(executor.map(annotate_text, [row for index, row in df.iterrows()]), total=len(df), desc="Annotating text")))

start_time = time.time()
texts_input_df = parallel_annotate_text(df_texts)
end_time = time.time()
print(f"Parallel text annotation: {end_time - start_time:.2f} seconds")

start_time = time.time()
train_df, eval_df = train_test_split(texts_input_df, test_size=0.1)
end_time = time.time()
print(f"Dataset split: {end_time - start_time:.2f} seconds")

start_time = time.time()
prodigy_model = './model-best'
nlp = spacy.load(prodigy_model)
nlp.tokenizer = Tokenizer(nlp.vocab)
nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner'])
end_time = time.time()
print(f"Custom Prodigy model loading and configuration: {end_time - start_time:.2f} seconds")

def process_text(text):
    doc = nlp(text)
    return doc

def update_df_with_predictions(doc, record_id):
    sedan_df = pd.DataFrame(
        {'record_id': record_id, 'entity': [ent.text for ent in doc.ents if ent.label_ == "sedan"]}
    )
    suv_df = pd.DataFrame(
        {'record_id': record_id, 'entity': [ent.text for ent in doc.ents if ent.label_ == "suv"]}
    )
    truck_df = pd.DataFrame(
        {'record_id': record_id, 'entity': [ent.text for ent in doc.ents if ent.label_ == "truck"]}
    )
    crossover_df = pd.DataFrame(
        {'record_id': record_id, 'entity': [ent.text for ent in doc.ents if ent.label_ == "crossover"]}
    )
    return sedan_df, suv_df, truck_df, crossover_df

start_time = time.time()
texts = [row['text'] for index, row in texts_input_df.iterrows()]
docs = list(tqdm(nlp.pipe(texts, batch_size=1000), total=len(texts), desc="Processing texts"))

with Pool() as pool:
    results = list(tqdm(pool.starmap(update_df_with_predictions, [(doc, texts_input_df.loc[i, 'text_id']) for i, doc in enumerate(docs)]), total=len(docs), desc="Updating dataframes"))
    sedan_dfs, suv_dfs, truck_dfs, crossover_dfs = zip(*results)

sedan_output_df = pd.concat(sedan_dfs, ignore_index=True)
suv_output_df = pd.concat(suv_dfs, ignore_index=True)
truck_output_df = pd.concat(truck_dfs, ignore_index=True)
crossover_output_df = pd.concat(crossover_dfs, ignore_index=True)
end_time = time.time()
print(f"Dataframe update with predictions: {end_time - start_time:.2f} seconds")

def evaluate_model(eval_df):
    true_labels = []
    pred_labels = []

    for _, row in eval_df.iterrows():
        doc = nlp(row['text'])
        for ent in doc.ents:
            true_labels.append(row[f"{ent.label_}_label"])
            pred_labels.append(ent.text)

    print("Model Performance Metrics:")
    print(classification_report(true_labels, pred_labels))

# Evaluate the model
evaluate_model(eval_df)

2- If not, how can I optimally combine the Prodigy-trained model with the en_core_web_lg spaCy model? Do you think this can help?

def combine_entity_results(doc_prodigy, doc_spacy):
    combined_entities = list(doc_prodigy.ents)  # starting with entities from the prodigy model
    # using a set for efficient lookup of existing entity texts
    prodigy_entity_texts = set((ent.text, ent.label) for ent in doc_prodigy.ents)
    # adding entities from the general spacy model if they don't overlap with those from prodigy_trained one
    for ent in doc_spacy.ents:
        if (ent.text, ent.label) not in prodigy_entity_texts:
            combined_entities.append(ent)
    return combined_entities

3- Since I am using prodigy here, what would be the best format of ensemble model in output, that will be used as a label prediction model?

Thank you once again.

Hi @AmirNickkar
There seem to be a couple of misunderstandings in how this code is being used:

  1. Model Creation: The code you shared doesn't actually create or train a model - it only loads and uses a pre-trained model from './model-best'. The model creation happened earlier, when you annotated with Prodigy's ner.manual and then ran the train command. I assume model creation does not present performance issues?
  2. Duplicate Processing: Your code is doing two separate types of entity recognition:
  • Pattern matching using regex (the annotate_text function)
  • Model inference using the Prodigy model (the nlp.pipe processing)

The regex matches are only used as ground truth in the evaluation function. This means you're running both regex and model inference on all texts, which is likely contributing to your performance issues. Another likely bottleneck is the number of dataframe operations. You should be able to confirm that with your timing measurements or by profiling your code.
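To check where the time actually goes, one option is to time each stage separately and then profile the slowest one. The sketch below uses only the standard library; the two stage functions are hypothetical placeholders for the regex annotation and the model inference, not your actual code.

```python
import cProfile
import io
import pstats
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock time per named stage in the timings dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


# Hypothetical placeholders for the two stages discussed above.
def regex_stage(texts):
    return [t.lower().count("civic") for t in texts]


def model_stage(texts):
    return [len(t.split()) for t in texts]


def profile(func, *args, top=5):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


texts = ["I drive a Civic."] * 1000

with timed("regex"):
    regex_stage(texts)
with timed("model"):
    model_stage(texts)

# Profile whichever stage took longest to see which calls dominate.
_, report = profile(model_stage, texts)
print(timings)
print(report)
```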
To answer your questions:

1- Can I use only prodigy-trained model in pipeline without using en_core_web_lg spacy model to create a label prediction model?

Yes, you can use only the Prodigy-trained model without loading en_core_web_lg. Looking at your script, you're already doing this correctly:

prodigy_model = './model-best'
nlp = spacy.load(prodigy_model)
nlp.tokenizer = Tokenizer(nlp.vocab)
nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner'])

For your use case where you're only interested in custom NER labels, using just the Prodigy model would be more efficient. The en_core_web_lg model is much larger and includes many components you don't need.

2- If not, how can I incorporate prodigy-trained model with en_core_web_lg spacy model optimally. Do you think this can help?

As I mentioned above, you can use only your custom pipeline. Otherwise, the best way to combine predictions would be to build a spaCy pipeline with two NER components, as explained in the tutorial I shared previously. A pretrained spaCy NER won't help much with your custom labels. Combining them would only make sense if you also wanted to predict some of the labels it was trained on, such as GPE.

3- Since I am using prodigy here, what would be the best format of ensemble model in output, that will be used as a label prediction model?

If you want to use the predictions of two models in Prodigy, you should add a function to your custom recipe that adds the annotations from these models to the Prodigy task dictionary as Prodigy spans. You can see the exact format in the ner_manual interface documentation. You can add a source field to each span dictionary to record the source model. Note that you'll need to reconcile overlapping labels. One way to do that would be to create two separate datasets, each annotated by a single model (you can use Prodigy's model-as-annotator recipes for this), and then reconcile them using the review recipe. To learn more about this pattern for data development, see the Prodigy documentation on Review.
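A minimal sketch of the span-merging idea, in plain Python: each model's entities become span dictionaries with character offsets, a label, and a source field, and overlapping spans are resolved by a simple priority rule. The helper names, the example offsets, and the "custom model wins" rule are assumptions for illustration, not Prodigy behavior; the ner_manual interface also expects token information, which is omitted here.

```python
def to_spans(entities, source):
    """Convert (start_char, end_char, label) tuples into span dicts."""
    return [
        {"start": start, "end": end, "label": label, "source": source}
        for start, end, label in entities
    ]


def merge_spans(primary, secondary):
    """Keep all primary spans; add secondary spans that do not overlap any kept span."""
    merged = list(primary)
    for span in secondary:
        overlaps = any(
            span["start"] < kept["end"] and kept["start"] < span["end"]
            for kept in merged
        )
        if not overlaps:
            merged.append(span)
    return sorted(merged, key=lambda s: s["start"])


text = "The Honda Civic was sold in Toronto."

# Hypothetical predictions from the two models, as character offsets.
custom = to_spans([(4, 15, "sedan")], source="custom_ner")
pretrained = to_spans([(4, 9, "ORG"), (28, 35, "GPE")], source="en_core_web_lg")

# Assumed priority rule: the custom model wins where spans overlap.
task = {"text": text, "spans": merge_spans(custom, pretrained)}
print(task)
```

Here the ORG span is dropped because it overlaps the custom sedan span, while the GPE span is kept. With the review-recipe approach described above, that decision would instead be made by a human annotator.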

On a more general note, I think it would help if you separated data development from training and evaluation, both conceptually and in your code. It would then be easier to see which part is causing performance issues and to debug more precisely.
Also, manually reviewing the annotations from your model or patterns can help you build a better mental model of your dataset and give you ideas about why the model might be making mistakes.
