Training on part of the custom annotations

Is it possible to train on only part of my custom annotations with Prodigy/spaCy?

For example, if I have a dataset where I annotated name, education, phone number and a few other things, is it possible to train a separate model for each entity type? For instance, a model that only finds education and is trained only on the education annotations.

Hi! If you've annotated all labels in one dataset, one option would be to separate them into multiple sets (e.g. one per label, or one for the label you're most interested in) by connecting to the database in Python:

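from prodigy.components.db import connect

# Connect to the database Prodigy is configured to use
db = connect()
examples = db.get_dataset("your_dataset_with_all_labels")
new_examples = []
for eg in examples:
    # Keep only the EDUCATION spans; examples without any end up with an empty list
    spans = [span for span in eg.get("spans", []) if span["label"] == "EDUCATION"]
    eg["spans"] = spans
    new_examples.append(eg)

# Create the new dataset and add the filtered examples to it
db.add_dataset("your_dataset_education")
db.add_examples(new_examples, ["your_dataset_education"])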

You now have one dataset with only the spans you labelled as EDUCATION and you'll be able to run experiments with it separately.
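If you want one set per label rather than just EDUCATION, the same idea generalizes. Here's a minimal sketch in plain Python, assuming each example is a dict with an optional "spans" list whose items have a "label" key, as in the snippet above. `split_by_span_label` is a hypothetical helper name, not something Prodigy ships:

```python
import copy


def split_by_span_label(examples):
    """Return {label: examples}, where each copy keeps only that label's spans."""
    # Collect every span label that occurs anywhere in the data
    labels = sorted(
        {span["label"] for eg in examples for span in eg.get("spans", [])}
    )
    by_label = {label: [] for label in labels}
    for eg in examples:
        for label in labels:
            # Copy so the original example is left untouched
            new_eg = copy.deepcopy(eg)
            # Examples without this label are kept with an empty span list,
            # mirroring the EDUCATION snippet above
            new_eg["spans"] = [
                span for span in eg.get("spans", []) if span["label"] == label
            ]
            by_label[label].append(new_eg)
    return by_label
```

Each list in the result could then be saved to its own dataset with `db.add_dataset` and `db.add_examples`, the same way as above.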

Yeah, that's what I ended up doing. I was just wondering whether there was a built-in function for it, seeing as there are so many other smart functions.

Thanks for replying :slight_smile:

Yeah, maybe we should expose something like it as a utility – I think the only tricky part is that there might be so many different combinations of things that a user could want. You might want to filter by span labels, or by selected options, or by top-level labels, or by relation labels, or by some combination.
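Just to illustrate what such a utility might look like, here's a rough sketch in plain Python. It's not an existing Prodigy function: `filter_example` and `filter_examples` are hypothetical names, and it assumes span labels live in "spans", selected options in "accept", the top-level label in "label" and relation labels in "relations". Adjust the keys if your data looks different:

```python
import copy


def filter_example(eg, span_labels=None, options=None,
                   top_labels=None, relation_labels=None):
    """Return a filtered copy of eg, or None if an example-level filter rejects it."""
    # Example-level filters: drop the whole example if it doesn't match
    if top_labels is not None and eg.get("label") not in top_labels:
        return None
    if options is not None and not set(eg.get("accept", [])) & set(options):
        return None
    # Item-level filters: keep the example but trim its spans / relations
    new_eg = copy.deepcopy(eg)
    if span_labels is not None:
        new_eg["spans"] = [
            s for s in new_eg.get("spans", []) if s["label"] in span_labels
        ]
    if relation_labels is not None:
        new_eg["relations"] = [
            r for r in new_eg.get("relations", []) if r["label"] in relation_labels
        ]
    return new_eg


def filter_examples(examples, **criteria):
    """Apply filter_example to every example and drop the rejected ones."""
    filtered = (filter_example(eg, **criteria) for eg in examples)
    return [eg for eg in filtered if eg is not None]
```

For instance, `filter_examples(examples, span_labels={"EDUCATION"})` would do the same as the snippet further up, and criteria can be combined. One thing this sketch doesn't handle is relations that point at spans removed by the span filter, which is the kind of combination that makes a general-purpose version tricky.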

Btw, if you know jq, there's probably a super smart and magical way to do all of this in a simple one-liner but... I don't know it well enough to give you the solution :sweat_smile:

Yeah, not sure how you should design it. It's probably easier for people to just write a Python script to fix it. :slight_smile: