Hi all, we are a team of 4-5 members planning to annotate smaller subsets of a larger dataset on separate Prodigy instances. We would like to check whether it is feasible, and what options are available, to merge the annotation models from these subsets to create a model for the entire dataset. Any help or insights would be great.
Hi! When you annotate with Prodigy, you can specify the name of a dataset to save the annotations to. So when you annotate, you can save your annotations to a dataset like ner_project_datawizard. Once you're all done annotating, you can use the db-merge command to create one "master dataset" with all annotations and then train your model from that set.
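To make the merge step a bit more concrete, here is a minimal sketch of how the db-merge call could be scripted. It assumes that prodigy is installed and on your PATH, and that db-merge takes the input datasets as one comma-separated argument followed by the name of the output dataset. The dataset names and the two helper functions are made up for illustration.

```python
import subprocess


def build_db_merge_command(annotator_sets, merged_set):
    """Build the argument list for `prodigy db-merge` (hypothetical helper)."""
    if not annotator_sets:
        raise ValueError("need at least one dataset to merge")
    # The input datasets are passed as a single comma-separated argument.
    return ["prodigy", "db-merge", ",".join(annotator_sets), merged_set]


def run_db_merge(annotator_sets, merged_set):
    """Run the merge. Requires Prodigy to be installed; not called below."""
    subprocess.run(build_db_merge_command(annotator_sets, merged_set), check=True)


# Hypothetical per-annotator dataset names, for illustration only.
team_sets = ["ner_project_alice", "ner_project_bob", "ner_project_datawizard"]
command = build_db_merge_command(team_sets, "ner_project_master")
print(" ".join(command))
```

The script only prints the command, so you can check it before anything touches the database. Calling run_db_merge(team_sets, "ner_project_master") would then execute it.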
There are different approaches for dividing up the work, but it's often a good idea to have a little bit of overlap so you can compare the decisions and make sure everyone's following the same annotation strategy. (For instance, if you're annotating person names and one team member always includes titles like "Dr" in the entity while everyone else doesn't, you want to find out about this asap and adjust. Otherwise, your model might end up significantly worse because it has to learn from inconsistent data.)
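One possible way to divide the work with some overlap is sketched below. This is not something built into Prodigy, just an illustration: the function name, the 10% overlap and the annotator names are all assumptions. A small shared slice goes to everyone, and the rest is dealt out round-robin.

```python
import random


def split_with_overlap(examples, annotators, overlap_fraction=0.1, seed=0):
    """Give every annotator a shared slice plus a disjoint share of the rest."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    # The first chunk is annotated by everyone so decisions can be compared.
    n_shared = int(len(shuffled) * overlap_fraction)
    shared, rest = shuffled[:n_shared], shuffled[n_shared:]

    assignments = {name: list(shared) for name in annotators}
    # The remaining examples are dealt out round-robin, one annotator each.
    for i, example in enumerate(rest):
        assignments[annotators[i % len(annotators)]].append(example)
    return assignments, shared


# Hypothetical input: 100 raw texts and four annotators.
texts = [{"text": f"Example sentence {i}"} for i in range(100)]
team = ["alice", "bob", "carol", "dave"]
assignments, shared = split_with_overlap(texts, team)
for name in team:
    print(name, len(assignments[name]))
```

Each annotator's list could then be written to its own JSONL file and loaded into that person's Prodigy instance. The shared examples are the ones to compare first when checking that everyone follows the same annotation strategy.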
If you do end up with conflicting annotations, Prodigy obviously can't just solve that for you – but it can help you resolve the conflicts and create a final corrected dataset using the new review recipe and interface. I posted a little screen recording of it on Twitter a while ago.
All examples you annotate receive hashes, so Prodigy is able to tell which annotations relate to the same example. It can then show them to you in a condensed interface and ask you to have the "final word".
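To illustrate the idea of matching annotations by hash, here is a rough sketch that groups examples from several annotators by their _input_hash and flags the ones where the annotated spans differ. It is not how the review recipe is implemented, only a toy version of the same idea. The span keys follow Prodigy's NER task format, while the hash values, annotator names and helper functions are made up.

```python
from collections import defaultdict


def span_set(example):
    """Reduce an example to a comparable set of (start, end, label) spans."""
    return frozenset(
        (span["start"], span["end"], span["label"])
        for span in example.get("spans", [])
    )


def find_conflicts(datasets):
    """Map each input hash to per-annotator spans where annotators disagree."""
    by_input = defaultdict(dict)
    for annotator, examples in datasets.items():
        for example in examples:
            by_input[example["_input_hash"]][annotator] = span_set(example)
    # Keep only inputs that received more than one distinct set of spans.
    return {
        input_hash: versions
        for input_hash, versions in by_input.items()
        if len(set(versions.values())) > 1
    }


# Hypothetical annotations: the hash values are invented for illustration.
datasets = {
    "alice": [
        {"text": "Dr Jane Smith joined.", "_input_hash": 101,
         "spans": [{"start": 3, "end": 13, "label": "PERSON"}]},
        {"text": "Bob Lee left.", "_input_hash": 102,
         "spans": [{"start": 0, "end": 7, "label": "PERSON"}]},
    ],
    "bob": [
        {"text": "Dr Jane Smith joined.", "_input_hash": 101,
         "spans": [{"start": 0, "end": 13, "label": "PERSON"}]},
        {"text": "Bob Lee left.", "_input_hash": 102,
         "spans": [{"start": 0, "end": 7, "label": "PERSON"}]},
    ],
}

conflicts = find_conflicts(datasets)
for input_hash, versions in conflicts.items():
    print(input_hash, {name: sorted(spans) for name, spans in versions.items()})
```

Here the only conflict is the "Dr Jane Smith" example, where one annotator included the title and the other didn't. Those are exactly the cases you'd want to settle in the review interface.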
Thanks, appreciate the inputs. We will apply those and get back if further help is needed.