💫 Ideas for Prodigy plugins 💫

Hi all!

We're crowdsourcing ideas for open-source Prodigy plugins! As a reminder, plugins are recipes that are separated out into their own packages because they require a third-party library. We've built plugins like:

  1. :page_facing_up: Prodigy PDF: Recipes that allow you to label PDFs
  2. :hugs: Prodigy HF: Recipes that allow you to interact with the Hugging Face stack
  3. :shushing_face: Prodigy Whisper: Recipes that leverage OpenAI's Whisper model for audio transcription

...and many more.

What labelling use cases do you have that would benefit from a Prodigy integration with a third-party Python library? What would be your dream Prodigy plugin?

Have been meaning to write a Prodigy integration for ZenML for a while. Would be a nice addition to our supported annotators. But that's the other way round. Not sure if that's what you were asking :thinking:


Hi,

I joined recently and am quite new to Prodigy. Great tool.

I would love to see native support for information retrieval, entity resolution, and similar tasks where we annotate pairs of records rather than classify single records.

Here is a rather simple example for entity resolution: "Beyond basic recipes with Prodigy by Explosion AI" by Kabir Khan (Medium).

An interesting integration beyond just concatenating pairs into simple texts would be candidate selection. For example, using faiss (facebookresearch/faiss on GitHub, a library for efficient similarity search and clustering of dense vectors) to cluster somewhat similar records and then draw pairs from those clusters.
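To make the candidate-selection idea concrete, here is a minimal sketch. It assumes every record already has an embedding vector, and it swaps in a brute-force cosine similarity in plain Python where faiss would do the neighbour search at scale. The toy records and vectors, the helper names `candidate_pairs` and `to_tasks`, and the way a pair is packed into a task dict are all made up for illustration, not an existing Prodigy recipe.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def candidate_pairs(vectors, k=1):
    """Pair each record with its k most similar records.

    Returns de-duplicated ((i, j), score) tuples with i < j,
    sorted so the most similar pairs come first.
    """
    best = {}
    for i, vec in enumerate(vectors):
        scored = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )
        for score, j in scored[:k]:
            best[(min(i, j), max(i, j))] = score
    return sorted(best.items(), key=lambda item: -item[1])

def to_tasks(records, pairs):
    """Turn index pairs into annotation tasks (hypothetical layout for a pair)."""
    for (i, j), score in pairs:
        yield {
            "text": f"{records[i]}\n---\n{records[j]}",
            "meta": {"ids": [i, j], "score": round(score, 3)},
        }

# Toy data: hand-written 3-d "embeddings", purely illustrative.
records = ["Explosion AI GmbH", "Explosion AI", "Facebook Research", "Meta AI Research"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 1.0, 0.3], [0.1, 0.9, 0.4]]

pairs = candidate_pairs(vectors, k=1)
tasks = list(to_tasks(records, pairs))
```

With this toy data only two of the six possible pairs reach the annotator: records 0 and 1, and records 2 and 3. That pruning is the point of candidate selection, since the number of possible pairs grows quadratically with the number of records.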

Best,
Paul


@dedupedude,

Thanks for the suggestion and the great blog post!

We have some similar-ish plug-ins like Prodigy-ann and Prodigy-lunr that allow you to query your examples to find the most relevant subset for annotation, but they don't fully satisfy the use case you're describing. I've added an issue on this for the team to discuss.

The two plug-ins are indeed an interesting starting point. Thanks for sharing.

Re prodigy-ann: you should consider switching to faiss (facebookresearch/faiss on GitHub, a library for efficient similarity search and clustering of dense vectors), built by Facebook Research, which also covers HNSW. Not sure how far you get with hnswlib, but faiss:

  • covers many more indexing techniques than just HNSW, including good old exact KNN under different metrics (which should be preferred when the number of documents is small, e.g., <10k)
  • comes with GPU support (CUDA, Linux only)
  • is probably the most established package in this domain (27.7k GitHub stars as of this writing)
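
As a rough illustration of the "exact KNN using different metrics" point, the sketch below runs an exhaustive search in plain Python under squared L2 distance and under inner product. It is a stand-in, not faiss code: in faiss the corresponding exact indexes are `IndexFlatL2` and `IndexFlatIP`. The function name `exact_knn` and the toy vectors are assumptions made for the example.

```python
def l2_squared(a, b):
    """Squared Euclidean distance (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product(a, b):
    """Inner product (larger means closer)."""
    return sum(x * y for x, y in zip(a, b))

def exact_knn(query, vectors, k, metric="l2"):
    """Exhaustive k-nearest-neighbour search; returns the indices of the k best matches."""
    if metric == "l2":
        scored = sorted((l2_squared(query, v), i) for i, v in enumerate(vectors))
    elif metric == "ip":
        scored = sorted(
            ((inner_product(query, v), i) for i, v in enumerate(vectors)),
            key=lambda item: -item[0],
        )
    else:
        raise ValueError(f"unknown metric: {metric}")
    return [i for _, i in scored[:k]]

# Toy data, purely illustrative.
vectors = [[1.0, 0.0], [3.0, 3.0], [0.9, 0.3]]
query = [1.0, 0.2]

nearest_l2 = exact_knn(query, vectors, k=2, metric="l2")
nearest_ip = exact_knn(query, vectors, k=2, metric="ip")
```

The two metrics disagree here: L2 ranks vector 2 first, while inner product ranks the long vector 1 first because unnormalised inner product rewards magnitude. For a few thousand documents an exhaustive scan like this is cheap and returns exact results, which is the reason to prefer it over an approximate index such as HNSW at that scale.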