Hi guys. I’ve noticed that Prodigy doesn’t really use its SQL-based backends as “structured” storage, but rather encodes all of its data as blobs, which can then only be accessed through the library. I am not sure if this was a deliberate design decision, but I would say it’s definitely a pretty limiting one.
The main drawback is that maintaining data becomes really hard, since it can only be done programmatically via the library. I can’t simply join labels created in Prodigy to other data in my DB, let alone view labels that users have created and allow them to be changed manually through some other interface (like a postgres browser UI).
I would sincerely request that you consider changing this in future versions, since it would greatly improve the usability of Prodigy and make data management a lot easier.
(If there is some obvious workaround that I am missing of course, please let me know and disregard what I wrote here.)
This is a good question! Ultimately, Prodigy is pretty agnostic to what you pass around in the task dictionary. There are some conventions, like the key "label" or "spans" in some built-in recipes and interfaces, but most of it is up to you. In the default database model we ship with Prodigy, we tried to make as few assumptions as possible about what the example data means and how to translate it to a database schema. When we built Prodigy, we didn’t want to make any decisions here that’d be very difficult to reverse and could potentially lock users in and cause migration issues down the line. Instead, we focused on making the Database handler fully customisable.
If you do have more specific requirements (or opinions on how you want your database to be structured), Prodigy lets you pass in your very own Database class that can take full control over how to store and retrieve the data. It just needs to expose the methods that Prodigy expects. (Also see this thread for more details.)
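To make the idea of a custom handler a bit more concrete, here is a minimal sketch of a class that writes a few conventional task keys into real columns instead of a blob, so they can be queried or joined with plain SQL. It uses SQLite from the standard library only to stay self-contained. The class name StructuredDatabase, the table layout, and the two method names add_examples and get_dataset are assumptions for illustration; the full set of methods Prodigy actually expects, and how to register the class, should be taken from the Prodigy documentation rather than from this sketch.

```python
import json
import sqlite3


class StructuredDatabase:
    """Hypothetical handler that keeps common task keys in their own columns.

    Only two illustrative methods are shown; a real handler would need to
    expose every method Prodigy expects (not covered here).
    """

    # Keys promoted to columns; everything else goes into a JSON "extra" column.
    COLUMNS = ("text", "label", "answer")

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS annotation ("
            " id INTEGER PRIMARY KEY,"
            " dataset TEXT NOT NULL,"
            " text TEXT,"
            " label TEXT,"
            " answer TEXT,"
            " extra TEXT NOT NULL)"
        )

    def add_examples(self, examples, datasets):
        """Store each task dict once per dataset name."""
        rows = []
        for eg in examples:
            extra = json.dumps(
                {k: v for k, v in eg.items() if k not in self.COLUMNS}
            )
            for name in datasets:
                rows.append(
                    (name, eg.get("text"), eg.get("label"), eg.get("answer"), extra)
                )
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO annotation (dataset, text, label, answer, extra)"
                " VALUES (?, ?, ?, ?, ?)",
                rows,
            )

    def get_dataset(self, name):
        """Rebuild the task dicts for one dataset, in insertion order."""
        cursor = self.conn.execute(
            "SELECT text, label, answer, extra FROM annotation"
            " WHERE dataset = ? ORDER BY id",
            (name,),
        )
        examples = []
        for text, label, answer, extra in cursor:
            eg = json.loads(extra)
            for key, value in zip(self.COLUMNS, (text, label, answer)):
                if value is not None:
                    eg[key] = value
            examples.append(eg)
        return examples


# Usage: store one annotated task, then query it with ordinary SQL.
db = StructuredDatabase()
db.add_examples(
    [{"text": "Great product", "label": "POSITIVE", "answer": "accept", "spans": []}],
    datasets=["reviews"],
)
label_counts = db.conn.execute(
    "SELECT label, COUNT(*) FROM annotation"
    " WHERE answer = 'accept' GROUP BY label"
).fetchall()
```

Because the label and answer live in ordinary columns, they can be inspected or edited from any database browser and joined against other tables. The trade-off is the one described above: the handler has to commit to a schema, so any keys it doesn't know about still end up in a catch-all JSON column.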