I am looking for tools to assist in a user-assisted name disambiguation task. For example, if I have a network of people in which some know each other, I might predict that Person X knows Person Y. I then want to show the user relevant supporting information and have them confirm or deny whether the two actually know each other.
I would need to customize Prodigy's UI fairly significantly to accommodate the information shown to the user. Does Prodigy allow me to customize it this way? If not, do you have any recommendations?
Sure, this sounds like a pretty straightforward task! The HTML interface lets you render pretty much anything, so you can include an image, text, external links, and design it however you like. Your system could then, for instance, output data along these lines:
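(A rough sketch: the "html" field is what the html interface renders, and everything under "meta" is just placeholder context for your use case.)

```json
{
  "html": "<img src='person_x.jpg' width='120' /> <strong>Person X</strong> may know <strong>Person Y</strong> (<a href='https://example.com/profiles/person-y'>view profile</a>)",
  "meta": {"source": "link prediction", "score": 0.87}
}
```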
For more complex formatting, you can also use HTML templates that let you insert values from your JSON-formatted data. The result will be the same, but you won’t have to pre-compile the markup for each example.
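For example, assuming a template set via "html_template" in your prodigy.json (a sketch; the variable names are made up, and the templates use Mustache-style {{variable}} syntax):

```json
{
  "html_template": "<img src='{{image_url}}' width='120' /> <strong>{{person_x}}</strong> may know <strong>{{person_y}}</strong>"
}
```

Each incoming task then only needs to provide those fields, e.g. {"person_x": "Person X", "person_y": "Person Y", "image_url": "https://example.com/x.jpg"}, and Prodigy renders the markup for you.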
A good recipe to start with is the prodigy mark recipe: it takes an input stream of examples and will present them to the annotator as they come in. The annotated data you receive back will be identical to the JSON input – just with an added "answer" key that's either "accept", "reject" or "ignore".
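To start annotating with the html interface, the command would look something like this (the dataset name and input file are placeholders):

```
prodigy mark person_links ./person_links.jsonl --view-id html
```

The annotations end up in the person_links dataset, and you can export them later with prodigy db-out.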
Additionally, I need to crowdsource annotations from a variety of users who will act as the annotators. Essentially, I am working on an open data science project in which people will use the application remotely and provide the annotations.
Does Prodigy provide capabilities for this task? If not, do you have any recommendations?
Prodigy is a regular Python library and starts the web server on a given host and port – so you can write your own logic around how to serve the app and when to start and stop the server. This project developed by a Prodigy user is a nice example of this: GitHub - ahalterman/multiuser_prodigy: Running Prodigy for a team of annotators. We're also currently working on an extension product, the Prodigy Annotation Manager, which will help with scaling up projects, creating larger datasets and managing multiple annotators.
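For example, a minimal sketch of starting the server from Python (the exact prodigy.serve call depends on your Prodigy version, and the dataset and file names here are placeholders):

```python
import prodigy

# Start the built-in "mark" recipe programmatically so you can wrap your own
# logic around when the server starts and stops. The positional arguments
# mirror the CLI (dataset, source, view ID); host and port are config overrides.
prodigy.serve('mark', 'person_links', './person_links.jsonl', 'html',
              host='0.0.0.0', port=8080)
```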
The only thing that's important to keep in mind is that the app can't be made publicly available online – only via an internal or password-protected URL and only accessible to your annotators.
@daniel_d We’re hoping to start accepting private alpha/beta users (existing Prodigy users only) in August and start the public beta in September (see the roadmap for more details on what will be included).
Does this apply to crowdsourcing with platforms like Figure8 or MechanicalTurk?
If I develop a crowdsourcing task with Prodigy and deploy it in such a way that only crowdworkers that accept the task (“hired”) can access it, would it be within the license agreement?
I think that probably depends on how Figure8 and MechanicalTurk work. I don’t know this in detail, sorry, so it’s hard to give a definitive answer.
If you’re able to set it up so that you host the Prodigy annotation tasks, and an authenticated set of users access it and do the work, and you receive the annotations, I think that should be valid. I’d be interested to hear how you go with this.
Yes, I would host it. It works more or less like this:
I give Figure8 a link to my server, tell them how much I’ll pay per task (e.g. €1 per 10 NER annotations) and how many people I want to hire (let’s say 100). They publish it on their platform, accessible only to their registered users.
If someone likes the deal, they click the link and get redirected to my server; my server gives them 10 texts to annotate and instructs Figure8 to pay them. When I reach 100 people, the task is taken down from the platform.
In a way, it’s like hiring annotators through an “Annotators’ Uber”.
As a side note, considering Prodigy’s annotation manager roadmap, if you are not doing it already, I would suggest you take a look at the crowdsourcing/human-in-the-loop literature, or partner with someone in that space.
I think you’ll probably find that’s actually a much worse way to collect annotations than having a smaller group of annotators working for longer, who you can actually talk to.
Crowdsourcing actually has a lot of problems, and isn’t the most practical approach for most real projects. It’s very popular in the academic literature because it fits a number of constraints specific to academic projects. Academic projects need annotated data collected in bursts, and there are typically constraints from the university on how much, and under what conditions, labor can be commissioned. In many institutions, crowdsourcing platforms also provide a way around human subjects ethical review (which shouldn’t really apply to annotation work, but the whole problem is that the board does not always make sensible decisions).
For real projects, annotation usually needs to be ongoing, and it needs to be performed by people who you can give feedback to, and from whom you can get feedback about the schema. If you want to hook up Prodigy to something like AMT or Figure8, and you’re able to do that in a way that works within the licensing constraints, then by all means. But it’s not a feature that’s often requested, and it’s not one that we have planned for our roadmap.