Multiple annotators with different data

api
solved

(Kevin) #1

Hi, and thanks in advance for your help.

I’m working with multiple annotators and I want them to always receive different data to annotate, but I do want the different annotators to get the same data once.

In order to get that done I’ve been following the instructions given here and mostly here with the multi_annotator_manager recipe.

What I want to do is to have the data fetched from an API with a user_id attached to the request that will return a set of data and update a database record of which data was given to each user. I have two questions that would really help me get that done.

  1. How can I attach something like a user_id to the request sent to an API? (a valid user_id and password combination will be required to access the prodigy server)
  2. What JSON format does prodigy expect to receive or how can I set that up?

Thank you, and regards.


(Ines Montani) #2

This definitely sounds feasible. Do you want each annotator to annotate with a model in the loop, or do you have static data you want to go through more or less in order?

I think the best solution here would be to put your service in the middle and let it handle the authentication. So, the user “logs on” and makes a request to your service. Your service authenticates the user and creates a session token etc. If this was successful, your service will start a Prodigy session for the user, and pass the user ID and all other details to the recipe. The recipe will then communicate with your service and request a stream of tasks. Your service will know which user is making requests, so it can construct the stream accordingly.

You can also look at the prodigy.serve function in prodigy/__init__.py if you want to implement your own solution that executes a recipe and starts the Prodigy server. But I’m not even sure this will be necessary in your case.

You can find more details on the exact formats in your PRODIGY_README.html (available for download with Prodigy). If you’re looking for the format of the annotation tasks in the stream, see the “Annotation task formats” section. A stream is an iterable of dictionaries, with one dictionary describing an annotation task. So your API could simply return a list of objects, e.g. [{"text": "hello world"}] etc.

To avoid exhausting the stream, you might want to write a little wrapper that keeps making requests if the queue is running low. I posted a little example for a custom loader in this thread – the example was supposed to show how to implement your own data loader for Twitter etc., but you can also easily adapt it for your use case:

import requests

def custom_loader():
    page = 0  # if the API is paged, keep a counter
    while True:
        r = requests.get('http://some-api', params={'page': page})
        response = r.json()
        for item in response['results']:  # or however it's structured
            yield {'text': item['text']}  # etc.
        page += 1  # after the page is exhausted, increment

You can also add any other custom properties to your annotation task – like a user identifier. Anything that you add to a task’s "meta" object will be displayed in the bottom right corner of the annotation card in the web app.
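As a rough illustration of the two points above, here is a minimal sketch of a loader that sends a user ID with each request and copies it into each task's "meta". The `user_id` and `page` query parameters, the `results` key and the helper names `fetch_page` and `user_stream` are assumptions about a hypothetical service, not part of Prodigy. The sketch uses the standard library instead of requests so that it has no extra dependencies, and it stops on an empty page rather than requesting forever.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(api_url, user_id, page):
    """Request one page of tasks for a user and return the decoded JSON body.

    The query parameter names are assumptions about a hypothetical service.
    """
    query = urlencode({"user_id": user_id, "page": page})
    with urlopen(f"{api_url}?{query}") as resp:
        return json.load(resp)


def user_stream(api_url, user_id, fetch=fetch_page):
    """Yield annotation tasks for one annotator, tagged with the user ID."""
    page = 0
    while True:
        results = fetch(api_url, user_id, page).get("results", [])
        if not results:
            break  # the service has nothing more for this user
        for item in results:
            # Anything placed in "meta" is shown on the annotation card.
            yield {"text": item["text"], "meta": {"user_id": user_id}}
        page += 1  # move on once the current page is exhausted
```

Because the generator is lazy, the next page is only requested once the tasks from the previous one have been consumed. Passing a different `fetch` callable makes it possible to try the loader without a running service.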

When Prodigy processes a stream, it will assign an _input_hash based on the input text, and a _task_hash based on the input and the features to annotate, e.g. the spans or labels. This lets you determine whether two tasks are the same. So your service can look at the task hashes, and check if a user has already annotated a task. It can also check if the tasks that went out to the user all came back annotated – and if not (for example, if the user just closes their browser and doesn’t save), send them out again.
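The bookkeeping described above could look roughly like the sketch below. It assumes the service stores, per user, the hashes of the tasks it sent out and the hashes that came back annotated. The `task_hash` function is only a stand-in for tasks that carry no `_task_hash` yet; it is not Prodigy's own hashing algorithm, and all function names here are hypothetical.

```python
import hashlib
import json


def task_hash(task):
    """Stand-in hash over text and label (NOT Prodigy's own algorithm)."""
    key = json.dumps(
        {"text": task.get("text"), "label": task.get("label")}, sort_keys=True
    )
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def get_hash(task):
    """Prefer the hash Prodigy assigned, fall back to the stand-in."""
    return task["_task_hash"] if "_task_hash" in task else task_hash(task)


def filter_seen(stream, seen_hashes):
    """Yield only the tasks this user has not annotated yet."""
    for task in stream:
        if get_hash(task) not in seen_hashes:
            yield task


def unanswered(sent_hashes, answered_hashes):
    """Hashes that went out to a user but never came back annotated."""
    return set(sent_hashes) - set(answered_hashes)
```

Tasks whose hashes show up in `unanswered` are the ones the service would queue again for that user.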

If you haven’t seen it already, there’s also this thread on using multiple annotators, in which I explain a bit more about the hashing.


(Kevin) #3

Hi, sorry for the late response. But thank you very much, that was exactly what I needed. I have an API that fetches data entries from a specific folder and adds them to a database, that way it knows which pieces of data were sent to whom and in the case it’s needed, more data can be added dynamically. Prodigy then gets data from that API with the user_id attached so that it knows not to repeat it.

Thank you again!
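For readers who want to try a similar setup, here is a minimal sketch of the kind of record keeping described in this post. The actual implementation was not shared, so the SQLite schema, the table names and the functions `init_db`, `add_items` and `next_batch` are all assumptions made for illustration. In this sketch each user receives a given item at most once, while different users may receive the same item.

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) tables for items and per-user assignments."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assignments "
        "(user_id TEXT NOT NULL, item_id INTEGER NOT NULL, "
        "PRIMARY KEY (user_id, item_id))"
    )


def add_items(conn, texts):
    """Add new data entries; this can be called at any time."""
    conn.executemany("INSERT INTO items (text) VALUES (?)", [(t,) for t in texts])
    conn.commit()


def next_batch(conn, user_id, size=10):
    """Return up to `size` tasks this user has not received and record them."""
    rows = conn.execute(
        "SELECT id, text FROM items WHERE id NOT IN "
        "(SELECT item_id FROM assignments WHERE user_id = ?) "
        "ORDER BY id LIMIT ?",
        (user_id, size),
    ).fetchall()
    conn.executemany(
        "INSERT INTO assignments (user_id, item_id) VALUES (?, ?)",
        [(user_id, item_id) for item_id, _ in rows],
    )
    conn.commit()
    return [{"text": text, "meta": {"user_id": user_id}} for _, text in rows]
```

An API endpoint could call `next_batch` with the user ID from the request and return the resulting list as JSON, which matches the list-of-objects format mentioned earlier in the thread.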


(Ines Montani) #4

@kevinrosenberg21 Yay, so nice to hear that it’s working! Definitely keep us updated on the results.

Btw, if you ever end up writing a blog post about your solution, let me know and we’d be happy to add it to the docs. You don’t have to share all your code – but just showing how you did it, how it works and how you’re using Prodigy in your annotation projects would be super cool. I’m sure the Prodigy community would love this as well :smiley:


(Kevin) #5

I definitely will. We are currently working on getting our whole solution working (either spaCy with the custom features, or a model trained with a different library that has them, together with a customized prodigy to train it and the multiple annotators’ back end). So once we have a final version that we know is working, at least in our testing environment, we’ll get to work on what we can contribute to the community and how. We are an Argentine startup developing applied AI for human resources, backed by one of the largest HR consulting firms in the country and two investment groups, with solid plans to expand to the rest of Latin America and eventually Europe and the US. As a first step, we are developing a resume parsing solution using spaCy and prodigy as well as some of our own tools that add domain knowledge. Our plan is to keep using spaCy and prodigy for NLP once this first step is done. If you think there’s more we can contribute, we would love to collaborate with you guys.