Potential buyer here. I have audio data that simultaneously needs (1) transcription; (2) entity annotation; and (3) canonical values (e.g. text to digit conversion). Is there a recipe to annotate and transcribe at the same time? How much effort would it take to be able to use Prodigy in this way?
Hi! In general, Prodigy always lets you load in existing datasets and re-annotate them to add more information – so you could run a transcription task and then feed the resulting data forward to a recipe that lets you highlight entities, or a combined interface for entity annotation and canonical values, depending on what you need. You can combine existing interfaces, e.g. to add a text input or multiple choice options to a manual span highlighting UI. See here for an example with the recipe code.
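For instance, here's a rough sketch of a combined recipe (the recipe name and the "canonical" text field are just placeholders, and it assumes Prodigy v1.x-style components) that puts a manual span-highlighting block and a free-form text input in one interface:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe(
    "transcript.ner-canonical",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file with a 'text' key", "positional", None, str),
    label=("Comma-separated entity labels", "option", "l", str),
)
def transcript_ner_canonical(dataset, source, label="PERSON,ORG"):
    """Highlight entity spans and collect a canonical value in one interface."""
    nlp = spacy.blank("en")           # only used for tokenization
    stream = JSONL(source)            # records need a "text" key
    stream = add_tokens(nlp, stream)  # ner_manual expects pre-tokenized text
    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "text_input", "field_id": "canonical",
         "field_label": "Canonical value", "field_rows": 1},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"labels": label.split(","), "blocks": blocks},
    }

You'd start it with the -F flag pointing at the recipe file, e.g. prodigy transcript.ner-canonical my_dataset ./data.jsonl --label PERSON,ORG -F recipe.py.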
That said, transcribing and annotating entities at the same time isn't easily possible, because Prodigy needs to preprocess the text for entity annotation. However, it's also something we wouldn't recommend.
From what you describe, transcription will likely be the most labour-intensive and probably also the most error-prone component of your data creation process. Named entities are all about boundaries, and where a span starts and ends matters a lot. So any small mistake in the transcription can easily invalidate all other annotations created at the same time. It'd probably be more efficient to do the transcription first, review the data (even if it's just by taking samples) and then add more annotation layers on top. Focusing on one annotation task at a time can also be more efficient for your annotators, because they don't constantly have to switch between doing different things, which can make it harder to focus and lead to mistakes.
Any example of how to do the "run a transcription task and then feed the resulting data forward to a recipe that lets you highlight entities" part?
I have an exercise which requires annotation and 7 rounds of classification.
Sure! The specifics of course depend on what you want to annotate, but here's a basic workflow example:
Step 1: Transcribe the files
prodigy audio.transcribe transcript_data ./files --autoplay
Step 2: Export the data
prodigy db-out transcript_data > transcript_data.jsonl
The resulting data in transcript_data.jsonl should look like this:
{"path": "./file/audio.mp3", "transcript": "This is the transcript"}
You can then proofread/check the transcripts and convert the data so it has a "text" key instead, like this (I want to add an option to do this automatically in the audio.transcribe recipe in the future):
{"path": "./file/audio.mp3", "text": "This is the transcript"}
Step 3: Annotate the text further
You can now use your converted transcript_data.jsonl and annotate it further, for instance with ner.manual to label entities:
prodigy ner.manual transcript_ner blank:en ./transcript_data.jsonl --label PERSON,ORG
Or to assign text categories for training a text classifier:
prodigy textcat.manual transcript_textcat ./transcript_data.jsonl --label SPORTS,POLITICS
Or pretty much anything else you can do with raw text in Prodigy.
Hi Ines, thanks for the quick reply. We want to split the annotation from the modeling and have the annotation done by a different team that is more content-driven and less technical. The task is to classify a text against multiple sets of choices, which means multiple blocks of type choice.
I see your recommendation against creating such a big interface, and I agree with the reasoning. However, there is a logical relationship between the categories, and therefore a benefit in having all of them available at the same time.
What I see as a solution is to classify the first category, then the second, ..., and only move on to the next task once all categories are done.
I made a gif that might explain the use case better. Could I implement this with validate_answer? Is a different option available? Is there a recipe I can draw inspiration from?
Yeah, that makes total sense and is pretty common – however, I'm not sure I'd call it "split annotation from modelling", because that's not something we'd recommend doing. The annotation scheme is crucial for the modelling decisions, so that should be tightly coupled with the development process. Who does the annotation in the end is of course a different question.
I'm not sure where validate_answer could fit in here. But even if you use a stream generator that checks for the most recent answer, and "instant_submit": True to make sure an answer is submitted immediately when the user makes a decision, it's difficult to prevent a race condition or have the user wait until the next question is computed on the server.
If you want full flexibility, I think a better approach would be to just implement your own form using an HTML template with checkboxes and custom JavaScript that updates the task object (and possibly shows/hides further options). I'm going through this in more detail in my comment here:
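Just to sketch the general idea (this is not the code from that comment, and the category names are made up): an html block can render the checkboxes, and custom JavaScript can write the selection back into the task via window.prodigy.update. Roughly:

import prodigy
from prodigy.components.loaders import JSONL

# Hypothetical category names, just for illustration
OPTIONS = ["CATEGORY_A", "CATEGORY_B", "CATEGORY_C"]

HTML_TEMPLATE = "".join(
    f'<label style="display:block"><input type="checkbox" data-cat="{opt}" '
    f'onchange="updateCats()" /> {opt}</label>'
    for opt in OPTIONS
)

JAVASCRIPT = """
function updateCats() {
    const selected = [];
    document.querySelectorAll('input[data-cat]').forEach(cb => {
        if (cb.checked) selected.push(cb.dataset.cat);
    });
    // Merge the current selection into the task object that gets saved
    window.prodigy.update({ selected_cats: selected });
}
"""

@prodigy.recipe(
    "textcat.custom-form",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file with a 'text' key", "positional", None, str),
)
def textcat_custom_form(dataset, source):
    """Show the text plus a custom checkbox form for multiple category groups."""
    blocks = [
        {"view_id": "text"},
        {"view_id": "html", "html_template": HTML_TEMPLATE},
    ]
    return {
        "dataset": dataset,
        "stream": JSONL(source),
        "view_id": "blocks",
        "config": {"blocks": blocks, "javascript": JAVASCRIPT},
    }

The JavaScript also gives you a place to show or hide dependent options based on what's already checked, which is hard to express with the built-in choice blocks alone.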