entity labeling

Is there currently a solution that will allow users to manually highlight entities and tag them with a label? That is our current workflow, and it is important because physicians want to highlight the exact item they are looking for.

Also, is it possible for Prodigy to take a directory of text files as annotation data and return the annotated text to the database in gold parse JSON format? The users want to be able to read an entire clinical note all at once, highlight/tag the entities they are looking for and then move on to the next note. They don’t want to read sentence by sentence, rejecting or accepting matched phrases.

Is there also a roadmap for managing annotations among a group of annotators with an adjudication process? Ideally the annotators would work on their own portion of notes, but we want to randomly share small batches of notes among all annotators so we can check concordance.

Thanks for your questions!

For the first release, we've mostly been focusing on Prodigy's capabilities as a developer tool and on reengineering traditional annotation processes to help developers iterate on the data and run experiments faster. This means the current workflows aim to have the annotator do as little as possible, with the interface focusing on one decision at a time so annotation moves quickly. (You can see our latest NER video tutorial for an example of a development workflow like this. It also shows how to use word vectors and terminology lists to pre-label entities.)
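For example, if you already have a terminology list of clinical terms, you can convert it into a match patterns file and pass it to recipes like ner.teach via the --patterns argument, so the suggested spans are pre-highlighted for the annotator. Here's a minimal sketch – the file name, label and example terms are placeholders, not part of your data:

import json

# hypothetical terminology list – replace with your own clinical terms
terms = ["metformin", "lisinopril", "type 2 diabetes"]

# each line of the patterns file is a JSON object with a label and a pattern;
# a plain string pattern matches that exact phrase (token patterns also work)
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for term in terms:
        f.write(json.dumps({"label": "CLINICAL_TERM", "pattern": term}) + "\n")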

I totally understand your process, though – in some cases, it definitely makes sense to work through an entire document in one pass and label everything that needs to be labelled. This is currently not possible in Prodigy. However, we are working on new interfaces for those types of use cases, covering text, images (object detection and segmentation), and potentially audio files as well.

I'm not sure I understand the question correctly – do you mean importing raw text data to the database, but from a directory? Currently, prodigy db-in only works for single files. But you can easily process a whole directory using a simple shell script, or run the function from Python:

from pathlib import Path
from prodigy.__main__ import db_in

# import every text file in the directory into the dataset
for filename in Path('my_text_files').glob('*.txt'):
    db_in('my_dataset', str(filename))
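For getting the finished annotations back out, the examples are stored in the dataset as simple JSONL task dicts, which you can export with prodigy db-out or read directly from Python. A minimal sketch, assuming your dataset is called my_dataset and you're using the default database settings:

from prodigy.components.db import connect

# connect to the database configured in your prodigy.json (SQLite by default)
db = connect()

# each annotated example is a dict with the text, the spans and the answer
for eg in db.get_dataset('my_dataset'):
    print(eg['text'], eg.get('spans', []), eg.get('answer'))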

Yes, this is exactly what we had in mind for the Prodigy Annotation Manager. We're currently planning this as a Prodigy add-on, i.e. a separate package that plugs into your Prodigy workflow and extends the app with more functionality, including an annotation management console that lets you orchestrate larger annotation projects, handle quality control, etc. We don't have a timeline for this yet, but it's definitely something we've been thinking about a lot and have been experimenting with.

Thank you Ines! This is helpful. We are in the market for an annotation tool right now. I think Prodigy is great for a developer who needs to create a model with limited availability of expert annotators. You may be aware, but physicians are a bit picky about workflows, and they want it to work similarly to past tools. I will share what you mentioned with my manager. I like the progress of the entire Explosion AI ecosystem so far. You are creating something really cool here!

@jeweinb Thanks! And yes, I definitely understand that. Ultimately, the underlying idea behind Prodigy is that you’ll be able to build workflows that require less manual work, and let you use the annotators’ time more efficiently. This is especially relevant in cases like yours, where you’re relying on domain experts to do the labelling.

You might still need some number of manual annotations – but by testing out different approaches upfront and annotating with a model in the loop, you’ll eventually be able to focus on having the domain expert annotators (in your case, physicians) work with the model and give it feedback on the edge cases that are most relevant for training, i.e. the predictions the model is most uncertain about. Because the annotators will only have to click accept or reject, they’ll be able to annotate more examples (or you’ll need to ask for less of their time). And because the model in the loop is updated at runtime, it’ll be able to ask better questions as it learns from the feedback.
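As a rough sketch of what that could look like in practice – the dataset, model, source file, label and patterns file below are all placeholders, so adjust them to your setup:

prodigy ner.teach clinical_notes en_core_web_sm ./notes.jsonl --label CONDITION --patterns patterns.jsonl

The recipe streams in the notes, pre-highlights the spans suggested by the model (and by your patterns, if you pass any), and asks the annotator to accept or reject one suggestion at a time while the model is updated in the loop.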

That said, we actually shipped an ner_manual interface in the latest update to Prodigy. You can check it out in the live demo here. We’re also planning a number of improvements in the upcoming version – for example, you’ll be able to pre-define spans to be highlighted, which the user can then accept, delete or modify.
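To give you an idea, a manual labelling session over whole notes could be started with something like the following – the dataset, model, source file and labels here are placeholders:

prodigy ner.manual clinical_notes en_core_web_sm ./notes.jsonl --label "CONDITION,MEDICATION,PROCEDURE"

The annotator sees one note at a time, highlights spans by hand, assigns one of the given labels, and the result is saved to the dataset – which sounds pretty close to the workflow your physicians are used to.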