Workflow for sequential sentence classification


I'm interested in annotating all sentences within a document with some sentence-level labels but I want to display the whole document to the annotator so each sentence is seen in context.

The Text Classification recipe does sentence-level labels but doesn't show the whole document. The NER recipe can display a long document but the labels are token-level.

Here's an example dataset (and paper) that shows the desired end result

Do you have any ideas for a custom recipe that would support this workflow?

Hi! I think your use case sounds very similar to the one described in this thread and I shared some ideas for a solution in the comment:

Thank you, Ines.

Both 1) sentence-is-a-token and 2) choice interface would help, especially if there are few possible labels.

What if I have 5+ non-exclusive labels (i.e. multi-label classification for each sentence) and/or I have categories that are not binary but are ideally annotated with real values? I understand I can't have overlapping span tags so with, say, 10 non-exclusive labels, I'd need to re-do the example 10 times with option 1). With option 2) my list of choices would be fairly long too - say 10 sentences * 10 choices. Any ideas or am I moving too far away from the workflows you are optimising for?

Yeah, I see the point! How much context do you need around the sentences in order to make the annotation decision? If it's just like one additional sentence on either side, you could stream in each sentence together with the previous and next one, highlight the current focus sentence (e.g. by adding an entry in the "spans" or by displaying the context in grey) and then add choice "options" for the labels you want to assign.

So each task could look like this, and you could render it with the choice UI with "choice_style": "multiple" to allow multiple selections.

    {
        "html": "<span style='color: grey'>Previous sentence.</span> Current sentence. <span style='color: grey'>Next sentence.</span>",
        "options": [
            {"id": 1, "text": "LABEL 1"},
            {"id": 2, "text": "LABEL 2"}
            // etc.
        ]
    }

Thanks, I think I almost got it to work by splitting each document into sentences and building the HTML for each sentence plus its context, so that each focus sentence is a separate task/example. I also had to overwrite the hashes using set_hashes.
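
For anyone following along, here is a rough sketch of that sentence-plus-context setup in plain Python. It assumes the documents are already split into sentences; make_tasks, grey, LABELS, doc_idx and sent_idx are hypothetical names chosen for illustration, not part of Prodigy. It only builds the task dicts and leaves out the hashing step.

```python
from html import escape

# Placeholder label names; replace with the real label set.
LABELS = ["LABEL 1", "LABEL 2"]


def grey(text):
    """Wrap a context sentence in a grey span, escaping any HTML in it."""
    return f"<span style='color: grey'>{escape(text)}</span>"


def make_tasks(docs, labels=LABELS):
    """Yield one task per sentence.

    docs is assumed to be a list of documents, each a list of sentence strings.
    """
    for doc_idx, sentences in enumerate(docs):
        for sent_idx, sentence in enumerate(sentences):
            parts = []
            if sent_idx > 0:  # previous sentence as grey context
                parts.append(grey(sentences[sent_idx - 1]))
            parts.append(escape(sentence))  # the focus sentence
            if sent_idx + 1 < len(sentences):  # next sentence as grey context
                parts.append(grey(sentences[sent_idx + 1]))
            yield {
                "html": " ".join(parts),
                "options": [
                    {"id": i, "text": label}
                    for i, label in enumerate(labels, start=1)
                ],
                # Hypothetical keys used to re-group sentences later.
                "doc_idx": doc_idx,
                "sent_idx": sent_idx,
            }
```

Because every task carries its own document and sentence index, the stream can be flattened without losing track of which document a sentence came from.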

When I start the first annotation, the document-sentences show up in the right order starting from the first document in my dataset but every time I refresh I get to a random highlighted sentence in a random document. I don't have any sorting logic in my recipe so I'm not sure why this is happening. I'd like to continue from where I left off. Does it have anything to do with the hashes?

Related question, what would be the best way to keep the information that all the sentence-level annotations form a single document? Is there anything better than adding key/values for doc_idx/sentence_idx to each annotation and re-group them afterwards outside of Prodigy?

Ah, I think that's because Prodigy will send out the next batch by default when you reload and request another batch. If a batch hasn't come back by the end of a session, it will be resent when you restart the server. You can set "force_stream_order": True in the "config" returned by your recipe to ensure that tasks are always re-sent in the exact same order until they're answered. (Just make sure you don't have multiple people connecting to the same session then, otherwise you may see duplicates.)
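
As a sketch, the components returned by the recipe could then look roughly like this. build_components is a hypothetical helper name; in an actual recipe this dict would be the return value of the recipe function, and the "choice" view and both config settings are the ones mentioned in this thread.

```python
def build_components(dataset, stream):
    """Return the dict a custom recipe function would hand back to Prodigy."""
    return {
        "dataset": dataset,   # name of the dataset to save annotations to
        "stream": stream,     # iterable of task dicts
        "view_id": "choice",  # render tasks with the choice UI
        "config": {
            "choice_style": "multiple",  # allow several labels per sentence
            "force_stream_order": True,  # re-send tasks in the same order
        },
    }
```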

I think adding two IDs (document, sentence in document) would be the most straightforward approach, yes. This makes it pretty easy to extract the labels for any given sentence later on.
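
A minimal sketch of the re-grouping step, assuming each saved example carries the two IDs under the hypothetical keys doc_idx and sent_idx, and that the choice UI stores the selected option IDs in an "accept" list alongside the usual "answer" value:

```python
from collections import defaultdict


def group_by_document(examples):
    """Map each doc_idx to a list of (sent_idx, selected option IDs).

    Only examples answered with "accept" are kept; sentences are returned
    in document order.
    """
    grouped = defaultdict(list)
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        grouped[eg["doc_idx"]].append((eg["sent_idx"], eg.get("accept", [])))
    return {
        doc_idx: sorted(sentences, key=lambda pair: pair[0])
        for doc_idx, sentences in grouped.items()
    }
```

The input here would be the examples exported from the dataset, loaded as a list of dicts.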

This is all looking quite smooth, thank you.

A couple of UI questions that come up because of longer text:

  1. The longer text hides the choice options because they appear after the text. Is it possible to display the option to the right/left of the main text? I only found that you can change the colours here.
  2. Is it possible to extend the History of annotation? Sometimes I realise I need to go back (backspace shortcut) many sentences but the history seems to keep only a dozen or so by default.