Annotation for document segmentation

Dear Explosion AI team,

I am wondering if there is an option in Prodigy to annotate blocks of text and highlight the annotations as blocks, such as:


I think it would be useful for document segmentation tasks. In my case, I am trying to implement an annotation workflow for CV parsing with Prodigy, one step of which is to segment the work experience section into experience items. However, when we annotate items, it is hard to see in the interface whether two consecutive items are annotated as one or two entities:

I understand that for many use cases it is best to annotate the first (and last, if needed) token of each segment, instead of the blocks - but for this task it doesn't look very good either:

Additionally, the annotators would have to switch between two tags for start token and end token.

Can you please advise on how to go about this?

Hi! This is an interesting question! Framing segmentation as a sequence labelling task definitely seems unnecessarily complicated and a lot more work (with more potential for human error). The second approach of only labelling the start and end tokens would have been my initial recommendation – especially if you have raw unsegmented text and no way of knowing where a segment may start or end.

In your case, it seems like you can use line breaks to at least get a rough idea of the blocks, right? To achieve a UI like the one in your first screenshot, you could make the card slightly wider and use a monospace font to ensure that the individual lines don't wrap. Then you could add one entry to the "tokens" for each line. This would make each line selectable by double-clicking it, and you could more easily drag across to select multiple lines (without having to actually select all words).
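For illustration, here is a minimal sketch of what "one entry in the tokens per line" could look like when pre-processing the input. The helper names line_tokens and make_task are made up for this example, and it assumes the usual token dict keys ("text", "start", "end", "id", "ws"). Blank lines are skipped and line breaks are not emitted as tokens of their own, which is an assumption you may need to adjust depending on how the text should render.

```python
def line_tokens(text):
    """Return one token dict per non-empty line of `text`.

    Hypothetical helper: character offsets point into the original text,
    so text[tok["start"]:tok["end"]] == tok["text"] for every token.
    """
    tokens = []
    offset = 0  # character offset of the current line within `text`
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped:
            # Skip leading whitespace so the token starts at the first
            # visible character of the line.
            start = offset + (len(line) - len(line.lstrip()))
            tokens.append({
                "text": stripped,
                "start": start,
                "end": start + len(stripped),
                "id": len(tokens),
                # Assumption: no trailing space is rendered after a line token.
                "ws": False,
            })
        offset += len(line) + 1  # +1 for the "\n" removed by split()
    return tokens


def make_task(text):
    """Wrap raw text into a task dict with pre-defined line-level tokens."""
    return {"text": text, "tokens": line_tokens(text)}


example = "Engineer, Acme\n  2019-2021\n\nAnalyst, Foo"
task = make_task(example)
```

Each task produced this way could then be sent out by a manual span annotation recipe in place of the word-level tokens, so that selecting a span snaps to whole lines.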

Hi! Thank you, changing the tokenizer this way is a very good idea. It became much easier to select multiple lines. I'm now trying to customize the UI to show the selected text as a block, but I'm not sure yet if that's possible.

Yay, glad it worked :slightly_smiling_face:

You could try setting .prodigy-content span, .prodigy-content mark { display: block } in the global_css. This will make each "token" and selection (in your case, the lines) a block element.

Just did a quick test locally and the selection and double-clicking still work as expected. If you also add something like margin-bottom: 5px, it actually ends up looking very similar to your first screenshot.
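As a rough sketch, the CSS from above could be assembled into a config like this. The build_config helper is made up for this example, and putting margin-bottom into the same rule as display: block is an assumption about where the margin should go. The resulting "global_css" entry could go into prodigy.json or into the "config" returned by a custom recipe.

```python
import json

# Render every token and every highlighted span as its own block.
# Assumption: the margin is applied in the same rule as display: block.
BLOCK_CSS = (
    ".prodigy-content span, .prodigy-content mark "
    "{ display: block; margin-bottom: 5px }"
)


def build_config(extra=None):
    """Return a config dict with the block-style CSS (hypothetical helper)."""
    config = {"global_css": BLOCK_CSS}
    if extra:
        # Let callers add or override other settings.
        config.update(extra)
    return config


config = build_config()
print(json.dumps(config, indent=2))
```

Keeping the CSS in one constant makes it easy to tweak the margin or add a border while checking how the blocks look in the UI.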


Awesome, thanks a lot!
