I'm a consultant who frequently works with clients that want NER. In some cases, I've been exporting datasets from Pandas to Google Sheets and having my clients label the data with their own team.
For a project I'm working on right now, I'm breaking up a corpus of documents into sentences and building a labeled set based on a few categories (does the sentence contain the named entity of interest?). My client told me my awesome Google Sheet is difficult to use because it doesn't show where the sentences sit in the original documents. They suggested showing 2 sentences on either side of the sentence of interest as context while labeling it. I could build this in Django, but is it possible to format the data so that Prodigy displays it this way and lets the annotator apply a single label to one sentence?
For example:
Sentence 58: I am a sentence.
Sentence 59: There are many like me.
Sentence 60: This sentence has an entity of interest!
Sentence 61: This sentence does not.
Sentence 62: Also, this sentence does not.
Buttons: cat 1, cat 2, cat 3, cat 4
Sentence 60 is the one that needs to be labeled, but the two sentences before and after it are shown as context. I admit, for this exact project, this would be very helpful given the language. Could this be done with Prodigy?
Hi! It should definitely be possible to present your data this way in Prodigy – after all, the app will render whatever you give it, if it can be rendered with any of the available interfaces or HTML.
The most elegant solution would probably be to stream in your sentences in triples (or wait, quintuples in your case?) and present the previous and next sentences, with the current sentence highlighted. For example, if you're just classifying the sentence (and not the entities themselves), your input task could look like this:
{
"sentence": "This sentence has an entity of interest!",
"html": "Prev sentence. <strong>This sentence has an entity of interest!</strong> Next sentence",
"options": [{"text": "cat1", "id": "CAT1", ...}]
}
You can then render it with two blocks: an html block to render the HTML, and a choice block for your 4 category options. When you export the annotated data, the value of "sentence" gives you the original raw sentence. Alternatively, you could also just add "sentence_id": 60 or whatever else you want to track.
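To make the idea above more concrete, here's a rough sketch of how the stream of tasks could be built in plain Python. The names make_tasks, CATEGORIES, window and BLOCKS_CONFIG are made up for this example, the category labels are placeholders, and the blocks config is only assumed to match what your Prodigy version expects, so treat it as a starting point rather than a finished recipe.

```python
from html import escape

# Placeholder category names; replace with the real labels.
CATEGORIES = ["cat1", "cat2", "cat3", "cat4"]

# Assumed shape of the two-block layout described above (html + choice).
BLOCKS_CONFIG = {"blocks": [{"view_id": "html"}, {"view_id": "choice"}]}


def make_tasks(sentences, window=2, categories=CATEGORIES):
    """Yield one task per sentence, with up to `window` context
    sentences on each side and the current sentence in <strong>."""
    for i, sentence in enumerate(sentences):
        before = sentences[max(0, i - window):i]
        after = sentences[i + 1:i + 1 + window]
        # Escape the raw text so stray "<" or "&" can't break the markup.
        parts = [escape(s) for s in before]
        parts.append(f"<strong>{escape(sentence)}</strong>")
        parts.extend(escape(s) for s in after)
        yield {
            "sentence": sentence,   # raw sentence, kept for the export
            "sentence_id": i,       # position in the document (0-based)
            "html": " ".join(parts),
            "options": [{"id": c.upper(), "text": c} for c in categories],
        }
```

Sentences near the start or end of a document simply get fewer context sentences. If sentences come from several documents, you'd probably want to call make_tasks once per document so the context never crosses a document boundary.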
If you're working with an interface like ner_manual to do actual entity highlighting, you can also set tokens to "disabled": true and make them unselectable – for example, all tokens in the previous and next sentence. This means that the annotator will see the context tokens in grey, but will only be able to annotate the current sentence (you can see an example of disabled tokens here).
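For the disabled-token approach, a task could be assembled roughly like this. This is only a sketch: make_ner_task is a hypothetical helper, the whitespace split stands in for a real tokenizer (in practice you'd likely use the same tokenizer as your model), and the token fields are assumed to follow the usual "text" / "start" / "end" / "id" / "ws" layout, with "disabled": True added to the context tokens.

```python
def make_ner_task(sentences, target_index, window=2):
    """Build one task whose text spans the target sentence plus context.
    Tokens outside the target sentence are marked as disabled."""
    first = max(0, target_index - window)
    last = min(len(sentences), target_index + window + 1)
    tokens, words, offset = [], [], 0
    for idx in range(first, last):
        # Naive whitespace tokenization, used here only for illustration.
        for word in sentences[idx].split():
            token = {
                "text": word,
                "start": offset,
                "end": offset + len(word),
                "id": len(tokens),
                "ws": True,  # followed by a space in the joined text
            }
            if idx != target_index:
                token["disabled"] = True  # context only, not selectable
            tokens.append(token)
            words.append(word)
            offset += len(word) + 1  # +1 for the joining space
    if tokens:
        tokens[-1]["ws"] = False  # no trailing space after the last token
    return {
        "text": " ".join(words),
        "tokens": tokens,
        "sentence_id": target_index,
    }
```

Because the character offsets are computed against the joined text, any spans the annotator creates can be mapped back to the target sentence by subtracting the start offset of its first token.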