Label lines of text in a sequence

Hi everyone.

I have a series of long texts. I need to build a model that detects whether a line is a heading. We plan to build an RNN that will categorize every line based on input data and context.

Now we need to label the texts. We have to label the text line by line. We need to have the whole text on the screen because context matters a lot.
We also need to preserve whitespace and use a fixed-width font, but that’s easy to do with CSS.

The question is: can Prodigy annotate lines of text in a sequence of lines, so that the whole text is visible and we annotate it line by line?

How about something like this:


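import json
import sys

# Assumption: the document arrives on stdin, one line of text per line.
lines = sys.stdin.read().splitlines()
window = 5  # number of context lines on each side of the target

for i in range(len(lines)):
    # Clamp the start at 0: a negative slice start would wrap around and lose the context.
    prev_context = '\n'.join(lines[max(0, i - window) : i])
    target = lines[i]
    next_context = '\n'.join(lines[i + 1 : i + 1 + window])
    task = {
        'text': '\n'.join((prev_context, target, next_context)),
        # The target starts after the previous context plus the one newline that joins them.
        'spans': [{'start': len(prev_context) + 1, 'end': len(prev_context) + 1 + len(target)}],
        'label': 'HEADLINE',
        'line_number': i
    }
    print(json.dumps(task))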

You could use this script to print out the lines, and then pipe that forward into prodigy mark to get the annotation tasks.

The recipe given will highlight each line with up to 5 lines of preceding and following context. The lines come in order, so the annotator will be able to remember the wider document context to make the decisions.

If you need the whole document context visible at the same time, and the decisions editable at the same time, loading the data into a CSV spreadsheet is likely to be the best you can do.
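If you do go the spreadsheet route, here is a minimal sketch of how the file could be produced. The function name `lines_to_csv`, the column names and the file name `lines.csv` are all made up for illustration; the only idea taken from above is one row per line, with an empty column for the annotator to fill in.

```python
import csv


def lines_to_csv(lines, out):
    """Write one row per line of text, with an empty column for the label."""
    writer = csv.writer(out)
    # Hypothetical column names: rename them to whatever suits your project.
    writer.writerow(["line_number", "is_heading", "text"])
    for i, line in enumerate(lines):
        # Leading whitespace in `line` is written as-is, so indentation survives.
        writer.writerow([i, "", line])


if __name__ == "__main__":
    example = ["INTRODUCTION", "  This is body text.", "", "METHODS"]
    # newline="" lets the csv module control the line endings itself.
    with open("lines.csv", "w", newline="", encoding="utf-8") as f:
        lines_to_csv(example, f)
```

Whether the spreadsheet program then shows the text in a fixed-width font depends on how you set it up, so that part still has to be done by hand.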

Usually I would advise against insisting that too much context be displayed at once around the decision. You can’t read everything at once — you have to be looking somewhere, and the annotation will usually move faster if a smaller amount of text is displayed at once, in a larger font, with a single decision highlighted.

The other factor to consider is that for almost all text-processing tasks, the relevance of previous context decays exponentially with the distance from the target word. Most models build in this inductive bias to some extent. In spaCy’s default NER system, there’s actually quite a narrow context window: 4 words on either side of the target. So, you won’t be able to learn from arbitrarily distant contexts with the default NER model. You’ll need to swap in your own NER model, likely based on BiLSTM features.

Alternatively, you can also use the ner.manual recipe (see here for the demo) and label longer sequences. A spaCy model can be used to pre-tokenize the text, so that the selection can “snap” to token boundaries. This makes the process more efficient, because your annotators don’t need to do pixel-perfect selection.

Prodigy will always preserve whitespace by default, since this is also very much in line with spaCy’s philosophy. The UI will also show all whitespace tokens (space, tab, newline) as visible symbols (for example, · for a space), so you always know where they are and can explicitly select or unselect them. You can also mark certain tokens as “disabled”, to prevent the annotators from selecting them (for example, line breaks).

@honnibal @ines you are fantastic. Thank you for the ideas.
Matthew’s idea will not work for us, as we actually have to see the whole text to decide how to annotate each line. I understand that relevance decreases with distance, but I want the LSTM to figure out how much to remember.

Ines’s idea about tokenizing by lines sounds great. I will give it a go. Not sure how exactly to do that, but I’ll check the docs.
I’ll share my results.

Thanks again!

Cool, definitely keep us updated! :+1:

This will all be taken care of automatically when you run the ner.manual recipe and pass in a spaCy model. The model will only be used for tokenization here, and if you need to adjust some of the rules to your domain, you can edit the tokenizer however you want, save it out, and use the modified model instead.

prodigy ner.manual your_dataset en_core_web_sm your_data.jsonl --label LABEL_ONE,LABEL_TWO

You can find more details on this in your PRODIGY_README.html, including the API docs for the add_tokens function. Alternatively, you can also load in pre-tokenized data that looks like this:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}
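
For the "tokenize by lines" idea, a rough sketch of how such pre-tokenized tasks could be generated is below. It follows the token format shown above (text, start, end, id) and makes every non-empty line a single token. The helper name `line_tokens_task` is hypothetical, and skipping empty lines is my own assumption rather than something Prodigy requires, so check how the result renders in the UI before relying on it.

```python
import json


def line_tokens_task(text):
    """Build a pre-tokenized task in which every non-empty line is one token."""
    tokens = []
    offset = 0  # character offset of the current line within `text`
    for line in text.split("\n"):
        if line:  # assumption: empty lines get no token of their own
            tokens.append({
                "text": line,
                "start": offset,
                "end": offset + len(line),
                "id": len(tokens),
            })
        offset += len(line) + 1  # +1 for the newline that split() removed
    return {"text": text, "tokens": tokens}


if __name__ == "__main__":
    doc = "CHAPTER ONE\n\nIt was a dark night.\nThe rain fell."
    # One JSON object per document, ready to be saved as a line of a .jsonl file.
    print(json.dumps(line_tokens_task(doc)))
```

With one token per line, a selection can only start and end on line boundaries, which is what lets the annotator mark a whole line as a heading while still seeing the full text.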