Highlighting spans during text classification annotation

I have documents containing multiple dates, and those dates have particular significance in a certain context. For example, if I have a sentence that says “We will start building the house on September 1, 2010” and another sentence that says “We will finish building the house on October 3, 2012” I want to annotate only “September 1, 2010” as being an entity of type STARTING_DATE.

Currently I am framing this as a named entity extraction problem in which I am trying to tag a custom STARTING_DATE entity. I created seed patterns for this that rely on both surface features of the text spans and ENT_TYPE = DATE properties assigned by spaCy’s out-of-the-box named entity detector, and use these to run the ner.teach recipe.
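(For illustration, one such seed pattern might look roughly like this, combining a surface cue with the DATE entity type; the exact tokens are made up:)

{"label": "STARTING_DATE", "pattern": [{"LOWER": "september"}, {"ENT_TYPE": "DATE", "OP": "+"}]}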

It seems like the model I’m training is simultaneously learning to perform two tasks:

  1. identify certain text spans as dates
  2. distinguish those dates that appear in the contexts I care about.

Though my usual inclination is to prefer joint learning over sequential learning, there is a case to be made for breaking these two tasks apart. The seed patterns I’m using already have good precision and recall, so there’s not much point in having my model learn to generalize them. And even if I do want to generalize them, the task of identifying dates (e.g. learning to recognize words like “January” or “October”, or to suspect that a four-digit sequence beginning with “19” or “20” is a year) is a common one, whereas the context-recognition task in (2) is peculiar to me, so there is a lot less training data for it. It might be more effective to recognize candidate dates with pattern matching alone, and then treat my task as a binary text classification of candidate-date-plus-context into either STARTING_DATE or NOT_STARTING_DATE. Essentially, I want to address the date-identification task in (1) with transfer learning via the NER models spaCy already contains.

I can do this in Prodigy by creating a corpus of pattern-detected candidate dates and their contexts and annotating this with the textcat.teach recipe, but this is a little hard on the annotator because the first thing they have to do is skim the text looking for the date. I think it would be really helpful to have that text highlighted.

Is there any way to do text classification annotation with some spans of the text highlighted? I was looking at the custom recipes documentation, but it seems like this might require an annotation interface that doesn’t exist.

In theory this should work – as far as the web application is concerned, if you feed it a classification task with text and entity spans, those will be rendered within the text. (You can test this by running prodigy mark with --view-id classification. Under the hood, the classification interface will check the task and determine how to render it based on the available properties – for example, by default, text plus spans will result in highlighted spans, whereas image plus spans will result in highlighted regions on the image.)
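For example, if a JSONL file already contains tasks with "text" and "spans", something like this should render the highlighted spans in the classification interface (the dataset name and file name are just placeholders):

prodigy mark highlight_test my_data.jsonl --view-id classification --label STARTING_DATE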

However, I just had a look at the way Prodigy currently handles this, and I think there’s a tiny detail that currently makes this more difficult: even if you don’t annotate in “long-text mode” (i.e. let Prodigy pre-select sentences from long documents), the TextCategorizer currently resets a task’s "spans" property. So even if you feed in tasks that include the date entities, it looks like they will be removed when the model scores the stream. There might be a reason we’ve decided to implement it this way – but I’m not 100% sure, so we’ll have to test this.

In the meantime, you could try using your own score function instead of the model’s __call__ method, and preserve the original task’s spans. All you’d have to do here is use spaCy to process the texts using the pipe method. If you set as_tuples=True, you can pass in additional context like the full annotation task. I haven’t tested this yet, but something like this should work:

def score_stream(stream, model, label):
    texts = ((eg['text'], eg) for eg in stream)  # pair each text with its task
    data = model.nlp.pipe(texts, batch_size=32, as_tuples=True)
    for doc, eg in data:
        score = doc.cats[label]          # get score for category
        task = dict(eg)                  # copy annotation task
        task['label'] = label            # add label to task
        task['score'] = score            # add score
        task['priority'] = score         # add priority
        task['meta'] = {'score': score}  # add meta (for UI only)
        yield score, task                # yield (score, task) tuples

The tuples yielded by this function can be passed into one of the sorters, e.g. prefer_uncertain. The model’s update method will only look at the text and ignore the spans, so you should be able to use them for presentational purposes only – i.e. to highlight the entities and make it easier for the annotators. (Note that this will only work if the long-text mode isn’t enabled – otherwise, Prodigy will use the spans to highlight the relevant sentences within the long texts.)
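For example (untested), the sorter import and call could look like this, assuming the score_stream function above and the label used in this thread:

from prodigy.components.sorters import prefer_uncertain

stream = prefer_uncertain(score_stream(stream, model, label='STARTING_DATE'))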

How do I specify “long-text mode” versus other modes? Is this a command-line option, or does it involve changing the custom split-sentences logic in the recipe like you describe here?

This is the --long-text or -L argument on the command line. It's then passed through to the TextClassifier as the long_text keyword argument. It defaults to False, so if you don't set it, long-text mode should be disabled.

An even more basic usage question: how do I feed the web application spans along with text? So far I’ve just been using jsonl files with a “text” field to pass in my corpora. I’m not sure how to add spans to this, or if I have to do something with db-in.

There are several ways you can do this. Ultimately, you need to create data that looks like this:

{
    "text": "This contract shall end on October 3, 2012",
    "spans": [{"start": 27, "end": 42, "label": "DATE"}]
}

Depending on your data, you can either pre-process your JSONL stream: keep only the examples containing entities (found using a model or the pattern matcher), save them out to a new JSONL file, and load that file in with Prodigy. Alternatively, you can write a function that yields annotation tasks.

If you want to use a spaCy model to extract the entities, the function could look like this:

def add_spans_from_model(nlp, stream, label='DATE'):
    for eg in stream:
        doc = nlp(eg['text'])
        for ent in doc.ents:
            if ent.label_ == label:  # ent.label is the hash, ent.label_ is the string
                task = dict(eg)
                task['spans'] = [{'label': label, 'start': ent.start_char, 
                                  'end': ent.end_char}]
                yield task

In addition, you can also extract the entities based on your patterns – see the source of ner.match for an example of using the PatternMatcher model. The code you need should be as simple as:

import spacy
from prodigy.models.matcher import PatternMatcher

model = PatternMatcher(spacy.load(spacy_model)).from_disk(patterns)
stream = (eg for score, eg in model(stream))
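Either way, you could then save the pre-processed stream out to a new JSONL file and pass that file to textcat.teach as the source. Here’s a rough sketch using the add_spans_from_model helper from above (the file names and model are just examples; srsly ships with spaCy):

import spacy
import srsly

nlp = spacy.load('en_core_web_sm')
stream = srsly.read_jsonl('my_data.jsonl')  # raw {"text": ...} examples
# write out enriched tasks that already carry their "spans"
srsly.write_jsonl('dates_with_spans.jsonl', add_spans_from_model(nlp, stream))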

If you want an even more advanced solution, you could also create a helper recipe or Python script that takes a stream, modifies it to add the entity spans and prints the individual tasks to the command line. You can then pipe this output forward to the next recipe, e.g. textcat.teach. If no source argument is specified, it will default to sys.stdin. This means you can pipe the output of one script forward to a recipe:

prodigy add-spans my_data.jsonl | prodigy textcat.teach my_dataset en_core_web_sm --label STARTING_DATE

The helper recipe itself could look something like this (when running it, you’d also point Prodigy to the file it lives in via the -F argument):

import json
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('add-spans')
def add_spans(file_path):
    stream = JSONL(file_path)
    stream = add_spans_to_stream(stream)  # some function that adds spans to each task
    for task in stream:
        print(json.dumps(task))  # dump and print the JSON task

This might be a bit much considering you’re still in the exploration phase, but it could be quite cool once you’ve figured out your ideal workflow and want it to run even more smoothly :blush:

I want to highlight the keywords while having a model select the most uncertain instances. I don't want to use the pattern matcher because all my sentences contain the keywords. I thought I could use the solution provided in this thread, but it didn't work. My Prodigy version is 1.10.4.

I used the textcat.teach recipe and added the spans to my source JSONL file. I found that the "spans" attribute was replaced by an empty array when the JSONL was loaded. Here are more details:

Here is my Prodigy command line:
prodigy textcat.teach test_db2 blank:en /mm/test.jsonl --label mm

Here is an example from the source file:

{"answer": "reject", "meta": {"event_id": "6", "pattern": "33"}, "text": "EVENTS SCHEDULED FOR THE DAY (GMT) 0700 DE Gfk Consumer sentiment for Mar: Expected -14.3; Prior -15.6 0745 FR Consumer confidence for Feb: Expected 92; Prior 92 1000 EZ Money-M3 annual growth for Jan: Expected 12.5%; Prior 12.3% 1000 EZ Economic sentiment for Feb: Expected 92.0; Prior 91.5 1000 EZ Industrial sentiment for Feb: Expected -5.0; Prior -5.9 1000 EZ Services sentiment for Feb: Expected -18.1; Prior -17.8 1000 EZ Consumer confid.", "_input_hash": -1753224771, "_task_hash": -1898714937, "spans": [{"text": "EZ Money", "start": 167, "end": 175, "pattern": -1188243936}], "label": "mm", "_session_id": "test_db2-fhou", "_view_id": "classification"}

However, in the Prodigy log I noticed the spans array became empty for this instance, and there is no highlighting in the UI.

The textcat.teach recipe uses the "spans" itself to highlight pattern matches, so that information will be reset after your data is loaded. You could use a custom version of the recipe that adds them back at the end – e.g., you can provide your spans in the data as "orig_spans" or something similar, and then write them to "spans" right at the end of the recipe function, before the stream goes out.
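Here’s a rough, untested sketch of what that could look like, wrapping the built-in recipe (the recipe name and the "orig_spans" field are just examples):

import prodigy
from prodigy.recipes.textcat import teach

@prodigy.recipe('textcat.teach-highlight')
def teach_highlight(dataset, spacy_model, source, label):
    # reuse the built-in textcat.teach recipe and post-process its stream
    components = teach(dataset, spacy_model, source, label=label.split(','))  # the recipe expects a list of labels

    def restore_spans(stream):
        for eg in stream:
            if eg.get('orig_spans'):
                eg['spans'] = eg['orig_spans']  # put the original highlights back
            yield eg

    components['stream'] = restore_spans(components['stream'])
    return components

You'd then run it just like textcat.teach, pointing Prodigy to the file containing this recipe with the -F argument.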

That makes sense. I'll try the custom recipe. Thank you very much for the quick reply.