I have documents containing multiple dates, and those dates have particular significance in a certain context. For example, if I have a sentence that says “We will start building the house on September 1, 2010” and another sentence that says “We will finish building the house on October 3, 2012”, I want to annotate only “September 1, 2010” as being an entity of type STARTING_DATE.
Currently I am framing this as a named entity extraction problem in which I am trying to tag a custom STARTING_DATE entity. I created seed patterns for this that rely on both surface features of the text spans and the ENT_TYPE = DATE properties assigned by spaCy’s out-of-the-box named entity recognizer, and I use these to run the ner.teach recipe.
It seems like the model I’m training is simultaneously learning to perform two tasks:
1. identify certain text spans as dates
2. distinguish those dates that appear in the contexts I care about.
Though my usual inclination is to prefer joint learning over sequential learning, there is a case to be made for breaking these two tasks apart. The seed patterns I’m using already have good precision and recall, so there’s not much point in having my model learn to generalize them. And even if I do want to generalize them, the task of identifying dates (e.g. learning to recognize words like “January” or “October”, or to suspect that a sequence of four digits beginning with “19” or “20” is a year) is a common one, whereas the context recognition task in (2) is peculiar to me, so there is a lot less training data for it. It might be more effective to recognize candidate dates with pattern matching alone, and then treat my task as a binary text classification of candidate-date-plus-context into either STARTING_DATE or NOT_STARTING_DATE. Essentially I want to address the date identification task in (1) with transfer learning via the NER models spaCy already contains.
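To make the two-step idea concrete, here is a minimal sketch of the candidate-detection step. It uses a plain regular expression as a stand-in for my real seed patterns and for spaCy’s DATE entities; the names DATE_RE and candidate_dates are made up for illustration, and the pattern only covers dates written like “September 1, 2010”.

```python
import re

# Illustrative stand-in for the seed patterns: "<Month> <day>, <year>"
# with a four-digit year beginning with 19 or 20.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, (?:19|20)\d{{2}}\b")


def candidate_dates(text):
    """Yield one candidate per matched date, keeping the full text as context."""
    for match in DATE_RE.finditer(text):
        yield {
            "text": text,            # context the classifier would see
            "start": match.start(),  # character offsets of the candidate date
            "end": match.end(),
            "date": match.group(),
        }


examples = [
    "We will start building the house on September 1, 2010.",
    "We will finish building the house on October 3, 2012.",
]

# Every date becomes a candidate; deciding which ones are STARTING_DATE
# is left to the separate binary classification step.
candidates = [c for text in examples for c in candidate_dates(text)]
```

Both example sentences produce a candidate here, which is the point: the pattern step only finds dates, and the classifier decides which of them are STARTING_DATE.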
I can do this in Prodigy by creating a corpus of pattern-detected candidate dates and their contexts and annotating this with the textcat.teach recipe, but this is a little hard on the annotator because the first thing they have to do is skim the text looking for the date. I think it would be really helpful to have that text highlighted.
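For illustration, this is roughly the kind of input I have in mind: one JSON record per candidate, carrying the text, the character offsets of the date, and the category to accept or reject. I am assuming here that a task can hold a "spans" list (with "start", "end" and "label" keys) alongside a top-level "label"; the make_task helper is hypothetical, and whether any interface renders both together is exactly what I am unsure about.

```python
import json


def make_task(text, start, end, label="STARTING_DATE"):
    """Build one classification task with the candidate date marked as a span.

    Assumption: the annotation tool accepts "spans" next to a top-level
    "label"; this has not been checked against a running Prodigy instance.
    """
    return {
        "text": text,
        "spans": [{"start": start, "end": end, "label": "DATE"}],
        "label": label,  # the category the annotator accepts or rejects
    }


text = "We will start building the house on September 1, 2010."
date = "September 1, 2010"
start = text.index(date)

task = make_task(text, start, start + len(date))

# One JSON object per line (JSONL), ready to be written to a file.
line = json.dumps(task)
```

The annotator would then only need to answer yes or no to “is the highlighted date a STARTING_DATE?”, without hunting for the date first.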
Is there any way to do text classification annotation with some spans of the text highlighted? I was looking at the custom recipes documentation, but it seems like this might require an annotation interface that doesn’t exist.