I am training a model to identify the start date of a contract. Should I highlight the text before and after it, so that the model learns to recognize the key context words? For example:
You should only highlight the entity, not the context. The model will read the surrounding text, so you don't have to mark it. What you're telling the model when you highlight is, "Predict this span as an entity". You want to make sure your annotations are consistent, so that the model doesn't get confused.
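To make that concrete, here is a minimal sketch of what an entity-only annotation could look like as data. The sentence and the START_DATE label are made up for illustration, and the record layout (text plus character-offset spans) is an assumption about a typical NER annotation format rather than a required schema.

```python
# Hypothetical example sentence; not taken from a real contract.
text = "This lease shall begin on February 1, 2012, and continue until terminated."
entity = "February 1, 2012"

start = text.find(entity)   # character offset where the entity starts
end = start + len(entity)   # exclusive end offset

# Assumed record layout: the text plus a list of labelled character spans.
annotation = {
    "text": text,
    "spans": [{"start": start, "end": end, "label": "START_DATE"}],
}

# The span covers only the date; "shall begin on" stays unannotated context.
highlighted = text[start:end]
print(highlighted)
```

The context words are still in "text", so the model sees them; they just aren't part of the span it is asked to predict.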
Here's a quick summary of how the model works: we first read the text and come up with a meaning representation for each token in the text (in technical terms: we apply a convolutional neural network to calculate token vectors). We then go over the words left-to-right, and decide whether to start a new entity. If an entity has begun, we decide whether to continue it, based on the current word, the first word of the current entity, and the last word of the current entity.
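As a rough illustration of that left-to-right loop, here is a toy sketch. It is not the real implementation: the two decision functions are hypothetical hand-written rules standing in for what the model would learn from the token vectors, and they only show which pieces of information (current word, first word of the entity, last word of the entity) each decision gets to look at.

```python
from typing import List, Optional, Tuple

MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}

def should_start(word: str) -> bool:
    # Hypothetical stand-in for the learned "start a new entity?" decision.
    return word in MONTHS

def should_continue(current: str, first: str, last: str) -> bool:
    # Hypothetical stand-in for the learned "continue the entity?" decision.
    # It sees only the current word and the entity's first and last words.
    if first not in MONTHS:
        return False
    if current.isdigit():
        return True
    # A comma only belongs to the date if it follows the day, as in "1, 2012".
    return current == "," and last.isdigit() and len(last) <= 2

def find_entities(tokens: List[str]) -> List[Tuple[int, int]]:
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, word in enumerate(tokens):
        if start is not None:
            if should_continue(word, tokens[start], tokens[i - 1]):
                continue
            spans.append((start, i))  # close the entity before this word
            start = None
        if should_start(word):
            start = i
    if start is not None:
        spans.append((start, len(tokens)))
    return spans

tokens = ["shall", "begin", "on", "February", "1", ",", "2012", ",", "and", "continue"]
entities = find_entities(tokens)
print([tokens[s:e] for s, e in entities])
```

In the real model the surrounding context still matters, because each token's vector is computed from its neighbours before these decisions are made.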
OK, great; and thanks for the reply!
Do you believe that ner.manual is the best recipe for us?
Again, our goal is eventually to extract the start/end dates, payment sums, lessor/lessee names, etc., from unseen documents.
I think that by "current word" you mean the right-most, as-yet-unclassified word.
You only need information about the first and last words of the current entity? What about the middle words? Are they not important?
Often an entity appears both in an obvious context and, elsewhere in the same document, without one. For instance, the start date in a contract sometimes appears in a suggestive context, such as "...shall begin on February 1, 2012, and continue until..."; but other times that same date will appear in the same document with a less obvious context, such as "including any hook-up charges as of February 1, 2012". Should I highlight the dates in both contexts?
Here is an example of the start date out of obvious context.
Do you think we should use rule-based matching?
I'm sorry, but we can only give a limited amount of project-specific advice. We do try to point people in the right direction, but at the end of the day each project will be different.
I do think the second of the two images you posted looks more correct, and it's possible you should use rule-based matching, but ultimately it's up to you.
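If you do want to experiment with rules, here is a minimal sketch of the idea using only Python's standard re module as a stand-in for a proper rule-based matcher. The pattern only covers dates written like "February 1, 2012", and the sample text is a made-up composite of the two phrasings quoted earlier in the thread.

```python
import re
from typing import List, Tuple

MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
# Matches "Month D, YYYY" or "Month DD, YYYY"; other date styles are not covered.
DATE_PATTERN = re.compile(rf"\b{MONTH} \d{{1,2}}, \d{{4}}\b")

def find_dates(text: str) -> List[Tuple[int, int, str]]:
    # Return (start, end, matched text) for every date-like string.
    return [(m.start(), m.end(), m.group()) for m in DATE_PATTERN.finditer(text)]

# Hypothetical sample text combining the two contexts from the question.
doc = ("The term shall begin on February 1, 2012, and continue until terminated, "
       "including any hook-up charges as of February 1, 2012.")
matches = find_dates(doc)
for start, end, value in matches:
    print(start, end, value)
```

Note that a rule like this finds the date in both contexts but cannot tell which occurrence is the start date; deciding that still needs either extra rules over the surrounding words or a trained model.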