NER - Should I include common prefixes in labeled entities?

Hi, I am trying to recognize entities in a set of OCR texts from images of documents. Since the text is commonly in the form some_label: value in the document, it comes up often (but not always) in the OCR text as well.

My question is: say I am trying to annotate dates in my OCR text files, and 80% of the time the date is in the format Date: xx/xx/xxxx. Would it be better if I ...

  1. Only marked xx/xx/xxxx as my date entity
    • Represents the true entity
    • Would be representative of 100% of the data
  2. OR marked the entire Date: xx/xx/xxxx as my date entity
    • Would take advantage of the commonly occurring Date: prefix for better accuracy?

Another example:

Amounts are commonly represented as $xxxxxx5.37 and $ 63.75

  1. Choose 5.37 and 63.75
  2. Choose $xxxxxx5.37 and $ 63.75 (taking advantage of $ sign)

Which of these would be the better practice to follow / lead to a better model?

(P.S.: Also asked on Stack Overflow.)

I think if it's that simple a transformation, it probably won't matter a great deal. You can consider making a rule-based adjustment from one annotation style to the other to check whether your models get better with either approach.
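A rough sketch of what such a rule-based adjustment could look like, assuming annotations are stored as (start, end, label) character offsets into the text. The helper names `strip_prefix` and `add_prefix` and the `Date:` pattern are only illustrative, not part of any library, so adapt them to your own format:

```python
import re

# Assumed annotation format: (start, end, label) character offsets.
# The "Date:" prefix pattern is an illustrative assumption.
LEADING_PREFIX = re.compile(r"Date:\s*")      # prefix at the start of a span
TRAILING_PREFIX = re.compile(r"Date:\s*\Z")   # prefix right before a span


def strip_prefix(text, span):
    """Convert 'Date: xx/xx/xxxx' style spans to value-only spans."""
    start, end, label = span
    m = LEADING_PREFIX.match(text, start, end)
    if m and m.end() < end:
        return (m.end(), end, label)
    return span  # no prefix inside the span, leave it unchanged


def add_prefix(text, span):
    """Convert value-only spans to spans that include a preceding 'Date:'."""
    start, end, label = span
    m = TRAILING_PREFIX.search(text[:start])
    if m:
        return (m.start(), end, label)
    return span  # no prefix before the span, leave it unchanged


text = "Invoice Date: 12/03/2021 Total"
value_start = text.index("12/03/2021")
value_span = (value_start, value_start + len("12/03/2021"), "DATE")

full_span = add_prefix(text, value_span)
print(text[full_span[0]:full_span[1]])    # Date: 12/03/2021
print(strip_prefix(text, full_span) == value_span)  # True
```

With something like this you can annotate once, generate the other style automatically, train on both versions, and compare evaluation scores on the same held-out documents.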

All else being equal, you might find it slightly more convenient to have your models predicting Date: as not part of the date entity, so you don't have to trim that part off as a post-process when you actually use the output.
