Hi, I am trying to recognize entities in a set of OCR texts from images of documents. Since text in the documents is commonly in the form some_label: value, it often (but not always) comes up that way in the OCR text as well.
My question is: say I am trying to annotate dates in my OCR text files, and 80% of the time the date is in the format Date: xx/xx/xxxx. Would it be better if I ...
- Only marked xx/xx/xxxx as my date entity
  - Represents the true entity
  - Would be representative of 100% of the data
- OR marked the entire Date: xx/xx/xxxx as my date entity
  - Would take advantage of the commonly occurring Date: prefix for better accuracy?
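
To make the two options concrete, here is a rough sketch of the character-offset spans each one would produce. It uses only Python's re module; the sample line, the regex, and the DATE label name are made up for illustration and are not from my real data.

```python
import re

# Made-up OCR line; the field layout is an assumption for illustration.
text = "Invoice No: 1042 Date: 03/12/2020 Total due"

# Optional "Date:" prefix, followed by the date value in a named group.
pattern = re.compile(r"(?:Date:\s*)?(?P<value>\d{2}/\d{2}/\d{4})")
m = pattern.search(text)

# Option 1: label only the value.
value_span = (m.start("value"), m.end("value"), "DATE")

# Option 2: label the prefix and the value together.
full_span = (m.start(), m.end(), "DATE")

print(text[value_span[0]:value_span[1]])
print(text[full_span[0]:full_span[1]])
```

Both spans end at the same offset; they differ only in whether the start offset covers the prefix.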
Another example:
Amounts are commonly represented as $xxxxxx5.37 and $ 63.75
- Choose 5.37 and 63.75
- Choose $xxxxxx5.37 and $ 63.75 (taking advantage of the $ sign)
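
The same two choices for amounts, again as a hedged sketch: the helper name amount_spans and the regex are hypothetical, and I am assuming the x characters are literal filler between the $ sign and the digits.

```python
import re

# "$", then optional filler ("x" characters or whitespace), then the numeric value.
# Treating "x" as literal filler is an assumption about the OCR output.
AMOUNT = re.compile(r"\$[x\s]*(?P<value>\d+\.\d{2})")

def amount_spans(line):
    """Return ((start, end) of value only, (start, end) including "$"), or None."""
    m = AMOUNT.search(line)
    if m is None:
        return None
    return (m.start("value"), m.end("value")), (m.start(), m.end())

for line in ["Total $xxxxxx5.37", "Paid $ 63.75"]:
    value_only, with_prefix = amount_spans(line)
    print(line[value_only[0]:value_only[1]], "|", line[with_prefix[0]:with_prefix[1]])
```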
Which of these would be the better practice to follow / lead to a better model?
(P.S.: Also asked on Stack Overflow: https://stackoverflow.com/questions/65134056/ner-should-i-include-common-prefixes-in-labeled-entities)