NER - Should I include common prefixes in labeled entities

Hi, I am trying to recognize entities in a set of OCR texts from images of documents. Since the text in the documents commonly takes the form some_label: value, that pattern often (but not always) appears in the OCR text as well.

My question is, say I am trying to annotate dates in my OCR text files, and 80% of the time the date is in the format Date: xx/xx/xxxx; would it be better if I ...

  1. Only marked xx/xx/xxxx as my date entity
    • Represents the true entity
    • Would be representative of 100% of the data
  2. OR marked the entire Date: xx/xx/xxxx as my date entity
    • Would take advantage of the commonly occurring Date: prefix for better accuracy?

Another example:

Amounts are commonly represented as $xxxxxx5.37 and $ 63.75

  1. Choose 5.37 and 63.75
  2. Choose $xxxxxx5.37 and $ 63.75 (taking advantage of $ sign)

Which of these would be the better practice to follow / lead to a better model?
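To make the two annotation styles concrete, here is a minimal sketch of how the options differ as character-offset annotations, using the spaCy-style (text, {"entities": [(start, end, label)]}) training format; the offsets and the DATE label are illustrative:

```python
# The same OCR line annotated two ways, as (text, annotations) pairs.
text = "Date: 12/03/2020"

# Option 1: mark only the value as the entity.
option_1 = (text, {"entities": [(6, 16, "DATE")]})  # span text: "12/03/2020"

# Option 2: include the "Date: " prefix in the entity span.
option_2 = (text, {"entities": [(0, 16, "DATE")]})  # span text: "Date: 12/03/2020"

for _, ann in (option_1, option_2):
    start, end, label = ann["entities"][0]
    print(label, repr(text[start:end]))
```

The model sees exactly the same input text either way; the only difference is where the gold span boundaries fall.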

(P.S.: Also asked on Stack Overflow: https://stackoverflow.com/questions/65134056/ner-should-i-include-common-prefixes-in-labeled-entities)

I think if it's that simple a transformation, it probably won't matter a great deal. You can consider making a rule-based adjustment from one annotation style to the other to check whether your models get better with either approach.
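A rule-based adjustment like that can be as simple as shifting each span's start offset past a known prefix. A sketch, assuming spaCy-style (start, end, label) character offsets and a hypothetical prefix list you would adapt to your documents:

```python
import re

# Hypothetical label prefixes that may precede an entity in the OCR text.
PREFIXES = re.compile(r"^(Date:|Amount:|\$)\s*")

def strip_prefix(text, start, end, label):
    """Shift a span's start past a known prefix, if one is present."""
    match = PREFIXES.match(text[start:end])
    if match:
        start += match.end()
    return start, end, label

# "Date: 12/03/2020" annotated with the prefix included (style 2)
# becomes a value-only span (style 1).
print(strip_prefix("Date: 12/03/2020", 0, 16, "DATE"))  # (6, 16, 'DATE')
print(strip_prefix("$ 63.75", 0, 7, "MONEY"))           # (2, 7, 'MONEY')
```

Running the same training and evaluation with both versions of the annotations is a cheap way to see whether the prefix actually helps the model.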

All else being equal, you might find it slightly more convenient to have your model predict Date: as not part of the date entity, so you don't have to trim it off as a post-process when you actually use the output.