I trained a custom NER model using Prodigy and spaCy for date spans. However, the dates predicted by the custom model come in various formats (of course, so does the training set). Is there a better way to consolidate the dates into a consistent format? Thanks.
I hope I understand your question correctly – but it sounds like you're looking for something like date normalization?
For example, there's dateutil: https://pypi.org/project/python-dateutil/ If you wanted this to be more elegant, you could integrate it as a Span extension attribute in spaCy so you can access it directly on the span.
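A minimal sketch of that idea, assuming python-dateutil and spaCy are installed. The attribute name `iso_date` and the helper `normalize_date` are made up for illustration, and the span is built by hand here; with a trained model it would come from `doc.ents` instead.

```python
from typing import Optional

import spacy
from dateutil import parser as date_parser
from spacy.tokens import Span


def normalize_date(span: Span) -> Optional[str]:
    """Return the span text as an ISO 8601 date, or None if it cannot be parsed."""
    try:
        return date_parser.parse(span.text).date().isoformat()
    except (ValueError, OverflowError):
        # dateutil raises ParserError (a ValueError subclass) on unknown formats.
        return None


# "iso_date" is a hypothetical attribute name chosen for this sketch.
Span.set_extension("iso_date", getter=normalize_date, force=True)

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no trained NER
text = "The contract was signed on March 5, 2021 in Berlin."
doc = nlp(text)

# Stand-in for a span predicted by the custom NER model.
start = text.index("March")
end = text.index(" in Berlin")
span = doc.char_span(start, end, label="DATE")

print(span.text, "->", span._.iso_date)
```

Note that `dateutil.parser.parse` fills in missing fields from a default date and guesses the field order for inputs like "10/12"; its `dayfirst` and `yearfirst` arguments let you state the convention you expect.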
Some of the work I did also required interpreting date, date spans, and durations.
We found that NER was good at spotting those and labelling them (e.g. start-date vs end-date in spans, or duration vs. date), but hit two issues.
The tokenization is not consistent between spaCy versions. In particular, from one version to another, spans like "Jan-Mar 2003" would yield either 2 or 3 tokens (i.e. "Jan-Mar" as a single token or not). Depending on the language, the result would not be the same either. That defeats the purpose of trying to distinguish between start and end dates.
Like you, we found the outputs hard to interpret.
Time normalization, we felt, is not suited for the task (we are not so much trying to do abstract math on dates; we are trying to make sense of "2012-08" or "10/12", which is inherently and unavoidably ambiguous), so we basically used good old semantic rule engines.
No A.I. involved: we hard-code dictionaries of values (names of months, long or abbreviated, ...), use those to label tokens by what they could possibly be (e.g. day of week, day in month, month, year), and select the best possible interpretation (e.g. a "day year month" series of tokens makes no sense; and/or if two dates are next to each other, they should follow the same pattern; and/or we favor interpretations that are coherent across the whole document, ...).
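To make the approach concrete, here is a rough sketch of such a rule engine using only the Python standard library. It is not the engine described above: the role names, the list of plausible patterns, and the functions `candidate_roles`, `interpretations`, and `resolve_document` are all assumptions made for illustration, and day-of-week handling is left out.

```python
import re
from collections import Counter
from itertools import product

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
# Hard-coded dictionary: long and abbreviated month names -> month number.
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

# Role orderings this sketch accepts; e.g. "day year month" is rejected.
VALID_PATTERNS = {
    ("day", "month", "year"), ("month", "day", "year"), ("year", "month", "day"),
    ("month", "year"), ("year", "month"), ("day", "month"), ("month", "day"),
}


def candidate_roles(token):
    """Label a token with every (role, value) it could possibly be."""
    lowered = token.lower()
    if lowered in MONTHS:
        return [("month", MONTHS[lowered])]
    if not (token.isascii() and token.isdigit()):
        return []
    number = int(token)
    if len(token) == 4:
        return [("year", number)]
    roles = []
    if 1 <= number <= 12:
        roles.append(("month", number))
    if 1 <= number <= 31:
        roles.append(("day", number))
    return roles


def pattern_of(combo):
    return tuple(role for role, _ in combo)


def interpretations(text):
    """Return every role assignment whose ordering is a plausible pattern."""
    tokens = [t for t in re.split(r"[\s/.,-]+", text) if t]
    options = [candidate_roles(t) for t in tokens]
    return [c for c in product(*options) if pattern_of(c) in VALID_PATTERNS]


def resolve_document(texts):
    """Prefer, for ambiguous dates, the pattern used by unambiguous ones."""
    parsed = [interpretations(t) for t in texts]
    votes = Counter(pattern_of(c[0]) for c in parsed if len(c) == 1)
    resolved = []
    for candidates in parsed:
        if not candidates:
            resolved.append(None)
        else:
            best = max(candidates, key=lambda c: votes[pattern_of(c)])
            resolved.append(dict(best))
    return resolved


print(interpretations("2012-08"))           # one reading: year, month
print(interpretations("10/12"))             # two readings: ambiguous
print(resolve_document(["10/12", "25/12"]))  # "25/12" settles day-first
```

In this sketch "10/12" on its own stays ambiguous, while the presence of "25/12" in the same document (where 25 can only be a day) tips the vote toward day-month for both.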
This is not Prodigy-related, though. That's... you know... good old software development.