Hello,
I need to hold the "memory" of the previous sentences.
As per your suggestion, I split my long documents into paragraphs, simply looking at the \n character.
At the moment I need to tag two custom entities that have the same structure as a date (dd-mm-yyyy), but in a specific context I use the EXP_START and EXP_END labels instead of DATE.
When I train the model on long documents it works well; I have also changed conv_depth to 8.
However, to speed up training I will split the documents as described above.
The problem is that EXP_START and EXP_END can end up separated from their context by my Sentencizer. For example:
....word1 word2 word3 \n
from 01-01-2010 to 01-01-2012 \n
word1 word2 word3...
As you can see, if I train the model on the sentence "from 01-01-2010 to 01-01-2012" alone, it will never be able to tell whether these are "simple" DATE entities or an EXP_START / EXP_END. It has no context.
So my question is: how can I hold the memory of the previous sentence, so that the model understands the context better (for very short sentences)?
In this (silly) example, I would need the preceding word1 word2 word3 to understand whether 01-01-2010 is an EXP_START, etc.
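For concreteness, here is a minimal sketch of the \n-based split described above (the sample text and the splitting code are just illustrative assumptions); it shows how a short line such as "from 01-01-2010 to 01-01-2012" becomes a standalone example with no surrounding context:

```python
# Illustrative only: split a long document on "\n" into "paragraphs",
# each of which becomes an independent training example.

document = (
    "....word1 word2 word3\n"
    "from 01-01-2010 to 01-01-2012\n"
    "word1 word2 word3..."
)

paragraphs = [p.strip() for p in document.split("\n") if p.strip()]

for p in paragraphs:
    print(repr(p))

# 'from 01-01-2010 to 01-01-2012' is now its own example:
# nothing tells the model whether its dates should be DATE
# or EXP_START / EXP_END.
```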
I would probably advise against this. Would the following approach work?
1. Mark everything as DATE.
2. Use the text classifier to recognize whether a region is one where the EXP range applies.
3. Parse the dates with a rule-based process, and mark which one is earlier and which one is later.
Neural networks are able to learn mathematical relations between quantities, but the model will always be somewhat bad at this: you can always get some new word between the dates that disrupts the classification and causes an unexpected result.
If you know you have pairs of dates and you want to mark which one's the start and which one's the end, that should be very easy to program. Use normal logic for that part, and focus the machine learning on the aspects that are ambiguous: identifying the dates, and figuring out which regions of text express the relation you're interested in.
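As a rough sketch of that rule-based step (the regex, the helper name, and the label names below are assumptions for illustration, not part of any library): once the two dates in a region flagged as an EXP range are found, a plain datetime comparison decides which is the start and which is the end.

```python
import re
from datetime import datetime

# Assumed dd-mm-yyyy format, as in the examples above.
DATE_RE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")

def label_exp_range(text):
    """Illustrative helper: given a span the classifier has flagged as an
    EXP region, find its two dates and label the earlier one EXP_START
    and the later one EXP_END."""
    dates = [(m.group(), m.start(), m.end()) for m in DATE_RE.finditer(text)]
    if len(dates) != 2:
        return []  # not a clean pair; leave it to other rules / the NER model
    parsed = [(datetime.strptime(d, "%d-%m-%Y"), s, e) for d, s, e in dates]
    parsed.sort(key=lambda t: t[0])
    earlier, later = parsed
    return [
        (earlier[1], earlier[2], "EXP_START"),
        (later[1], later[2], "EXP_END"),
    ]

print(label_exp_range("from 01-01-2010 to 01-01-2012"))
# [(5, 15, 'EXP_START'), (19, 29, 'EXP_END')]
```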
Thank you @honnibal
I will try your approach. What about the problem I mentioned regarding short sentences?
Example:
from 01-01-2010 to 01-01-2012 \n
The model will recognize the DATE, but without context it will be difficult to improve the NER model, no? Maybe not for DATE, since they share more or less the same format, but for other classes it will be hard. What do you think?
Regarding the "region" you mentioned, what exactly do you mean? Which segment of text should I use to train the text classifier for the EXP label? The possibilities are:
1. The N tokens before/after the DATE (see the sketch below)
2. The current sentence (but for very short sentences that will be hard, I think)
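One way to build the classifier examples from option 1, as a hedged sketch (the window size n, the en_core_web_sm pipeline, and the function name are placeholders; in practice the user's own NER model that tags DATE would be loaded instead):

```python
import spacy

# Placeholder pipeline: assumes a model that already tags DATE entities.
nlp = spacy.load("en_core_web_sm")

def date_windows(text, n=10):
    """Yield a +/- n-token window around each DATE entity; each window
    would become the text of a textcat example (EXP region vs. not)."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ != "DATE":
            continue
        start = max(ent.start - n, 0)
        end = min(ent.end + n, len(doc))
        yield doc[start:end].text

for window in date_windows(
    "The contract ran from 01-01-2010 to 01-01-2012 at the company."
):
    print(window)
```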