Hello,
I need to hold the "memory" of the previous sentences.
As per your suggestion, I split my long documents into paragraphs, simply looking at the \n character.
At the moment I need to tag two custom entities that have the same structure as a date (dd-mm-yyyy), but in a specific context I use the EXP_START and EXP_END labels instead of DATE.
When I train the model on long documents it works well; I have also changed conv_depth to 8.
However, to speed up training I will split the documents as described above.
The problem is that EXP_START and EXP_END can end up separated from their context by my Sentencizer. For example:
....word1 word2 word3 \n
from 01-01-2010 to 01-01-2012 \n
word1 word2 word3...
As you can see, if I train the model on the sentence "from 01-01-2010 to 01-01-2012" alone, it will never be able to tell whether these are "simple" DATE entities or an EXP_START / EXP_END. It has no context.
So my question is: how can I hold the memory of the previous sentence, so that the model understands the context better (for very short sentences)?
In this (silly) example, I would need the preceding word1 word2 word3 to understand whether 01-01-2010 is an EXP_START, etc.
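For concreteness, here is a minimal sketch of the \n-based split described above (the sample text and the splitting code are just illustrative assumptions); it shows how a short line such as "from 01-01-2010 to 01-01-2012" becomes a standalone example with no surrounding context:

```python
# Illustrative only: split a long document on "\n" into "paragraphs",
# each of which becomes an independent training example.

document = (
    "....word1 word2 word3\n"
    "from 01-01-2010 to 01-01-2012\n"
    "word1 word2 word3..."
)

paragraphs = [p.strip() for p in document.split("\n") if p.strip()]

for p in paragraphs:
    print(repr(p))

# 'from 01-01-2010 to 01-01-2012' is now its own example:
# nothing tells the model whether its dates should be DATE
# or EXP_START / EXP_END.
```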
I would probably advise against this. Would the following approach work?
1. Mark everything as DATE.
2. Use the text classifier to recognize whether a region is one where the EXP range applies.
3. Parse the dates with a rule-based process, and mark which one is earlier and which one is later.
Neural networks are able to learn mathematical relations between quantities, but the model will always be somewhat bad at this: you can always get some new word between the dates that disrupts the classification and causes an unexpected result.
If you know you have pairs of dates and you want to mark which one's the start and which one's the end, that should be very easy to program. Use normal logic for that part, and focus the machine learning on the aspects that are ambiguous: identifying the dates, and figuring out which regions of text express the relation you're interested in.
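As a rough sketch of that rule-based step (the regex, the helper name, and the label names below are assumptions for illustration, not part of any library): once the two dates in a region flagged as an EXP range are found, a plain datetime comparison decides which is the start and which is the end.

```python
import re
from datetime import datetime

# Assumed dd-mm-yyyy format, as in the examples above.
DATE_RE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")

def label_exp_range(text):
    """Illustrative helper: given a span the classifier has flagged as an
    EXP region, find its two dates and label the earlier one EXP_START
    and the later one EXP_END."""
    dates = [(m.group(), m.start(), m.end()) for m in DATE_RE.finditer(text)]
    if len(dates) != 2:
        return []  # not a clean pair; leave it to other rules / the NER model
    parsed = [(datetime.strptime(d, "%d-%m-%Y"), s, e) for d, s, e in dates]
    parsed.sort(key=lambda t: t[0])
    earlier, later = parsed
    return [
        (earlier[1], earlier[2], "EXP_START"),
        (later[1], later[2], "EXP_END"),
    ]

print(label_exp_range("from 01-01-2010 to 01-01-2012"))
# [(5, 15, 'EXP_START'), (19, 29, 'EXP_END')]
```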
Thank you @honnibal
I will try your approach. What about the problem I mentioned regarding short sentences?
Example:
from 01-01-2010 to 01-01-2012 \n
The model will recognize the DATE, but without context it will be difficult to improve the NER model, no? Maybe not for DATE, since they share more or less the same format, but for other classes it will be hard. What do you think?
Regarding the "region" you mentioned, what exactly do you mean? Which segment of text should I use to train the text classifier for the EXP label? The possibilities are:
1. The N tokens before/after the DATE (see the sketch below)
2. The current sentence (but for very short sentences that will be hard, I think)
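One way to build the classifier examples from option 1, as a hedged sketch (the window size n, the en_core_web_sm pipeline, and the function name are placeholders; in practice the user's own NER model that tags DATE would be loaded instead):

```python
import spacy

# Placeholder pipeline: assumes a model that already tags DATE entities.
nlp = spacy.load("en_core_web_sm")

def date_windows(text, n=10):
    """Yield a +/- n-token window around each DATE entity; each window
    would become the text of a textcat example (EXP region vs. not)."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ != "DATE":
            continue
        start = max(ent.start - n, 0)
        end = min(ent.end + n, len(doc))
        yield doc[start:end].text

for window in date_windows(
    "The contract ran from 01-01-2010 to 01-01-2012 at the company."
):
    print(window)
```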