Soliciting some help! My task is to identify text spans that may range from one sentence to 3-5 consecutive sentences. I started with span classification, and it didn't perform well. I then approached the problem as sentence classification: I split the entire text into sentences with corresponding labels (i.e., some sentences were labelled and some were not). That worked well. However, the model lacks global context, which could certainly improve performance.
This is similar to a standard NER task: some tokens are annotated and some are not, and using a larger context (e.g., the whole sentence) helps.
For example: ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]). Using only "Tokyo Tower" without the rest may be suboptimal.
So I am wondering whether the sentence classification task may somehow be adapted into a NER-type approach, where instead of labelled tokens we use labelled sentences (i.e., another granularity), in the hope that this will improve the model's performance. Does that make sense?
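For concreteness, here is a toy sketch of the data format I have in mind, with BIO-style tags at sentence granularity; the text and label names are invented for illustration:

```python
# Sentence-level BIO labels, analogous to token-level NER labels.
# Text and label names are made up for illustration.
sentences = [
    ("The market opened flat.",              "O"),
    ("Shares of Acme rose after earnings.",  "B-NEWS"),
    ("Analysts raised their price targets.", "I-NEWS"),
    ("The weather in Tokyo was sunny.",      "O"),
]
```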
From your post, I understand that you're broadly keen to classify sentences while taking advantage of a larger context window, i.e. the preceding and succeeding sentences.
Long story short, there’s nothing out-of-the-box that would solve this for you. This is because the text classification architectures take a Doc object (however you define this) as input to produce a score for the label classes, i.e. the whole context is the Doc object. Meanwhile, the SpanCategorizer tends to struggle with spans that are many sentences long, so from the sounds of it, text classification is more appropriate.
Have you thought about “chunking” your input data, running text classification at the chunk level, and then doing any necessary post-processing of the chunks? This potential solution is highly dependent on the nature of your data, but it might be an avenue worth exploring; see the sketch below. You could also play around with the different text classification architectures to see which performs best on the chunks.
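A minimal sketch of the chunking idea, assuming `en_core_web_sm` is installed; the chunk size, stride, and post-processing are all things you'd tune to your data:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets sentence boundaries

def make_chunks(text: str, size: int = 3, stride: int = 1) -> list[str]:
    """Overlapping windows of `size` consecutive sentences."""
    sents = [s.text for s in nlp(text).sents]
    if len(sents) <= size:
        return [" ".join(sents)]
    return [" ".join(sents[i:i + size])
            for i in range(0, len(sents) - size + 1, stride)]

text = ("The market opened flat. Shares of Acme rose after earnings. "
        "Analysts raised their targets. The weather was sunny.")
for chunk in make_chunks(text):
    print(chunk)
    # doc = trained_textcat(chunk)  # placeholder for your trained textcat pipeline
    # ...then e.g. give each sentence the majority label of its chunks
```

Because the windows overlap, each sentence gets scored in several chunks, which gives you something to aggregate over during post-processing.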
Otherwise, you would have to define your own model architecture that takes the preceding and following sentences as features.
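You don't need a full spaCy/Thinc architecture just to test whether that idea helps. As a stand-in, here's a quick scikit-learn sketch (invented data) where each sentence's features include its neighbours:

```python
# Not a spaCy architecture -- just a cheap way to test whether neighbouring
# sentences help as features. Data and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["The market opened flat.",
             "Shares of Acme rose after earnings.",
             "Analysts raised their targets.",
             "The weather was sunny."]
labels = ["O", "NEWS", "NEWS", "O"]

def with_context(sents: list[str], window: int = 1) -> list[str]:
    """Concatenate each sentence with its neighbours as a crude context feature."""
    out = []
    for i in range(len(sents)):
        lo, hi = max(0, i - window), min(len(sents), i + window + 1)
        out.append(" ".join(sents[lo:hi]))
    return out

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(with_context(sentences), labels)
print(clf.predict(with_context(["It is raining.", "Acme shares fell."])))
```

If the neighbour-as-feature idea helps in a quick test like this, that's a signal that building a proper context-aware architecture is worth the effort.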
Let us know what solution you end up with!
I don't know how similar my problem is, but I had to determine the content of web pages. That boiled down to identifying and classifying the nouns (sorry, no details) on the page.
I turned it into a textcat problem with a context-aware language model: rather than classifying the nouns, I classified the sentences. Sentences about similar items (labels) were grouped together, so consecutive sentences shared the same label.
Why did it work (I think)? The content of the paragraph (which described one noun) was actually captured in a single sentence. The rest of the "context" was just noise relative to the noun (and its label).
How does that apply to your problem? A noun is basically found in a single sentence, and you can ignore the rest.
That leaves the question of how to split the paragraph into sentences. That's pretty simple: first filter out the segments between brackets, then split the remainder on the patterns ". ", "; ", "! ", "? " (don't forget the space, or you will split numbers like "3.5"!). That can all be done with regex; no need for an expensive grammar-based sentencizer like spaCy uses.
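A rough sketch of that splitter; the bracket pattern and punctuation list are assumptions you'd adapt to your own pages:

```python
import re

def split_sentences(paragraph: str) -> list[str]:
    # 1. Filter out segments between round or square brackets.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", paragraph)
    text = re.sub(r"\s{2,}", " ", text).strip()
    # 2. Split on ". ", "; ", "! ", "? " -- requiring the trailing space
    #    keeps numbers like "3.5" from being split apart.
    return [s for s in re.split(r"(?<=[.;!?]) ", text) if s]

print(split_sentences("Revenue (non-GAAP) grew 3.5% in Q1. Margins improved; costs fell."))
# ['Revenue grew 3.5% in Q1.', 'Margins improved;', 'costs fell.']
```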