Text doc sizes at inference vs. training

Could you clarify how the sizes of training docs (e.g. text lengths) ought to relate to the sizes of inference docs? Texts are split into paragraphs so that annotators have relatively small pieces to work with, and these pieces are used for training. Each piece is a paragraph of ~800 characters on average.
For inference in production, should the whole text be sent to the model, or should it also be split into paragraphs, in case the model picked up internal biases from training on shorter pieces?
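For illustration, per-paragraph inference could look roughly like the sketch below. The model name and the blank-line paragraph separator are assumptions for the example, not our actual setup.

```python
# Rough sketch of per-paragraph inference with offset remapping.
# "my_ner_model" and the "\n\n" paragraph separator are assumptions.
import spacy

nlp = spacy.load("my_ner_model")  # hypothetical pipeline name

def entities_in_long_text(text: str):
    """Run the model per paragraph and shift entity character offsets
    back into the coordinates of the full text."""
    entities = []
    offset = 0
    for paragraph in text.split("\n\n"):
        doc = nlp(paragraph)
        for ent in doc.ents:
            entities.append((ent.start_char + offset, ent.end_char + offset, ent.label_))
        offset += len(paragraph) + len("\n\n")  # advance past paragraph + separator
    return entities
```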

Hi there!

In general: you want the validation set to mimic your use case as closely as possible. If you're going to apply the model to sentences, it makes sense to validate it on sentences. This typically also means that the training set should follow suit.

That said, if you can share some more details about your task then I might be able to give more precise advice.

Thank you for the advice.
Use case: detect entities in a text of arbitrary length.
Training is done on pieces of text split up for the annotators' convenience. The dev and test sets follow the same principle.
We could merge multiple Docs into several bigger pieces that would be more representative of the production use case, if that matters for the model? For instance, something like the sketch below.
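A rough sketch, assuming spaCy v3+ where Doc.from_docs is available (the group size of 5 is an arbitrary choice):

```python
# Rough sketch of merging annotated paragraph Docs into larger Docs,
# assuming spaCy v3+ (Doc.from_docs). The group size of 5 is arbitrary.
from spacy.tokens import Doc

def merge_paragraph_docs(paragraph_docs, group_size=5):
    """Concatenate consecutive paragraph Docs so the merged Docs look
    more like full production documents; Doc.from_docs carries the
    entity annotations over."""
    return [
        Doc.from_docs(paragraph_docs[i:i + group_size], ensure_whitespace=True)
        for i in range(0, len(paragraph_docs), group_size)
    ]
```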

But what kinds of entities are you trying to detect? What kind of text are you dealing with? Newspaper articles?

> We could merge multiple Docs into several bigger pieces that would be more representative of the production use case, if that matters for the model?

This might depend on the use case too. I know of folks who are interested in doing sentiment analysis on large documents, but to make the problem simpler they run a classification model on each sentence. Then they count how often sentences come out "negative" to declare the whole document negative.
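A toy sketch of that counting strategy, with a made-up stub standing in for a real sentence classifier and an arbitrary 0.5 threshold:

```python
# Toy sketch: document-level sentiment by counting sentence verdicts.
# classify() is a made-up stub; swap in a real sentence classifier.
def classify(sentence: str) -> str:
    """Placeholder for a trained sentence-level sentiment model."""
    return "negative" if "bad" in sentence.lower() else "positive"

def document_sentiment(sentences: list[str], threshold: float = 0.5) -> str:
    """Call a document "negative" when more than `threshold` of its
    sentences are classified as negative."""
    negative = sum(1 for s in sentences if classify(s) == "negative")
    return "negative" if negative / len(sentences) > threshold else "positive"
```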

Again, a lot of my advice would depend on your use case. If you could share a bit more about the task, I might be able to give better advice.