Text doc sizes at inference vs. training

Could you clarify how the sizes of training docs (e.g. text lengths) ought to relate to the sizes of inference docs? Texts are split into paragraphs so that annotators have relatively small pieces to work with, and these pieces are used for training. Each piece is a paragraph of ~800 characters on average.
For inference in production, should the whole text be sent to the model, or should it also be split into paragraphs, in case the model picked up internal biases from training on shorter pieces?
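For illustration, per-paragraph inference could look roughly like the sketch below. The model name and the blank-line paragraph separator are assumptions for the example, not our actual setup.

```python
# Rough sketch of per-paragraph inference with offset remapping.
# "my_ner_model" and the "\n\n" paragraph separator are assumptions.
import spacy

nlp = spacy.load("my_ner_model")  # hypothetical pipeline name

def entities_in_long_text(text: str):
    """Run the model per paragraph and shift entity character offsets
    back into the coordinates of the full text."""
    entities = []
    offset = 0
    for paragraph in text.split("\n\n"):
        doc = nlp(paragraph)
        for ent in doc.ents:
            entities.append((ent.start_char + offset, ent.end_char + offset, ent.label_))
        offset += len(paragraph) + len("\n\n")  # advance past paragraph + separator
    return entities
```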

Hi there!

In general: you want the validation set to mimic your use case as closely as possible. If you're going to apply the model to sentences, it makes sense to validate it on sentences. This typically also means that the training set should follow suit.

That said, if you can share some more details about your task then I might be able to give more precise advice.

Thank you for the advice.
Use case: detect entities in a text of arbitrary length.
Training is done on pieces of text split up for the annotators' convenience. The dev and test sets follow the same principle.
We could merge multiple Docs into several bigger pieces that would be more representative of the production use case, if that matters for the model? For instance, something like the sketch below.
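A rough sketch, assuming spaCy v3+ where Doc.from_docs is available (the group size of 5 is an arbitrary choice):

```python
# Rough sketch of merging annotated paragraph Docs into larger Docs,
# assuming spaCy v3+ (Doc.from_docs). The group size of 5 is arbitrary.
from spacy.tokens import Doc

def merge_paragraph_docs(paragraph_docs, group_size=5):
    """Concatenate consecutive paragraph Docs so the merged Docs look
    more like full production documents; Doc.from_docs carries the
    entity annotations over."""
    return [
        Doc.from_docs(paragraph_docs[i:i + group_size], ensure_whitespace=True)
        for i in range(0, len(paragraph_docs), group_size)
    ]
```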

But what kinds of entities are you trying to detect? What kind of text are you dealing with? Newspaper articles?

> We could merge multiple Docs into several bigger pieces that would be more representative of the production use case, if that matters for the model?

This might depend on the use case too. I know of folks who are interested in doing sentiment analysis on large documents, but to make the problem simpler they run a classification model on each sentence. Then they count how often sentences come out "negative" to declare the whole document negative.
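A toy sketch of that counting strategy, with a made-up stub standing in for a real sentence classifier and an arbitrary 0.5 threshold:

```python
# Toy sketch: document-level sentiment by counting sentence verdicts.
# classify() is a made-up stub; swap in a real sentence classifier.
def classify(sentence: str) -> str:
    """Placeholder for a trained sentence-level sentiment model."""
    return "negative" if "bad" in sentence.lower() else "positive"

def document_sentiment(sentences: list[str], threshold: float = 0.5) -> str:
    """Call a document "negative" when more than `threshold` of its
    sentences are classified as negative."""
    negative = sum(1 for s in sentences if classify(s) == "negative")
    return "negative" if negative / len(sentences) > threshold else "positive"
```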

Again, a lot of my advice would depend on your use case. If you could share a bit more about the task, I might be able to give better advice.