Visualize a whole document (corpora) for text classification

Hi! :smile:

I am creating a text classification project for which I need one annotation per document, not per paragraph. I tried the --unsegmented flag, but it does not seem to work for text classification.
What is the option to add to my command line to view my complete document when annotating?
Thank you!

Hi! Which recipe are you using, and what file format are you loading in? The general text classification recipes shouldn't do any segmentation; they'll just show you whatever you load in. Just make sure you're using a data format that clearly defines where your documents start and end, e.g. JSON with one "text" per document. If you're using a .txt file, it can only be read in line by line, because there's no way to know where a document with multiple paragraphs starts and ends.
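In case it helps, here is a minimal sketch of one way to get from plain-text files to that kind of format. It assumes each document lives in its own .txt file in one folder, and writes a JSONL file with one "text" per document. The function name `docs_to_jsonl`, the folder layout, and the `source` entry under "meta" are just assumptions for illustration, not something the recipes require.

```python
import json
from pathlib import Path


def docs_to_jsonl(input_dir, output_path):
    """Write one JSON line per .txt file, keeping each file as a single "text".

    Assumption: every .txt file in input_dir is exactly one document.
    Returns the number of documents written.
    """
    input_dir = Path(input_dir)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(input_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if not text:
                # Skip empty files so no blank tasks end up in the queue
                continue
            # Paragraph breaks stay inside the string, so the document
            # boundaries are unambiguous when the file is loaded.
            record = {"text": text, "meta": {"source": path.name}}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

You could then call something like `docs_to_jsonl("./my_documents", "./documents.jsonl")` and load the result with `prodigy textcat.manual my_dataset ./documents.jsonl --label LABEL_A,LABEL_B`, where the folder, dataset name and labels are placeholders.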

By the way, in general, we'd still recommend breaking your documents up into smaller chunks like paragraphs or sections. Making the annotator read the whole document to classify it can be very inefficient and you only get one label per document. Another thing to keep in mind is that most model implementations for text classification wouldn't take context from the whole document into account anyway, and you often end up classifying sentences or paragraphs and averaging over their predictions. So there's often no advantage in annotating a whole document and you might be making your life much harder this way. Also see this section for more details: https://prodi.gy/docs/text-classification#whole-document
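To make the "classify paragraphs and average" idea a bit more concrete, here is a rough sketch. It assumes paragraphs are separated by blank lines and that you have some `predict` function returning a score per label for a piece of text. Both `classify_document` and `toy_predict` are hypothetical names made up for this example; in practice `predict` would wrap whatever model you trained.

```python
from typing import Callable, Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks (assumed paragraph format)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def classify_document(
    text: str, predict: Callable[[str], Dict[str, float]]
) -> Dict[str, float]:
    """Average the per-paragraph label scores into one score per document."""
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return {}
    totals: Dict[str, float] = {}
    for paragraph in paragraphs:
        for label, score in predict(paragraph).items():
            totals[label] = totals.get(label, 0.0) + score
    return {label: total / len(paragraphs) for label, total in totals.items()}


def toy_predict(paragraph: str) -> Dict[str, float]:
    """Hypothetical stand-in for a trained model: keyword match only."""
    return {"SPORTS": 1.0 if "goal" in paragraph.lower() else 0.0}


doc = "The striker scored a late goal.\n\nThe weather was cold and wet."
print(classify_document(doc, toy_predict))  # {'SPORTS': 0.5}
```

A plain mean is only one option; depending on the task, taking the maximum score per label or weighting by paragraph length might fit better.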

Hi :slight_smile:

I am using textcat.manual.
I have already converted one file of my corpus to JSON, and it works: I can now see the whole document.

We want to annotate whole documents because we need one label per document.