visualize a whole document (corpora) for text classification

Lize · August 26, 2021, 8:16am

Hi !

I am creating a text classification project for which I need one annotation per document rather than per paragraph. I have tried the --unsegmented flag, but it does not seem to work for text classification.
Which option should I add to my command to view the complete document when annotating?
Thank you!

ines · August 26, 2021, 11:50pm

Hi! Which recipe are you using and what file format are you loading in? The general text classification recipes shouldn't do any segmentation and they'll just show you whatever you load in. Just make sure you're using a data format that clearly defines where your documents start and end, e.g. JSON with one "text" per document. If you're using a .txt file, this can only be read in line-by-line because there's no way to know where a document with multiple paragraphs starts and ends.
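As a minimal sketch of that conversion, assuming your corpus is a folder with one .txt file per document: the script below writes a JSONL file with one "text" per document, so paragraph breaks inside a file stay within the same record. The function name `txt_files_to_jsonl`, the file paths and the "source" entry under "meta" are illustrative choices, not part of any recipe.

```python
import json
from pathlib import Path


def txt_files_to_jsonl(input_dir, output_path):
    """Write one JSON line per .txt file, keeping each whole file as a single "text"."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if not text:
                continue  # skip empty files so no blank tasks are created
            # "meta.source" is an illustrative extra to trace a task back to its file
            record = {"text": text, "meta": {"source": path.name}}
            # json.dumps escapes newlines, so each document stays on one line
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # "corpus" and "docs.jsonl" are placeholder paths
    n = txt_files_to_jsonl("corpus", "docs.jsonl")
    print(f"Wrote {n} documents")
```

You could then load the result with something like `prodigy textcat.manual your_dataset ./docs.jsonl --label LABEL_A,LABEL_B`, where the dataset name and labels are placeholders.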

By the way, in general, we'd still recommend breaking your documents up into smaller chunks like paragraphs or sections. Making the annotator read the whole document to classify it can be very inefficient and you only get one label per document. Another thing to keep in mind is that most model implementations for text classification wouldn't take context from the whole document into account anyway, and you often end up classifying sentences or paragraphs and averaging over their predictions. So there's often no advantage in annotating a whole document and you might be making your life much harder this way. Also see this section for more details: https://prodi.gy/docs/text-classification#whole-document
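To illustrate the "averaging over their predictions" idea, here is a rough sketch that combines per-paragraph scores into one document-level label. It assumes you already have a dict of label scores for each paragraph from whatever model you use; the function names and the example scores are made up for illustration.

```python
def average_scores(paragraph_scores):
    """Average per-label scores over all paragraphs of one document."""
    if not paragraph_scores:
        return {}
    totals = {}
    for scores in paragraph_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    n = len(paragraph_scores)
    return {label: total / n for label, total in totals.items()}


def document_label(paragraph_scores):
    """Return the label with the highest averaged score, or None if there are no scores."""
    averaged = average_scores(paragraph_scores)
    if not averaged:
        return None
    return max(averaged, key=averaged.get)


if __name__ == "__main__":
    # Hypothetical scores for a three-paragraph document
    scores = [
        {"SPORTS": 0.9, "POLITICS": 0.1},
        {"SPORTS": 0.6, "POLITICS": 0.4},
        {"SPORTS": 0.3, "POLITICS": 0.7},
    ]
    print(average_scores(scores))
    print(document_label(scores))
```

A plain mean treats every paragraph equally; weighting by paragraph length or taking the maximum per label are other options, depending on your data.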

Lize · August 30, 2021, 7:07am

Hi

I am using textcat.manual.
I have already converted one file from my corpus to JSON and it works: I can now see the whole document.

We want to annotate whole documents because we need one label per document.
