visualisation text classification results | print-stream and extraction of text

Lize · September 27, 2021, 8:56am

Hello,

I have created a text classification project in order to detect certain types of paragraphs and to extract them in the future.

So I annotated my JSON corpus and trained a first model in Prodigy. I have very good results for 4 labels out of 7; the others are weaker because of a lack of data, so I will look for new data to improve the scores.

I would like to see the results of the model using the print-stream command. I can see the text of my test examples, but nothing is highlighted. With my first NER model the entities were highlighted; can I get the same visualization with my text classification model?

Another question: my goal is to detect and classify these paragraphs, but also to extract their text. Can I do this once I switch to spaCy? Which command should I use to extract the text of the classified paragraphs, please?

Thank you!

ines · September 27, 2021, 2:55pm

Hi! The print-stream workflow should also work for text classification models, if you provide a pipeline with a textcat component that predicts doc.cats. The highest-scoring category will then be displayed next to the text. It's a bit different from the NER visualization, which outputs spans.

Alternatively, you could also use your own script, process every incoming text with your model and output the doc.cats. Visualising text categories is a bit easier because they're just top-level labels + scores.
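A minimal sketch of such a script, assuming a trained spaCy pipeline whose textcat component fills doc.cats (a dict mapping each label to a score). The function names and the model path in the usage comment are hypothetical, chosen only for illustration:

```python
from typing import Dict, Iterable, List, Tuple


def top_category(cats: Dict[str, float]) -> Tuple[str, float]:
    """Return the highest-scoring label and its score from a doc.cats-style dict."""
    if not cats:
        # No textcat predictions available for this text.
        return ("", 0.0)
    label = max(cats, key=lambda name: cats[name])
    return (label, cats[label])


def predict_top_labels(nlp, texts: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Run every text through the pipeline and collect (text, label, score)."""
    rows = []
    for doc in nlp.pipe(texts):
        label, score = top_category(doc.cats)
        rows.append((doc.text, label, score))
    return rows


def print_predictions(nlp, texts: Iterable[str]) -> None:
    """Print one line per text: label, score and the start of the text."""
    for text, label, score in predict_top_labels(nlp, texts):
        print(f"{label}\t{score:.2f}\t{text[:80]}")


# Usage sketch (assumes spaCy is installed; the path below is hypothetical):
# import spacy
# nlp = spacy.load("./my-textcat-model")
# print_predictions(nlp, ["First paragraph to check.", "Second paragraph."])
```

If a paragraph can carry several labels, printing only the top category hides the others, so you may prefer to print the full doc.cats dict instead.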

Could you clarify what you mean by "extract the texts"? Do you have an example?

Lize · September 28, 2021, 7:46am

Hi!

Thank you, I will try that!

We have created this text classification model to detect and classify certain paragraphs in our texts.
In addition to getting their classification, I would like to extract the text, for example:

Beginning of the document, several paragraphs ...

"CLAUSE DE MOBILITE : Votre lieu de travail ne constituant pas un élément essentiel de votre contrat de travail, la Société se réserve la possibilité, en fonction des nécessités de l'entreprise, de vous muter sur les différents établissements de la zone géographique Paris Ile-de-France existants ou qui pourraient être crées a l'avenir. G.I.E. du Groupe AVIVA FRANCE. Siege social : 80 avenue de l'Europe. 92270 Bois-Colombes Groupement d'Intérêt Economique régi par l'ordonnance du 23 septembre 1967 au capital de 1525 E\n315 597 500 RCS Nanterre. Votre refus d'accepter un tel changement serait susceptible d'entrainer votre licenciement pour cause réelle et sérieuse."
--- Clause Mobilité.
Rest of the document, several paragraphs ...

I want to be able to retrieve the label and the corresponding text, as in the example above.
Is this possible?

To be more precise, I have long documents, each several paragraphs long. In these documents, I want to detect only some paragraphs and extract their text.
To reach my goal, I created a corpus containing only the paragraphs that interest me and annotated it with textcat.manual according to my different labels (a paragraph can have several labels). I then trained my model, which gave very good results.
However, I wonder whether this was the right method, because I only gave it the paragraphs I was interested in (they are part of larger documents).
Will the model I trained be able to detect and classify these paragraphs if I give it the whole document as input? And can it extract the text of those classified paragraphs?

Thank you!

ines · October 1, 2021, 9:01am

The text classifier will predict labels over the whole text you give it, based on the content of that text. If you've trained it on single paragraphs, you should also run it over single paragraphs at runtime. If you give it whole documents, it will classify the whole documents, and you will probably see worse results if it has only ever seen paragraphs during training. (And it won't just magically be able to divide your documents because that's not really what text classification is about.)
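One way to do this at runtime, as a rough sketch: split each document into paragraphs yourself, run the pipeline over the paragraphs, and keep the text of those whose scores pass a threshold. The blank-line splitting rule, the 0.5 threshold and the function names are all assumptions for illustration, not something Prodigy or spaCy prescribes:

```python
import re
from typing import Dict, List, Union

Result = Dict[str, Union[str, Dict[str, float]]]


def split_paragraphs(document: str) -> List[str]:
    """Split on blank lines (assumption: that is how paragraphs are separated)."""
    parts = re.split(r"\n\s*\n", document)
    return [part.strip() for part in parts if part.strip()]


def extract_labelled_paragraphs(nlp, document: str, threshold: float = 0.5) -> List[Result]:
    """Classify each paragraph separately and keep those with a label at or above the threshold."""
    results: List[Result] = []
    for doc in nlp.pipe(split_paragraphs(document)):
        # Keep every label that passes the threshold, since a paragraph can have several.
        labels = {label: score for label, score in doc.cats.items() if score >= threshold}
        if labels:
            results.append({"text": doc.text, "labels": labels})
    return results


# Usage sketch (assumes spaCy is installed; the path below is hypothetical):
# import spacy
# nlp = spacy.load("./my-textcat-model")
# for item in extract_labelled_paragraphs(nlp, open("contract.txt", encoding="utf8").read()):
#     print(item["labels"], item["text"])
```

Keep in mind that a model trained only on the paragraphs of interest has never seen the other paragraphs, so its scores on them may be unreliable. Adding examples of irrelevant paragraphs to the training data would likely help it reject them.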
