Visualising text classification results | print-stream and text extraction

Hello :wave: ,

I have created a text classification project in order to detect certain types of paragraphs and to extract them in the future.

So I annotated my JSON corpus and trained a first model in Prodigy. I get very good results for 4 labels out of 7; the others are weaker because of a lack of data, so I will look for new data to improve the scores.

I would like to see the results of the model using the print-stream command. I can see the text of my test examples, but nothing is highlighted. In my first NER model the entities were highlighted; can I have the same visualization with my text classification model?

Another question: my goal is to detect and classify these paragraphs, but also to extract their text. Can I do this once I switch to spaCy? What command should I use to extract the text of the classified paragraphs, please?

Thank you ! :slight_smile:

Hi! The print-stream workflow should also work for text classification models, if you provide a pipeline with a textcat component that predicts doc.cats. The highest-scoring category will then be displayed next to the text. It's a bit different from the NER visualization, which outputs spans.

Alternatively, you could also use your own script, process every incoming text with your model and output the doc.cats. Visualising text categories is a bit easier because they're just top-level labels + scores.
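As a rough illustration of the "own script" route, here is a minimal sketch. It assumes a trained spaCy pipeline whose textcat component fills doc.cats; the helper names (top_category, iter_predictions, main) and the model path passed to main are placeholders for illustration, not part of Prodigy or spaCy.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def top_category(cats: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return the highest-scoring (label, score) pair, or None if cats is empty."""
    if not cats:
        return None
    label = max(cats, key=cats.get)
    return label, cats[label]


def iter_predictions(nlp, texts: Iterable[str]) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (text, scores) for every text processed by the pipeline."""
    for doc in nlp.pipe(texts):
        # doc.cats maps each label to a score
        yield doc.text, dict(doc.cats)


def main(model_path: str, texts: Iterable[str]) -> None:
    # Imported here so the helpers above can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model_path)  # model_path is a placeholder for your trained model
    for text, cats in iter_predictions(nlp, texts):
        best = top_category(cats)
        print(best, text[:80])
```

Since a paragraph can carry several labels in this project, printing the full doc.cats dictionary instead of only the top category may be more informative.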

Could you clarify what you mean by "extract the texts", do you have an example?

Hi :slight_smile: !

Thank you, I will try that!

We created this text classification model to detect and classify certain paragraphs in our texts.
In addition to getting their classification, I would like to extract their text. For example:

beginning of the document, several paragraphs.............................................

"CLAUSE DE MOBILITE : Votre lieu de travail ne constituant pas un élément essentiel de votre contrat de travail, la Société se réserve la possibilité, en fonction des nécessités de l'entreprise, de vous muter sur les différents établissements de la zone géographique Paris Ile-de-France existants ou qui pourraient être crées a l'avenir. G.I.E. du Groupe AVIVA FRANCE. Siege social : 80 avenue de l'Europe. 92270 Bois-Colombes Groupement d'Intérêt Economique régi par l'ordonnance du 23 septembre 1967 au capital de 1525 E\n315 597 500 RCS Nanterre. Votre refus d'accepter un tel changement serait susceptible d'entrainer votre licenciement pour cause réelle et sérieuse."
--- Clause Mobilité.
Rest of the document, several paragraphs..............................................

I want to be able to retrieve both the label and the corresponding text, as in the example above.
Is this possible?

To be more precise, I have long documents, each several paragraphs long. In these documents, I want to detect only some paragraphs and extract their text.
To reach my goal, I created a corpus containing only the paragraphs that interest me and annotated it with textcat.manual according to my different labels (a paragraph can have several labels). I then trained my model, which gave very good results.
However, I wonder whether this was the right method, because I gave it only the paragraphs I was interested in (they are part of larger documents).
Will the model I trained be able to detect and classify these paragraphs if I give it the whole document as input? And can I extract the text of those classified paragraphs?

Thank you :slight_smile: !

The text classifier will predict labels over the whole text you give it, based on the content of that text. If you've trained it on single paragraphs, you should also run it over single paragraphs at runtime. If you give it whole documents, it will classify the whole documents, and you will probably see worse results if it's only ever seen paragraphs during training. (And it won't just magically be able to divide your documents because that's not really what text classification is about.)
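One way to put this into practice is to split each document into paragraphs yourself, classify each paragraph separately, and keep the text of those that receive a label. The sketch below is only an illustration under some assumptions: paragraphs are separated by blank lines, a score threshold of 0.5 decides whether a label applies, and the function names are made up for this example. Whether a simple threshold is enough to filter out the paragraphs you are not interested in is something to check on your own documents.

```python
import re
from typing import List, Tuple


def split_paragraphs(document: str) -> List[str]:
    """Split a document on blank lines (assumed paragraph separator)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def extract_labelled_paragraphs(
    nlp, document: str, threshold: float = 0.5
) -> List[Tuple[List[str], str]]:
    """Return (labels, paragraph text) for paragraphs with at least one label.

    A paragraph can carry several labels, so every label whose score
    reaches the (assumed) threshold is kept, not only the best one.
    """
    results = []
    for doc in nlp.pipe(split_paragraphs(document)):
        labels = sorted(
            label for label, score in doc.cats.items() if score >= threshold
        )
        if labels:
            results.append((labels, doc.text))
    return results
```

With a pipeline loaded via spacy.load (the path to your trained model is up to you), calling extract_labelled_paragraphs(nlp, document_text) would return a list of label lists paired with the matching paragraph text, which you can then save however you like.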