I have created a text classification project in order to detect certain types of paragraphs and to extract them in the future.
So I annotated my JSON corpus and trained a first model in Prodigy. I have very good results for 4 labels out of 7; the others are weaker because of a lack of data, so I will look for new data to improve the scores.
I would like to see the results of the model using the print-stream command. I can see the text of my test but nothing is highlighted. In my first NER model the items were highlighted. Can I have the same visualization with my text classification model?
Another question: my goal is to detect and classify these paragraphs but also to extract the text. Can I do this once I switch to spaCy? What is the command to use to extract the text of the classified paragraphs, please?
Hi! The print-stream workflow should also work for text classification models, if you provide a pipeline with a component textcat that predicts doc.cats. The highest-scoring category will then be displayed next to the text. It's a bit different from the NER visualization, which outputs spans.
Alternatively, you could also use your own script, process every incoming text with your model and output the doc.cats. Visualising text categories is a bit easier because they're just top-level labels + scores.
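As a rough illustration of the "own script" option, here is a minimal sketch. It assumes the trained pipeline has been exported to a directory that spacy.load can read; the function names and the path in the example call are placeholders, not part of Prodigy or spaCy.

```python
def ranked_categories(cats):
    """Sort a doc.cats-style dict of {label: score} from highest to lowest score."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)


def print_predictions(model_path, texts):
    """Load a trained pipeline and print the ranked categories for each text."""
    import spacy  # imported here so the helper above works without spaCy installed

    nlp = spacy.load(model_path)
    for doc in nlp.pipe(texts):
        scores = ", ".join(
            f"{label}={score:.2f}" for label, score in ranked_categories(doc.cats)
        )
        # Print the scores followed by the start of the text they belong to.
        print(f"{scores}\t{doc.text[:80]}")


# Example call (the path is a placeholder for wherever the pipeline was saved):
# print_predictions("./my_textcat_model", ["First paragraph to check.", "Second one."])
```

Since a paragraph can carry several labels in this project, the sketch prints every score rather than only the top one.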
Could you clarify what you mean by "extract the texts"? Do you have an example?
We have created this text classification model to detect and classify certain paragraphs in our texts.
In addition to getting their classification, I would like to extract the text. For example:
beginning of the document, several paragraphs.............................................
I want to be able to retrieve the label and the corresponding text, along the lines of the example above.
Is this possible?
To be more precise, I have long documents, several paragraphs long. In these documents, I want to detect only some paragraphs and extract the text.
To reach my goal, I created a corpus containing only the paragraphs that interest me and annotated it with textcat.manual according to my different labels (a paragraph can have several labels). I then trained my model, which gave very good results.
But I wonder whether this was the right method, because I only gave it the paragraphs I was interested in (they are part of larger documents).
Will the model I trained be able to detect these paragraphs and classify them if I give the whole document as input? And will it extract the text of those classified paragraphs?
The text classifier will predict labels over the whole text you give it, based on the information in that text. If you've trained it on single paragraphs, you should also run it over single paragraphs at runtime. If you give it whole documents, it will classify the whole documents and you will probably see worse results if it's only ever seen paragraphs during training. (And it won't just magically be able to divide your documents because that's not really what text classification is about.)
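To make the "run it over single paragraphs" idea concrete, here is a minimal sketch. It makes several assumptions: paragraphs are separated by blank lines, a label counts as assigned when its score is at least 0.5, and nlp is a loaded spaCy pipeline with a textcat component. The function names and the threshold are illustrative and should be adapted to your data.

```python
import re


def split_paragraphs(document):
    """Split a document on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def labels_above(cats, threshold=0.5):
    """Return the labels in a doc.cats-style dict whose score reaches the threshold."""
    return [label for label, score in cats.items() if score >= threshold]


def extract_labelled_paragraphs(nlp, document, threshold=0.5):
    """Classify each paragraph on its own and keep those with at least one label."""
    results = []
    for doc in nlp.pipe(split_paragraphs(document)):
        labels = labels_above(doc.cats, threshold)
        if labels:
            # Keep the predicted labels together with the paragraph text.
            results.append({"labels": labels, "text": doc.text})
    return results


# Example call (the model path and file name are placeholders):
# import spacy
# nlp = spacy.load("./my_textcat_model")
# with open("document.txt", encoding="utf8") as f:
#     for item in extract_labelled_paragraphs(nlp, f.read()):
#         print(item["labels"], item["text"])
```

Note that a model trained only on the paragraphs of interest has never seen the "uninteresting" ones, so it may still assign them high scores. Whether the threshold filters those out is something to check on real documents.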