Just a broader question. We have a number of text corpora that we annotated in older Qualitative Data Analysis (QDA) software like NVivo, MaxQDA, ATLAS.ti, Dedoose, and others. These are essentially programs in which analysts manually highlight text spans and then apply one or more codes from a coding scheme. Most of these programs can export to CSV or XML files, which could probably be reshaped into JSONL.
So I'm curious - can these manual annotations be imported into Prodigy as 'seeds' for active learning, or used in any other way? Has anybody done something like that, and did it work? Any suggestions would be most welcome!
Hey Stephan -
I do a lot of work with social scientists who have deep experience with QDA and qualitative coding. You're right that what you describe is technically possible - I've exported data from NVivo and reshaped it into something practical for ML/NLP.
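Here's a minimal sketch of what that reshaping can look like. The column names (`Document`, `Code`, `Coded Text`) are assumptions - export formats vary by tool and version, so adjust them to match your file:

```python
import csv
import json

# Minimal sketch: reshape a QDA coding export (CSV) into JSONL.
# The column names below are hypothetical - check your actual export.
with open("nvivo_export.csv", newline="", encoding="utf-8") as f_in, \
        open("annotations.jsonl", "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        record = {
            "text": row["Coded Text"],            # the highlighted span
            "label": row["Code"],                 # the code applied to it
            "meta": {"source": row["Document"]},  # provenance for later
        }
        f_out.write(json.dumps(record) + "\n")
```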
Where the difficulty arises is in the alignment between the annotation tasks and the objective / business problem / research question. The ultimate question is: suppose what you propose is possible - how would doing NLP on the data you have get you closer to your research goal?
From there, you can look for alignment between the tasks you have data for and standard NLP tasks like Named Entity Recognition or Text Classification. For example, if you have a bunch of additional text and you want to identify paragraphs that someone might want to do further qualitative coding on, I'd treat that as a classification problem at the paragraph level. You could split the documents you already have into paragraphs, label each paragraph "relevant" if it contains any highlights and "irrelevant" if it contains none, and train a model from there.
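As a sketch of that data prep, assuming you've already split the documents into paragraphs and can tell which ones contain at least one highlight (the `highlighted` flag below is hypothetical - derive it from your export however fits your data), the Prodigy-style binary tasks could look like this:

```python
import json

def make_textcat_tasks(paragraphs):
    """paragraphs: iterable of dicts like {"text": ..., "highlighted": bool}"""
    for para in paragraphs:
        yield {
            "text": para["text"],
            "label": "RELEVANT",
            # Prodigy's binary answer format: accept = positive example
            "answer": "accept" if para["highlighted"] else "reject",
        }

paragraphs = [
    {"text": "The participant described feeling isolated...", "highlighted": True},
    {"text": "Interview conducted on site, 45 minutes.", "highlighted": False},
]
with open("textcat_seed.jsonl", "w", encoding="utf-8") as f:
    for task in make_textcat_tasks(paragraphs):
        f.write(json.dumps(task) + "\n")
```

A file like that can then be imported into a Prodigy dataset with `prodigy db-in your_dataset textcat_seed.jsonl`, so your existing QDA work gives the model a head start before any new annotation.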
Another, more granular level would be training a system to automatically code portions of the text in alignment with the existing codebook or coding rules. I've traditionally struggled with this problem in the same way @honnibal describes in this tweet thread: https://twitter.com/honnibal/status/1111990886483853312. That is, the spans annotated in qualitative coding often run across multiple sentences or paragraphs, and don't align neatly with something more granular like NER.
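To make the mismatch concrete, here's a hedged sketch (the example text and code name are made up) that takes one multi-sentence coded span and breaks it into sentence-level classification tasks, which in my experience is a more workable target than NER:

```python
import spacy

# Any pipeline with sentence boundaries works; en_core_web_sm is one option.
nlp = spacy.load("en_core_web_sm")

# A single QDA-coded span that runs across several sentences.
coded_span = (
    "She said the program changed how she budgets. "
    "She now saves a fixed amount every month. "
    "Her family also started keeping receipts."
)
code = "FINANCIAL_BEHAVIOR"  # hypothetical code from your codebook

# One coded span becomes several sentence-level classification examples.
doc = nlp(coded_span)
tasks = [{"text": sent.text, "label": code, "answer": "accept"} for sent in doc.sents]
for task in tasks:
    print(task)
```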
In summary, something can definitely be done, but there's no one-to-one mapping between QDA codes and standard NLP tasks. I'd start with text classification, making sure you think about how its results will be used within your current business problem / research question.