Hi @gus!
Sounds like a fascinating project!
A couple of questions. When you say "issues", do you mean something like "topics" relating to the assignment of public lands in Patagonia? And do you have prior experience with, or expectations about, the words/terms you're looking for?
If so, then yes! Text classification seeded with initial patterns (term lists, aka an ontology) that leverage your expertise sounds like a great plan!
There's a great demo video on identifying insults on Reddit that could help.
The video is a little old, so some of the syntax may have changed, but the idea is the same. Start with one topic that you know of (e.g., perhaps "privatization") and turn it into a binary problem (privatization vs. not privatization). For the terms, you can either write a patterns JSONL file by hand or use the terms.teach recipe to suggest related terms. Then you can use the textcat.teach recipe and pass your patterns.jsonl via the --patterns parameter to pre-select examples containing those terms and phrases. The matched spans are also highlighted when labelling.
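To make the patterns file concrete, here is a minimal sketch in Python that turns a term list into JSONL lines in the token-pattern format Prodigy accepts (one `{"label": ..., "pattern": [...]}` object per line). The terms, the PRIVATIZATION label, and the helper names are placeholders I made up for illustration, and the whitespace split is a simplification: spaCy's tokenizer may split hyphenated or punctuated terms differently, so check a few matches before relying on it.

```python
import json


def terms_to_patterns(terms, label):
    """Build one case-insensitive token pattern per term (hypothetical helper)."""
    patterns = []
    for term in terms:
        # Naive whitespace tokenization; an assumption, not spaCy's tokenizer.
        tokens = [{"lower": tok.lower()} for tok in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


if __name__ == "__main__":
    # Placeholder terms; replace with the vocabulary you expect in your corpus.
    terms = ["privatization", "land sale", "public auction"]
    with open("patterns.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(terms_to_patterns(terms, "PRIVATIZATION")))
```

The resulting patterns.jsonl is what you would pass to --patterns. It's worth running `prodigy textcat.teach --help` to confirm the exact arguments for your installed version.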
One downside of this approach is that you will likely want to identify multiple issues/topics, not just one. However, I would recommend starting with a single binary problem because you'll likely build a good classifier for that topic much more quickly. Then expand it to multiple classes.
Here's a great discussion on why and strategies to expand:
Just curious - do you have any problems with converting the text into txt/jsonl files? I suspect some of these newspapers may have been digitized (e.g., scans of PDF documents) or maybe you've used OCR. If so, there are some options for PDF documents, but this can be tricky depending on the documents.
Also, have you thought about the length of your documents? Typically sentences are the best way to start, but paragraphs can make sense too. To do this pre-processing (sentences or paragraphs) you can create a small script. Let me know if you have questions on how to do this.
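As a rough idea of what that small script could look like, here is a sketch that splits a plain-text article into paragraphs on blank lines and emits one JSONL record per paragraph, with a "text" key and a "meta" dict (the shape Prodigy reads). The file names, the meta fields, and the helper names are assumptions for illustration. OCR output often has irregular line breaks, so the blank-line rule may need adjusting for your files.

```python
import json
import re


def split_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def document_to_records(text, meta):
    """One record per paragraph, carrying the document's metadata along."""
    return [
        {"text": para, "meta": {**meta, "paragraph": i}}
        for i, para in enumerate(split_paragraphs(text))
    ]


if __name__ == "__main__":
    # Hypothetical input file and metadata; adapt to how your corpus is stored.
    with open("article.txt", encoding="utf-8") as f:
        records = document_to_records(f.read(), {"source": "article.txt"})
    with open("paragraphs.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the source and paragraph index in "meta" makes it easier to trace a labelled example back to its newspaper later on.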
If you're labeling on a paragraph or document level, you may also want to consider using a multilabel text classifier (textcat_multilabel in spaCy) when training your model, since topics/issues would not be mutually exclusive. This would enable the model to predict multiple topics for the same text rather than only one topic at a time.
Once you get a model and apply it to the entire corpus, I'd recommend a bar chart race plot to tell the story of the evolution of these topics/issues over time. This video did the same at the end but for named entity recognition. There are other great temporal plots, like streamgraphs, where you can show the count of each issue/topic over time.
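Both plot types need the same underlying table: how many texts were assigned each topic in each time period. Here is a minimal sketch of that aggregation, assuming each prediction is a dict with a "year" and a "scores" mapping from label to probability. That record shape, the 0.5 threshold, and the sample values are all made up for illustration and are not output from a real model.

```python
from collections import Counter, defaultdict


def topic_counts_by_year(predictions, threshold=0.5):
    """Count, per year, how many texts score at or above the threshold per label."""
    counts = defaultdict(Counter)
    for pred in predictions:
        for label, score in pred["scores"].items():
            if score >= threshold:
                # A text can count toward several labels (multilabel setting).
                counts[pred["year"]][label] += 1
    return {year: dict(c) for year, c in sorted(counts.items())}


if __name__ == "__main__":
    # Made-up predictions, only to show the expected input shape.
    predictions = [
        {"year": 1995, "scores": {"PRIVATIZATION": 0.9, "CONSERVATION": 0.2}},
        {"year": 1995, "scores": {"PRIVATIZATION": 0.7, "CONSERVATION": 0.8}},
        {"year": 1996, "scores": {"PRIVATIZATION": 0.1, "CONSERVATION": 0.6}},
    ]
    print(topic_counts_by_year(predictions))
```

From there, the per-year counts can be fed to whichever plotting library you prefer.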
Thanks again for your question. Let me know how the project goes (and if any more questions come up)!