A couple of questions. When you say "issues", could you also call them "topics", i.e. topics related to the assignment of public lands in Patagonia? Do you have experience with, and prior expectations of, the words/terms you're looking for?
If so, then yes! Text classification with an initial pattern (term lists, aka ontology) to leverage your expertise sounds like a great plan!
There's a great demo video on identifying insults on Reddit that could help.
The video is a little old, so some of the syntax may have changed, but the idea is the same. Start with one topic that you know of (e.g., perhaps "privatization") and turn it into a binary problem (privatization vs. not privatization). For the terms, you can either write a patterns JSONL file by hand or use the terms.teach recipe to suggest related terms. Then you can use the textcat.teach recipe with the --patterns parameter to pass in patterns.jsonl, which pre-selects examples containing those terms and phrases. The matched spans are also highlighted when labelling.
One downside of this approach is that you will likely want to identify multiple issues/topics, not just one. However, I would recommend starting with a single binary problem because you'll likely build a good classifier for that topic much more quickly. Then expand it to multiple classes.
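As a rough illustration of the patterns file, here is a minimal sketch that turns a short term list into token-based patterns and writes them as JSONL. The seed terms, the PRIVATIZATION label, and the helper names are placeholders made up for this example, and splitting terms on whitespace is only an approximation of how spaCy would tokenize them.

```python
import json
from pathlib import Path


def build_patterns(terms, label):
    """Turn plain terms into token patterns that match case-insensitively."""
    patterns = []
    for term in terms:
        # Assumption: whitespace splitting is close enough to spaCy's
        # tokenization for simple terms without punctuation.
        tokens = term.lower().split()
        if not tokens:
            continue
        patterns.append(
            {"label": label, "pattern": [{"lower": tok} for tok in tokens]}
        )
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical seed terms for a single binary topic.
seed_terms = ["privatization", "land sale", "public lands"]
patterns = build_patterns(seed_terms, "PRIVATIZATION")
write_jsonl("patterns.jsonl", patterns)
```

The resulting file could then be passed to the recipe with something like: prodigy textcat.teach your_dataset your_spacy_model ./news.jsonl --label PRIVATIZATION --patterns ./patterns.jsonl (the dataset, model, and file names are placeholders, and the exact arguments may differ between Prodigy versions).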
Here's a great discussion on why and strategies to expand:
Just curious - do you have any problems converting the text into txt/jsonl files? I suspect some of these newspapers have been digitized (e.g., scanned PDF documents), or maybe you've used OCR. If so, there are some options for PDF documents, but this can be tricky depending on the documents.
Also, have you thought about the length of your documents? Typically sentences are the best way to start, but paragraphs can also make sense. To do this pre-processing (splitting into sentences or paragraphs), you can write a small script. Let me know if you have questions on how to do this.
If you're labeling on a paragraph or document level, you may also want to consider a multilabel text classifier (textcat_multilabel in spaCy) when training your model, since topics/issues would not be mutually exclusive. This lets the model predict multiple topics for a text rather than only one topic at a time.
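For the pre-processing script, a minimal sketch could look like the following. It assumes the text has already been extracted to plain text with blank lines between paragraphs, and it uses a deliberately naive regex for sentence boundaries (it will break on abbreviations), so a proper sentence segmenter may be a better choice in practice. The function names and the "source" meta field are made up for this example.

```python
import json
import re


def split_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [part for part in parts if part]


def to_jsonl(units, source):
    """One {"text": ...} record per line, with the source kept as metadata."""
    return "\n".join(
        json.dumps({"text": unit, "meta": {"source": source}}, ensure_ascii=False)
        for unit in units
    )


sample = (
    "The council met on Monday. It voted to sell the land.\n"
    "\n"
    "Residents protested the decision."
)
paragraphs = split_paragraphs(sample)
sentences = [s for p in paragraphs for s in split_sentences(p)]
print(to_jsonl(sentences, "sample.txt"))
```

Switching between sentence-level and paragraph-level tasks is then just a matter of passing either sentences or paragraphs to to_jsonl.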
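To make "not mutually exclusive" concrete, here is a small sketch that merges per-topic binary decisions into one multi-label record per text. It assumes each annotation has "text", "label" and "answer" fields, as binary text classification tasks typically do; the MINING label is a hypothetical second topic, and treating labels that were never annotated for a text as 0.0 is a simplifying assumption you may not want.

```python
from collections import defaultdict


def merge_binary_annotations(annotations, labels):
    """Combine accept/reject decisions per label into one cats dict per text."""
    by_text = defaultdict(dict)
    for eg in annotations:
        if eg["answer"] == "ignore":
            continue  # skipped examples carry no information
        score = 1.0 if eg["answer"] == "accept" else 0.0
        by_text[eg["text"]][eg["label"]] = score
    merged = []
    for text, seen in by_text.items():
        # Assumption: a label that was never annotated counts as negative.
        cats = {label: seen.get(label, 0.0) for label in labels}
        merged.append({"text": text, "cats": cats})
    return merged


land_sale = "The province sold public land to a mining company."
annotations = [
    {"text": land_sale, "label": "PRIVATIZATION", "answer": "accept"},
    {"text": land_sale, "label": "MINING", "answer": "accept"},
    {"text": "The festival opens on Friday.", "label": "PRIVATIZATION", "answer": "reject"},
]
merged = merge_binary_annotations(annotations, ["PRIVATIZATION", "MINING"])
for record in merged:
    print(record)
```

The first text ends up with both topics set to 1.0, which is exactly the case a mutually exclusive classifier cannot represent.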
My newspapers are in PDF, so I will use Python-tesseract, an optical character recognition (OCR) tool for Python.
I also know of a free program, gImageReader, that uses the Tesseract library.
One last question: what do you think about using NER on my corpus, for example to identify the key players, names of people, organizations, events, or places in the newspapers? And of course I would use text classification as well.
Thanks very much
Yes, NER makes sense. I would start with the entity types from spaCy's pretrained NER models and filter out the irrelevant ones. I suspect some of the pretrained entity types will work well in general (e.g., PERSON, GPE, ORG, LOC). If you find the base model doesn't work well, you can update it on the relevant entity types for better performance on your dataset. But if you fine-tune or add new entity types, be aware that you may run into catastrophic forgetting when retraining.
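As a starting point, a sketch along these lines would count the entities of the types you care about. The en_core_web_sm model name is only an assumption for illustration; label names depend on the model (the Spanish pipelines, for instance, use PER, LOC, ORG and MISC), so it is worth checking nlp.get_pipe("ner").labels for whichever model you load. The helper names are made up for this example.

```python
from collections import Counter

# Entity types to keep; adjust to the label scheme of your model.
KEEP_LABELS = {"PERSON", "GPE", "ORG", "LOC"}


def count_entities(doc, keep=KEEP_LABELS):
    """Count (text, label) pairs for entities whose label is in `keep`."""
    return Counter(
        (ent.text, ent.label_) for ent in doc.ents if ent.label_ in keep
    )


def count_entities_in_texts(texts, model_name="en_core_web_sm"):
    """Run a pretrained pipeline over many texts and total the kept entities.

    spaCy is imported here so the filtering logic above has no dependencies.
    Requires spaCy and the named model to be installed.
    """
    import spacy

    nlp = spacy.load(model_name)
    totals = Counter()
    for doc in nlp.pipe(texts):
        totals.update(count_entities(doc, KEEP_LABELS))
    return totals
```

Calling count_entities_in_texts(list_of_article_texts).most_common(20) would then give a quick view of the most frequent people, organizations, and places, which is a cheap way to judge whether the pretrained model is good enough before annotating anything.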
Keep us informed on how the project goes and let us know if you have any further questions!