NLP project newspapers

ryanwesslen · May 26, 2022, 9:43pm

Sounds like a fascinating project!

A couple of questions. When you say "issues", do you mean something like "topics" in the coverage of how public lands in Patagonia were assigned? And do you have experience with, or prior expectations about, the words/terms you're looking for?

If so, then yes! Text classification with initial patterns (term lists, a.k.a. an ontology) to leverage your expertise sounds like a great plan!

There's a great demo video on identifying insults on Reddit that could help.

The video is a little old, so some of the syntax may have changed, but the idea is the same. Start with one topic that you already know of (e.g., perhaps "privatization") and turn it into a binary problem (privatization vs. not privatization). For the terms, you can either write a patterns JSONL file yourself or use the terms.teach recipe to suggest related terms. Then you can run the textcat.teach recipe and use the --patterns parameter to pass in the patterns.jsonl, so that examples containing your terms and phrases are selected for annotation. The matched spans are also highlighted when labelling.
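As a rough sketch of the patterns file step, here is one way to turn a term list into patterns.jsonl. The seed terms and the PRIVATIZATION label are made-up placeholders, and the helper names are my own; swap in the terms you actually expect to see in your newspapers.

```python
import json


def build_patterns(label, terms):
    """Turn a list of terms into match patterns, one per term."""
    patterns = []
    for term in terms:
        # One token dict per whitespace-separated word, matched on the
        # lowercased token text so that casing does not matter.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical seed terms, for illustration only.
seed_terms = ["privatization", "land sale", "public auction"]
patterns = build_patterns("PRIVATIZATION", seed_terms)
write_jsonl("patterns.jsonl", patterns)
```

You could then pass the file along the lines of: prodigy textcat.teach privatization_data en_core_web_sm newspapers.jsonl --label PRIVATIZATION --patterns patterns.jsonl. The dataset name, pipeline and input file there are placeholders, so please check the exact arguments against your installed Prodigy version.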

One downside of this approach is that you'll likely want to identify multiple issues/topics, not just one. However, I would still recommend starting with a single binary task because you'll likely build a good classifier for that topic much quicker. Then expand it to multiple classes.

Here's a great discussion on why and strategies to expand:

Just curious - did you have any problems converting the text into txt/jsonl files? I suspect some of these newspapers may have been digitized (e.g., scanned PDF documents) or maybe you've used OCR. If so, there are some options for PDF documents, but this can be tricky depending on the documents.

Also, have you thought about the length of your documents? Typically sentences are the best unit to start with, but paragraphs can make sense too. To do this pre-processing (splitting into sentences or paragraphs) you can write a small script. Let me know if you have questions on how to do this.
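For the pre-processing script, a minimal sketch could look like the following. It assumes paragraphs are separated by blank lines and uses a very naive regex for sentences; for real newspaper text (abbreviations, OCR noise) a proper sentence segmenter such as spaCy's would likely do better. The function names, doc_id value and output file name are just illustrative.

```python
import json
import re


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


def to_examples(doc_id, text, unit="sentence"):
    """Create one {"text": ..., "meta": ...} record per sentence or paragraph."""
    examples = []
    for p_idx, paragraph in enumerate(split_paragraphs(text)):
        chunks = split_sentences(paragraph) if unit == "sentence" else [paragraph]
        for chunk in chunks:
            # Collapse hard line breaks left over from the page layout.
            clean = " ".join(chunk.split())
            examples.append(
                {"text": clean, "meta": {"doc_id": doc_id, "paragraph": p_idx}}
            )
    return examples


# Made-up article text, for illustration only.
article = (
    "The province announced a land auction. Ranchers objected.\n\n"
    "A second paragraph\nwraps across lines."
)
examples = to_examples("hypothetical-article-1", article)

with open("newspapers.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Keeping the document id and paragraph index in "meta" means you can later trace each labelled sentence back to the article (and date) it came from.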

If you're labeling on a paragraph or document level, you may also want to consider using a textcat_multilabel component when training your model, as topics/issues would not be mutually exclusive at that length. This would enable the model to predict multiple topics for the same text rather than only one topic at a time.

Once you have a model and have applied it to the entire corpus, I'd recommend a bar chart race plot to tell the story of how these topics/issues evolve over time. This video did the same at the end but for named entity recognition. There are other great temporal plots, like streamgraphs, where you can show the count of each issue/topic over time.
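Both kinds of plot need counts per topic per time period. Here is a small sketch of that aggregation step, assuming you've already stored each prediction with a year and a list of predicted topics; the record layout, years and labels below are invented for illustration.

```python
from collections import Counter, defaultdict


def counts_by_year(predictions):
    """Count how often each topic was predicted in each year."""
    table = defaultdict(Counter)
    for item in predictions:
        for topic in item["topics"]:
            table[item["year"]][topic] += 1
    # Plain dicts, sorted by year, are easy to hand to a plotting library.
    return {year: dict(table[year]) for year in sorted(table)}


# Made-up model output, for illustration only.
predictions = [
    {"year": 1921, "topics": ["PRIVATIZATION"]},
    {"year": 1921, "topics": ["PRIVATIZATION", "LEASES"]},
    {"year": 1922, "topics": ["LEASES"]},
    {"year": 1922, "topics": []},
]
yearly = counts_by_year(predictions)
```

If the number of articles per year varies a lot, you may want to divide by the yearly total so the plot shows shares rather than raw counts.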

Thanks again for your question. Let me know how the project goes (and if any more questions come up)!
