NLP project newspapers

Hello

I was reviewing the Prodigy documentation, and I have a question about my work project.

My project is historical and consists of analyzing a corpus of 1,000 newspapers from 1929 to 1945.

My main objective is to understand the view these newspapers took of the assignment of public lands in Patagonia.

I plan to work with text classification.

The idea is to use a sample of one newspaper where I annotate these issues, and then apply the text classification model to the rest of the newspapers.

Finally, I would make a graph that shows how this print medium covered these issues...

I don't know if the way I'm approaching it is OK, and whether it would be important to build an ontology of terms before labeling words?

Or maybe text classification is not the way to go.
Thank you
Gus

hi @gus!

Sounds like a fascinating project!

A couple of questions. When you say "issues", could you also call them "topics" related to the assignment of public lands in Patagonia? Do you have experience with, and prior expectations of, the words/terms you're looking for?

If so, then yes! Text classification with initial patterns (term lists, aka an ontology) to leverage your expertise sounds like a great plan!

There's a great demo video on identifying insults on Reddit that could help.

The video is a little old so some of the syntax may have changed, but the idea is the same. Start with one topic you know of (e.g., perhaps "privatization") and turn it into a binary problem (privatization vs. not privatization). For the terms, you can either write a patterns.jsonl file by hand or use the terms.teach recipe to suggest related terms. Then you can use the textcat.teach recipe with the --patterns parameter to pass in patterns.jsonl, so annotation prioritizes examples containing those terms and phrases. The matched spans are also highlighted while labelling.
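For example (the label and Spanish seed terms below are just placeholders, and exact recipe syntax depends on your Prodigy version), a patterns.jsonl might contain lines like:

```
{"label": "PRIVATIZATION", "pattern": [{"lower": "tierras"}, {"lower": "fiscales"}]}
{"label": "PRIVATIZATION", "pattern": "privatización"}
```

and the annotation workflow would look roughly like:

```
# suggest more seed terms from word vectors (needs a model with vectors, e.g. es_core_news_md)
prodigy terms.teach land_terms es_core_news_md --seeds "tierras, fiscales, privatización"
# (terms.to-patterns can then convert the accepted terms into a patterns file)

# annotate with pattern matches prioritized and highlighted
prodigy textcat.teach land_topics es_core_news_md ./sentences.jsonl --label PRIVATIZATION --patterns ./patterns.jsonl
```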

One downside of this approach is that you likely expect/want to identify multiple issues/topics, not just one. However, I would recommend starting with a single binary problem, because you'll likely build a good classifier on that topic much more quickly, and then expanding to multiple classes.

Here's a great discussion on why and strategies to expand:

Just curious - will you have any problems converting the text into txt/jsonl files? I suspect some of these newspapers may only exist as digitized scans (e.g., PDF documents), or maybe you've already run OCR. If so, there are some options for PDF documents, but this can be tricky depending on the documents.

Also, have you thought about the length of your documents? Typically sentences are the best unit to start with, but paragraphs can also make sense. To do this pre-processing (splitting into sentences or paragraphs) you can create a small script. Let me know if you have questions on how to do this.
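Here's a minimal sketch of that kind of pre-processing with spaCy (the file names are placeholders for your own OCR output):

```python
import json
import spacy

# Split raw newspaper text into sentences and write one JSONL record per
# sentence, ready to load into Prodigy. "ocr_output.txt" is a placeholder.
nlp = spacy.blank("es")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

with open("ocr_output.txt", encoding="utf8") as f_in, \
        open("sentences.jsonl", "w", encoding="utf8") as f_out:
    for doc in nlp.pipe(f_in, batch_size=50):
        for sent in doc.sents:
            text = sent.text.strip()
            if text:
                f_out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```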

If you're labeling on a paragraph or document level, you may also want to consider using a multilabel_textcat when training your model as topics/issues would not be mutually exclusive. This would enable the model to make predictions of multiple topics rather than only one topic at a time.
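With a recent Prodigy version, the training call might look roughly like this (the dataset and output names are placeholders; the flags vary between versions, so check `prodigy train --help` for yours):

```
# train a multilabel text classifier so each example can receive several topic labels
prodigy train ./textcat_model --textcat-multilabel land_topics --lang es
```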

Once you get a model and apply it to the entire corpus, I'd recommend a bar chart race plot to tell the story of the evolution of these topics/issues over time. This video did the same at the end, but for named entity recognition. There are other great temporal plots, like streamgraphs, where you can show the count of each issue/topic over time.
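If it helps, here's a rough matplotlib sketch of the streamgraph idea (the predictions.jsonl file and its year/topics fields are just placeholders for however you store the model's output on the full corpus):

```python
import json
from collections import defaultdict

import matplotlib.pyplot as plt

# Count predicted topics per year from a hypothetical predictions.jsonl,
# where each record looks like {"year": 1935, "topics": ["PRIVATIZATION", ...]}.
counts = defaultdict(lambda: defaultdict(int))
with open("predictions.jsonl", encoding="utf8") as f:
    for line in f:
        record = json.loads(line)
        for topic in record["topics"]:
            counts[record["year"]][topic] += 1

years = sorted(counts)
topics = sorted({t for by_topic in counts.values() for t in by_topic})
series = [[counts[y].get(t, 0) for y in years] for t in topics]

plt.stackplot(years, series, labels=topics, baseline="wiggle")  # "wiggle" gives the streamgraph look
plt.legend(loc="upper left")
plt.xlabel("Year")
plt.ylabel("Articles per topic")
plt.show()
```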

Thanks again for your question. Let me know how the project goes (and if any more questions come up)!

Hello Ryan
Thanks very much.
I will use term lists from here

but in Spanish...
I will use this library

My newspapers are in PDF... so I will use Python-tesseract, an optical character recognition (OCR) tool for Python.
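Something roughly like this is what I have in mind (file names are just examples, and it needs the poppler and tesseract binaries plus the Spanish language data installed):

```python
from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to an image, then OCR it with the Spanish traineddata.
pages = convert_from_path("diario_1935.pdf", dpi=300)
with open("ocr_output.txt", "w", encoding="utf8") as f_out:
    for page in pages:
        f_out.write(pytesseract.image_to_string(page, lang="spa"))
        f_out.write("\n")
```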

I also know of the free software gImageReader, which uses the Tesseract library...

And the last question:
What do you think about using NER on my corpus, for example to identify the main key players: names of people, organizations, events, or places in the newspapers?
And of course also using text classification...
Thanks very much
Gus

That sounds great!

Yes, NER makes sense. I would start with the spaCy pretrained NER entity types and filter out the irrelevant ones. I suspect some of the pretrained entity types will work well in general (e.g., PERSON, GPE, ORG, LOC). If you find the base model doesn't work well, you can then fine-tune the relevant entity types for better performance on your dataset. But if you fine-tune or add new entity types, be aware that you may run into catastrophic forgetting when retraining.
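For example, a minimal sketch with a pretrained Spanish pipeline (note that the Spanish models use the PER/LOC/ORG/MISC label set rather than the English PERSON/GPE labels; the example sentence here is made up):

```python
from collections import Counter

import spacy

# python -m spacy download es_core_news_md
nlp = spacy.load("es_core_news_md")
keep = {"PER", "ORG", "LOC"}  # filter to the entity types you care about

text = "El Ministerio de Agricultura adjudicó tierras fiscales en la Patagonia."
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in keep]
print(entities)
print(Counter(label for _, label in entities))
```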

Keep us informed on how the project goes and let us know if you have any further questions!