Annotating infrequent occurrences

Hi everyone

I have two corpora (8,000 and 15,000 articles respectively) that I'm told contain content about "COST" in the sense of "the social cost of war is ...", "this war has put enormous financial strain on the economy of this country", etc. So anything but money.
Unfortunately, phrases like this occur very infrequently. My thinking was that, since a corpus is just any collection of documents, I could use a regex to keep only the documents whose content fits our definition of "COST". Compacting the corpus this way is (to my thinking) just a method to increase the hit rate during annotation. The regex is at no point used in the model.
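
To make the idea concrete, here is a minimal sketch of the preselection step. The pattern and the three example sentences are placeholders made up for illustration, not our actual definition of "COST":

```python
import re

# Hypothetical keyword pattern; a real definition of "COST" would need more care.
COST_PATTERN = re.compile(r"\b(cost|toll|strain|burden)s?\b", re.IGNORECASE)

def preselect(documents):
    """Return the documents with at least one pattern match, to be annotated first."""
    return [doc for doc in documents if COST_PATTERN.search(doc)]

# Toy corpus standing in for the real articles.
corpus = [
    "The social cost of war is rarely counted.",
    "This war has put enormous financial strain on the economy of this country.",
    "The match ended in a draw after extra time.",
]

candidates = preselect(corpus)
print(len(candidates), "of", len(corpus), "documents selected for annotation")
```

The regex only decides what goes into the annotation queue; the labels still come from the human annotators.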

So my question: is this a valid method (one that can be justified), or does it violate all the rules of proper NLP?



Hi Andreas,

drawing from my experience as an NLP researcher, there is no such thing as "proper" NLP etiquette in terms of data preprocessing. Corpus preprocessing is often all over the place and often doesn't really reflect a stringent discipline. So, whether this kind of preprocessing is accepted or raises eyebrows will depend on your audience and on your goal.

What do you want to do with the filtered documents? Do you need to further analyse the documents that do contain "COST" info or the ones that don't contain it?

All the best,

Hi Andrea,

thanks for the response. Yes, I do need to do more with COST in a later phase, but right now it's difficult to get a better corpus.
Most importantly, you also see a document as having the property you're interested in, and a corpus as a mere random collection of these documents. Half a corpus is still a corpus; half a document is nothing.

thanks for your input


Hi Andreas,

I think I'll need to prod a bit more to get a better understanding.

What is the end goal here? Are you undertaking e.g. a corpus linguistics study? Is your aim to build an NLP system for a specific task?

Also, who is your audience? Is it NLP people? Is it linguists? Sociologists?

The advice I would give you really depends on the answers to these questions. The problem with regex-based filtering is that you're limiting your scope massively. Natural language usually offers a multitude of ways of formulating the same thing. Depending on what you want to achieve, regex-based preselection will be fine or not. If your goal is e.g. to build a classifier that decides whether some text mentions your "COST" concept, then filtering out the negative examples could be catastrophic. If your goal is to do a corpus-based study of e.g. how many news articles talk about this topic, then this kind of filtering (possibly combined with some syntactic parsing and selection based on syntactic roles) could be adequate, but you might also want to explore topic models.
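
To illustrate the scope problem, here is a small sketch. The pattern and the sentences are made up for the example; all three sentences arguably express the concept, but a keyword pattern only catches the one that uses the expected wording:

```python
import re

# Hypothetical keyword pattern, for illustration only.
pattern = re.compile(r"\b(cost|toll|strain|burden)s?\b", re.IGNORECASE)

# Invented sentences that all talk about the non-monetary price of a war.
positives = [
    "The social cost of war is immense.",
    "The conflict has drained the treasury and emptied the villages.",
    "A generation paid for this war with its future.",
]

found = [s for s in positives if pattern.search(s)]
missed = [s for s in positives if not pattern.search(s)]

# Share of the true positives that the filter lets through.
recall = len(found) / len(positives)
print(f"filter recall on these examples: {recall:.2f}")
for sentence in missed:
    print("missed:", sentence)
```

Anything the filter misses never reaches the annotators, so a classifier trained on the filtered data would never see those formulations.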

I'm happy to discuss.

All the best,

Hi Andreas,

I think when you're working with heavy class imbalances, some sort of preselection strategy to intentionally bias the sample can be very helpful, and I do think it's a fairly common thing that people do.

The main thing you should be careful about is your evaluation. Evaluating over the biased data might be misleading, because the data won't match what you see at runtime. You might also want to add back some of the negative class data later on, which should be very quick to annotate (you'll just be skipping through them quickly saying "reject").
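
One way to set this up, as a rough sketch: hold out a plain random sample for evaluation before applying the filter, then build the annotation pool from the regex matches plus some randomly drawn non-matches. The function name, the pattern, the 1:1 ratio and the toy corpus are all assumptions for illustration:

```python
import random
import re

pattern = re.compile(r"\bcost\b", re.IGNORECASE)  # hypothetical filter

def split_for_annotation(documents, eval_size, negatives_per_match=1.0, seed=0):
    """Return (annotation_pool, eval_set) built from a list of document strings."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    # The evaluation set is a plain random sample, so it keeps the runtime class balance.
    eval_set, rest = docs[:eval_size], docs[eval_size:]
    matched = [d for d in rest if pattern.search(d)]
    unmatched = [d for d in rest if not pattern.search(d)]
    # Add back some non-matching documents; most should be quick "reject" annotations.
    n_neg = min(len(unmatched), round(len(matched) * negatives_per_match))
    pool = matched + rng.sample(unmatched, n_neg)
    rng.shuffle(pool)
    return pool, eval_set

# Toy corpus: 10 matching articles among 100.
corpus = (
    [f"Article {i} about the cost of war." for i in range(10)]
    + [f"Article {i} about football." for i in range(10, 100)]
)
annotation_pool, eval_set = split_for_annotation(corpus, eval_size=20)
print(len(annotation_pool), "documents to annotate,", len(eval_set), "held out for evaluation")
```

The held-out set still has to be annotated, but because it is unfiltered, scores measured on it should be closer to what you would see at runtime.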