Dealing with sparse data


I really love the cool additions in the new Prodigy version!

My question today is about sparse data. I have about 400 text files of roughly 30 pages each, all about oil fields. I am trying to extract the location of each field. Unfortunately, the location is mentioned in a single line somewhere in the 30 pages, or it's never mentioned at all.
As a result, my accept:reject ratio while annotating is super small.

Is there a way that I can improve this number or rather, find an easier way to find spans that contain info about the locations?

Any advice on this topic is appreciated! Thank you so much in advance!

I think the best solution is to pre-process the data with some custom logic, so that the questions you get are more relevant. This could be simple rule-based logic that takes advantage of the document structure in your specific data. You would write a custom recipe to do this, using the process described here:
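As a rough illustration of the rule-based idea, here is a minimal sketch in plain Python. It assumes paragraphs are separated by blank lines and that a handful of cue words hint at a location; both the cue list and the helper names (`split_paragraphs`, `candidate_tasks`) are made up for this example and would need tuning to your documents. It only yields dicts with a `"text"` key and some `"meta"`, which is the general shape of an annotation task, so a generator like this could serve as the stream in a custom recipe.

```python
import re

# Hypothetical cue words: adjust these to whatever your documents actually use.
LOCATION_CUES = re.compile(
    r"\b(locat\w+|offshore|onshore|basin|county|province|latitude|longitude)\b",
    re.IGNORECASE,
)


def split_paragraphs(text):
    """Split a document on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def candidate_tasks(doc_text, doc_id):
    """Yield task dicts only for paragraphs that contain a cue word."""
    for i, para in enumerate(split_paragraphs(doc_text)):
        if LOCATION_CUES.search(para):
            # Keep the source document and paragraph index for later reference.
            yield {"text": para, "meta": {"doc": doc_id, "paragraph": i}}
```

Even a crude filter like this should shift the accept:reject ratio, because most of the 30 pages never reach the annotator. The trade-off is recall: a location phrased without any of the cue words is silently dropped, so it is worth spot-checking some rejected paragraphs.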

If it’s hard to come up with a good rule-based pre-filter, you can also create a statistical one. For instance, you might cut your documents into paragraphs and train a textcat model to decide whether each paragraph is a good candidate for annotation. There’s always a trade-off between labelling the paragraphs with a classification scheme that’s easy to learn and one that does exactly what you need. For instance, you might have long sections of the documents that are concerned with equipment or personnel issues. It might make sense to apply those labels, even if the question you really care about is “Does it mention an oil field?”, because that more specific label might be harder to learn.
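To make the statistical variant concrete, here is a small sketch of how the filtering step might look once you have a paragraph classifier. Everything named here is an assumption for illustration: the labels `EQUIPMENT` and `PERSONNEL`, the 0.5 threshold, and the `categorize` callable, which is expected to map a string to a `{label: score}` dict. With a trained spaCy text classifier, `lambda text: nlp(text).cats` would have that shape. The `toy_categorize` function is only a stand-in so the example runs on its own; it is not a real model.

```python
import re

# Hypothetical coarse labels that are easy to learn. Paragraphs dominated by
# them are unlikely to say where a field is located.
SKIP_LABELS = {"EQUIPMENT", "PERSONNEL"}


def split_paragraphs(text):
    """Split a document on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def filter_paragraphs(text, categorize, skip_labels=SKIP_LABELS, threshold=0.5):
    """Yield task dicts for paragraphs not confidently assigned a skip label.

    `categorize` maps a string to a {label: score} dict.
    """
    for i, para in enumerate(split_paragraphs(text)):
        scores = categorize(para)
        # Highest score among the labels we want to filter out.
        skip_score = max((scores.get(label, 0.0) for label in skip_labels), default=0.0)
        if skip_score < threshold:
            yield {"text": para, "meta": {"paragraph": i, "scores": scores}}


def toy_categorize(text):
    """Stand-in for a trained model, using keywords instead of learned weights."""
    lowered = text.lower()
    return {
        "EQUIPMENT": 0.9 if "pump" in lowered else 0.1,
        "PERSONNEL": 0.9 if "crew" in lowered else 0.1,
    }
```

Filtering on the easy labels rather than on a single "relevant" label follows the trade-off described above: the model only has to recognise what equipment and personnel sections look like, and whatever is left over goes to annotation.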