Dealing with Sparse data

Jashmi1 · May 1, 2018, 6:10pm

Hi,

I really love the cool additions to the new Prodigy version!

My question today is about sparse data. I have about 400 text files of roughly 30 pages each that describe oil fields. I am trying to extract the location of each field. Unfortunately, the location is mentioned in a single line of a 30-page document, or it's never mentioned at all.
Now while annotating, my accept:reject ratio is super small.

Is there a way that I can improve this number or rather, find an easier way to find spans that contain info about the locations?

Any advice on this topic is appreciated! Thank you so much in advance!

honnibal · May 7, 2018, 12:38pm

I think the best solution is to have some custom logic pre-processing the data, so that you can make the questions more relevant. This could be simple rule-based logic, that takes advantage of the document structure for your specific data. You would make a custom recipe to do this, using the process described here: https://prodi.gy/docs/workflow-custom-recipes
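To make the rule-based idea concrete, here is a minimal sketch of a pre-filter, not a full custom recipe. It only uses the standard library. The cue words in `LOCATION_CUES` and the function name `candidate_tasks` are made up for illustration and would need to be adapted to your documents. It yields dicts with a `"text"` key, on the assumption that this is the task format your recipe will consume.

```python
import re

# Hypothetical cue words; replace them with terms that fit your own documents.
LOCATION_CUES = re.compile(
    r"\b(located|location|offshore|onshore|basin|block)\b",
    re.IGNORECASE,
)


def candidate_tasks(text, source=None):
    """Yield one task dict per non-empty line that contains a location cue."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if line and LOCATION_CUES.search(line):
            # "meta" records where the line came from, so it can be traced back.
            yield {"text": line, "meta": {"source": source, "line": line_no}}
```

Feeding only these lines to the annotation step should raise the share of relevant questions, at the cost of missing locations that are phrased without any of the cue words.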

If it’s hard to come up with a good rule-based pre-filter, you can also create a statistical one. For instance, you might cut your document into paragraphs, and train a textcat model to figure out whether the paragraph is a good candidate for annotation. There’s always a trade-off between labelling the paragraphs with a classification scheme that’s easy to learn, and one which does exactly what you need. For instance, you might have long sections of the documents that are concerned with equipment or personnel issues. It might make sense to apply those labels, even if what you care about is “Does it mention an oil field?” — that label might be harder to learn, as it’s more specific.
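A rough sketch of the statistical variant, under some assumptions: paragraphs are separated by blank lines, and you have some scoring function that returns a number between 0 and 1 for each paragraph. The names `split_paragraphs`, `filter_paragraphs` and `toy_score` are hypothetical, and `toy_score` is only a stand-in so the example runs without a trained model.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def filter_paragraphs(paragraphs, score_fn, threshold=0.5):
    """Yield task dicts for paragraphs scored at or above the threshold."""
    for para in paragraphs:
        score = score_fn(para)
        if score >= threshold:
            yield {"text": para, "meta": {"score": score}}


def toy_score(paragraph):
    # Stand-in scorer for illustration only, not a trained classifier.
    return 1.0 if "field" in paragraph.lower() else 0.0
```

Once you have a trained spaCy text classifier, `score_fn` could read the score from `nlp(text).cats` for whichever label you chose. The label name and the threshold are things you would have to decide for your own data.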
