Loading CSV data (public oil and gas drilling)

I have a dataset in a CSV file with two columns and over 250K rows. I separated the data into training, validation and hold-out sets. Each row looks like this example.

> SN_WAR,TEXT_REMARK
> -260554,"11/16/14 - Circ. Pump OOH w/ 12-1/4"" clean out assy f/ 11135'-3824'. POOH f/ 3824' to surface. RU & RIH w/ 9-7/8"" LNR f/ surface to 1353'. 11/17/14 - RIH w/ 9-7/8"" LNR on 6-5/8"" DP to 11132'. CBU. Set HGR. Cmt LNR w/ 499 sx, 543 cu ft, 16.4 ppg CL H. Reverse out & circ. 11/18/14 - POOH w/ 9-7/8"" running tool f/ 9606' to surface. MU & RIH w/ 8-1/2"" x 9 7/8"" BHA to 3239'. Pressure test 9-7/8"" x 13-5/8"" csg & BSR to 2750 psi, good test. Function test BSR, CSR & diverter. RIH f/ 3239'-8956', space out & pressure test BOP's w/ 5"" DP f/ 250/5500 psi. RIH to 10221', pressure test BOP's w/ 6-5/8"" DP f/ 250/5500 psi. 11/19/14 - Drill out float collar/cmt & 9-7/8"" shoe track increasing MW f/ 11.2 ppg to 11.4 ppg prior to drilling out shoe @ 11132'. Record SCR's & CLFP's. Drill out shoe & clean out rat hole to 11135'. Drill 10' of new formation to 11145'. CBU above BOP. Perform FIT w/ 11.4 ppg, 716 psi = 12.8 ppg EMW. Activate reamer 10' below shoe. Drill 8-1/2"" X 9-7/8"" hole f/ 11145'-11542' bit depth. 11/20/14 - Drill 8-1/2"" X 9-7/8"" hole section f/ 11542'-12390'. Unable to get back to btm @ 12390' due to pack off conditions. Rack back 1 std to 12367'. CBU, no gas on bottoms up. Increase MW f/ 11.4 ppg to 11.6 ppg, while attempting to work back to btm @ 12390' by varying pump rates f/ 150-500 GPM & adjusting rotary speeds f/ 30-120 RPM, no success, unable to pass 12387'. 11/21/14 - Increase MW f/ 11.4 ppg to 11.6 ppg, while attempting to work back to btm @ 12390' by varying pump rates from 150-500 GPM & adjusting rotary speeds f/ 30-120 RPM, no success, unable to pass 12387'. Down link MWD to neutral for POOH. Pump OOH to 9715'. POOH f/ 9715' to surface. MU 8-1/2"" BHA, RIH to 62'. 11/22/14 - MU & RIH w/ 8-1/2"" BHA to 1500'. Shallow test MWD. Function test BSR & CSR. RIH to 9-7/8"" shoe, break circ. RIH to 11383' tag w/ 20K. W&R to 12220', pump sweep & CBU. W&R f/ 12220-12390', unable to pass. 
Increase MW f/ 11.6 ppg to 11.8 ppg, while attempting to work back to btm @ 12390' @ various drilling parameters. Drill 4' of new formation to 12394'. Unable to work through & past 12394' due to pump pressure increase & high torque @ various depths between 12380' to btm @ 12394'."

To load the dataset and start labeling it manually, I executed this command:
prodigy ner.manual BOEM en_core_web_lg Training.csv --label entities.txt

But it returns this error:
ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.

Any ideas on what I am doing wrong? If I change the extension to .txt it works but includes the SN_WAR field as well.

P.S. If anyone is looking for a large dataset of public oil and gas drilling language, here is a link to it. This dataset represents the drillers' comments from 56,000 oil wells in the US.
Raw drilling comments data file ~250mb
List of all the datasets available from BOEM

I think I know what the problem is: the built-in CSV loader expects the text to be in a column named text or Text (see the “Input formats” > “CSV” section in the readme for an example). Your file's text column is called TEXT_REMARK, so the loader finds no text and the stream ends up empty. In your case, it’s probably easiest to just rename the column. If you ever need a more custom solution, you can always write your own loader script.
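If renaming the column isn't convenient, one alternative is to convert the file to JSONL before loading it. The following is only a rough sketch, not an official Prodigy loader: the function names `csv_to_tasks` and `convert` are made up for illustration, and the column names `TEXT_REMARK` and `SN_WAR` are taken from the sample row above. It puts each remark under a `"text"` key and keeps the serial number under `"meta"` so it isn't annotated as part of the text.

```python
import csv
import io
import json


def csv_to_tasks(csv_file, text_column="TEXT_REMARK", id_column="SN_WAR"):
    """Yield one task dict per CSV row, with the remark under the "text" key.

    The function name and default column names are assumptions based on
    the sample row in this thread.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if not text:
            # Skip rows with no remark so no empty tasks are produced.
            continue
        yield {"text": text, "meta": {id_column: row.get(id_column)}}


def convert(src_path, dst_path):
    """Write the tasks from a CSV file to a JSONL file, one JSON object per line."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for task in csv_to_tasks(src):
            dst.write(json.dumps(task) + "\n")


# Small in-memory demonstration using a shortened version of the sample row.
sample = io.StringIO(
    'SN_WAR,TEXT_REMARK\n'
    '-260554,"RIH w/ 9-7/8"" LNR to 11132\'. CBU."\n'
)
tasks = list(csv_to_tasks(sample))
print(json.dumps(tasks[0]))
```

After calling something like `convert("Training.csv", "Training.jsonl")`, the original command should work with `Training.jsonl` as the source in place of `Training.csv`.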

Thanks for sharing – I always like those types of very specific datasets. It also looks very similar to a lot of the incident report analysis people are working on using spaCy and Prodigy.

We’d love to do a tutorial using similar data, but I guess the oil drilling stuff is a little too specific, and I just don’t know enough about it to do something meaningful here. (Although, I feel like I learned a lot from the Prodigy forums already, haha. If I remember correctly, there was someone else on here who shared some of their progress on oil drilling NER a while ago :smiley: )

Thanks for the help! I was so close: I had changed the column header to TEXT, not Text :slight_smile: It is working now.

I did read the post you mentioned; you outlined an excellent way to approach a solution. As I get better at NLP I might make a video showing how to use Prodigy to train a drilling-specific model from the perspective of a domain expert as opposed to an NLP expert. It is a bit of a steep learning curve, and I am learning that the workflow is everything and that having good patterns for the technical content can help enormously. Your response to “Using a handmade annotation file for model training” also points to knowing when a rule-based system might be better than a machine-learned system. Knowing which tools to use in which circumstances seems to be the key to doing this well. At times it feels more like an art than a science. I am really enjoying learning this topic and appreciate all the help you have given as I feel my way through this material. Thanks again.
