How do I add technical terms and abbreviations


#1

Hi,
I am a new spaCy/Prodigy user and also fairly new to NLP.
I want to use NLP to classify oil well drilling reports. These reports contain several abbreviations and technical terms, for example:

"POOH 8 1/2” BHA f/1800m to surface"
where:
POOH = Pull Out of Hole,
8 1/2" = should be recognized as a single entity (and " means inches)
BHA = Bottom Hole Assembly
f/1800 = from 1800 meters

My plan so far is to use the templates in spacy-dev-resources to extend the English language to include the most frequently used drilling terms, although I am not sure if that is the best approach.

What would be a reasonable workflow to follow?


(Ines Montani) #2

Thanks for your question! You don’t even necessarily need to adjust the English language, unless you need custom tokenization. This is mostly relevant if you need more splitting – for example, if you need “1800” to be an entity, but it’s not split off as a separate token. I just tested your example text and the out-of-the-box tokenization looks pretty good to me:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"POOH 8 1/2” BHA f/1800m to surface")
print([token.text for token in doc])
# ['POOH', '8', '1/2', '”', 'BHA', 'f/1800', 'm', 'to', 'surface']

I think a good first project would be to train the model to recognise a few specific drilling entities. This will later make it easier to classify whole texts, and the process will also give you a better feeling for your data and what’s possible. If you haven’t seen it already, check out our video tutorial on training new entity types. I’ve also posted a more detailed comment here that goes through a workflow of training new entity types from scratch.

1. Create a patterns.jsonl file to find examples of the entities.

One of the more difficult parts is getting over the “cold start problem” and making sure the model sees enough positive examples in order to learn something about the new type. To make this easier, Prodigy lets you load in a file of match patterns, using the same format as the patterns for spaCy’s Matcher. Patterns can be exact strings, or dictionaries containing token attributes.

{"label": "DRILLING_TERM", "pattern": [{"orth": "POOH"}]}

(In my examples, I’ll call everything DRILLING_TERM, because I don’t know very much about oil well drilling :wink: But you’ll probably want to use more fine-grained categories here.)
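If you already have a glossary of frequent abbreviations, you can also generate the patterns file programmatically instead of writing each line by hand. Here’s a minimal sketch – the abbreviations other than POOH and BHA are just placeholders for your own list:

```python
import json

# Placeholder glossary – replace with your own frequent drilling abbreviations
abbreviations = ["POOH", "BHA", "RIH", "TD"]

with open("patterns.jsonl", "w", encoding="utf-8") as f:
    for abbr in abbreviations:
        entry = {"label": "DRILLING_TERM", "pattern": [{"orth": abbr}]}
        f.write(json.dumps(entry) + "\n")
```

Each line of the resulting file is one JSON object, which is exactly the JSONL format the --patterns argument expects.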

You can also do more complex stuff here, like working with the token’s shape_ attribute or binary flags. For example, the shape of “f/1800” is "x/dddd" and like_num returns True for both “8” and “1/2”. When writing the patterns, make sure to check spaCy’s tokenization and verify that each token you describe is indeed split off correctly – otherwise, the patterns won’t match.

{"label": "DRILLING_TERM", "pattern": [{"shape": "x/dddd"}, {"orth": "m"}]}
{"label": "DRILLING_TERM", "pattern": [{"shape": "x/ddd"}, {"orth": "m"}]}
{"label": "DRILLING_TERM", "pattern": [{"like_num": true}, {"like_num": true}, {"orth": "\""}]}
{"label": "DRILLING_TERM", "pattern": [{"like_num": true}, {"orth": "\""}]}

The above patterns will match entities like f/1800, f/180, 8 1/2", 5" and so on. You could even take it one step further and add a pattern for all tokens consisting of three or four uppercase characters. It’s fine if those produce false positives as well – you’ll still be annotating those examples later on, and it’s actually quite valuable to give the model feedback on those. (You don’t want it to just learn that “all uppercase tokens are drilling terms”.)

{"label": "DRILLING_TERM", "pattern": [{"shape": "XXX"}]}
{"label": "DRILLING_TERM", "pattern": [{"shape": "XXXX"}]}

You might have to try out a few different options here and be a little creative.
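One way to sanity-check both the tokenization and your patterns before annotating is to print each token’s shape_ and like_num values and dry-run the patterns through spaCy’s Matcher directly. A sketch, assuming spaCy v3 (where the Matcher.add signature takes a list of patterns and attribute names are uppercase, unlike the lowercase keys in Prodigy’s patterns file) and a blank English pipeline, which is enough for tokenization and these lexical attributes:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only – no trained model needed for this check
doc = nlp('POOH 8 1/2" BHA f/1800m to surface')

# Inspect how the text is tokenized and what each token's attributes look like
for token in doc:
    print(f"{token.text!r:12} shape={token.shape_!r:10} like_num={token.like_num}")

# Dry-run two of the patterns from above (uppercase keys for spaCy's Matcher)
matcher = Matcher(nlp.vocab)
matcher.add("DRILLING_TERM", [
    [{"SHAPE": "x/dddd"}, {"ORTH": "m"}],
    [{"LIKE_NUM": True}, {"LIKE_NUM": True}, {"ORTH": '"'}],
])
for match_id, start, end in matcher(doc):
    print("match:", doc[start:end].text)
```

If a pattern produces no matches here, it usually means the tokenizer split the text differently than the pattern assumes.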

2. Collect annotations of the new entity type(s) with the model in the loop.

Next, you can use the ner.teach recipe with the --patterns argument pointing to your patterns file. This will tell Prodigy to find matches of those terms in your data, and ask you whether they are instances of that entity type. For example:

prodigy ner.teach drilling_dataset en_core_web_sm your_data.jsonl --label DRILLING_TERM --patterns /path/to/patterns.jsonl

As you click accept or reject, the model in the loop will be updated, and will start learning about your new entity type. Once you’ve annotated enough examples, the model will also start suggesting entities based on what it’s learned so far. By default, the suggestions you’ll see are the ones that the model is most uncertain about – i.e. the ones with a prediction closest to 50/50. Those are also the most important ones to annotate, since they will produce the most relevant gradient for training. So don’t worry if they seem a little weird at first – this is good, because your model is still learning and by rejecting the bad suggestions, you’re able to improve it.
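The “most uncertain” selection can be pictured as sorting candidates by how close the model’s score is to 0.5. Prodigy’s actual sorters are more sophisticated, so the terms and scores below are made up purely to illustrate the idea:

```python
# Toy illustration of uncertainty sampling: candidates whose predicted
# probability is closest to 0.5 are the most informative to annotate.
# The terms and scores here are invented for the example.
scores = {"POOH": 0.93, "surface": 0.12, "BHA": 0.48, "f/1800m": 0.55}

by_uncertainty = sorted(scores, key=lambda term: abs(scores[term] - 0.5))
print(by_uncertainty)  # ['BHA', 'f/1800m', 'surface', 'POOH']
```

A confident prediction (0.93 or 0.12) teaches the model little either way; the 0.48 case is where your accept/reject decision carries the most information.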

3. Train a model and see how it performs.

I’d suggest collecting a few hundred examples using ner.teach before running your first training experiments with ner.batch-train. Since you’re only clicking accept or reject, this shouldn’t take very long, though. If the results look promising, you can also run the ner.train-curve recipe, which will train on different portions of the data. This lets you see how your model is improving with more training data. If you see an increase in accuracy within the last 25%, it’s likely that your model will improve even further with more examples.

Ultimately, how things go and how many examples you’ll need to achieve decent results depends on your data. So you might have to experiment a little here.

4. Test your model and plan the next steps.

ner.batch-train will export a loadable spaCy model, so you can try it out and run it over some text. If training went well, you should now have a model that can recognise your custom drilling entities.

import spacy

nlp = spacy.load('/path/to/your/model')
doc = nlp(u"POOH 8 1/2” BHA f/1800m to surface")  # some text
print(list(doc.ents))  # look at the entities that are recognised

What the next steps are of course depends on what you’re trying to do. For example, you might want to train a text classifier next, which assigns categories to the whole reports. Having a drilling-specific NER model will be very valuable here, because it can help you extract more specific examples for annotation and achieve better accuracy overall.

I hope this was helpful!


#3

Super helpful!
Thank you for taking the time to reply.
I’ll give it a go.