I’m not sure whether I’m dealing with a bug or doing something wrong. If it’s the latter, you might still be interested in why I’m doing this, so I’m going to write a fairly verbose message that lets you follow my thought process.
My goal is to train a new NER entity named DISASTER that recognizes, for example, floods, storms and volcano eruptions. Yesterday I followed your video in which you train a DRUG entity, and in the end I got the results I wanted. But now I’m struggling when the pattern for an entity consists of several words, like ‘volcano eruption’.
My first attempt was to create a patterns file with those multi-word tokens:
{"label":"DISASTER","pattern":[{"lower":"volcano eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"volcanic eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"volcanic ash"}]}
{"label":"DISASTER","pattern":[{"lower":"ash"}]}
and use it to train the entity on news articles:
prodigy ner.teach disasters_ner en_core_web_lg "eruption" --api guardian --label DISASTER --patterns volcano_patterns.jsonl
It finds the word ‘ash’ in news articles, but not the multi-word token.
So I dug deeper into the spaCy documentation, found that those patterns need a different representation, and updated volcano_patterns.jsonl to:
{"label":"DISASTER","pattern":[{"lower":"volcano"}, {"is_space": true}, {"LOWER": "eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"volcanic"}, {"is_space": true}, {"LOWER": "eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"ash"}]}
However, this still doesn’t recognize the multi-word tokens.
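For reference, my current reading of the spaCy Matcher documentation is that each dict in a pattern describes exactly one token, and that the single space between two words is not a token of its own. If that reading is right, neither a dict containing "volcano eruption" nor an is_space dict between the two words would match. Under that assumption, here is a minimal sketch of how the patterns file could be generated with one dict per word. The helper name phrase_to_pattern is hypothetical, and splitting on whitespace only approximates the real tokenizer for simple phrases like these.

```python
import json


def phrase_to_pattern(phrase, label):
    """Build one patterns-file entry with one token dict per word.

    Assumption: splitting on whitespace matches how the tokenizer
    would split these simple phrases (no punctuation, no hyphens).
    """
    tokens = [{"lower": word.lower()} for word in phrase.split()]
    return {"label": label, "pattern": tokens}


phrases = ["volcano eruption", "volcanic eruption", "volcanic ash", "ash"]

# One JSON object per line, as expected in a .jsonl patterns file
lines = [json.dumps(phrase_to_pattern(p, "DISASTER")) for p in phrases]
print("\n".join(lines))
```

The printed lines could be saved as volcano_patterns.jsonl and passed to ner.teach with --patterns as before.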
Then I remembered that I had read about ner.manual and thought that I could use it to generate valid patterns for the teaching. So I ran
prodigy ner.manual volcano_patterns en_core_web_lg "eruption" --api guardian --label "DISASTER"
and there I labelled multi-word tokens like ‘volcanic ash’ as DISASTER. I did this for a few articles and saved the annotations. I thought the next step would be to generate a patterns file from the newly created annotations, so I ran:
(py3) ~/projects/tripler/data-analysis/spacy-ner (master): prodigy terms.to-patterns volcano_patterns
{"label": null, "pattern": [{"lower": "Bali: Mount Agung volcano monitored after second eruption"}]}
{"label": null, "pattern": [{"lower": "Asp \u2013 or ash? Climate historians link Cleopatra's demise to volcanic eruption"}]}
{"label": null, "pattern": [{"lower": "Bali volcano eruption could be hours away after unprecedented seismic activity"}]}
{"label": null, "pattern": [{"lower": "Bali: travel warning issued as volcano threatens to erupt"}]}
{"label": null, "pattern": [{"lower": "Mount Agung: Bali airport closed as volcano alert raised to highest level"}]}
{"label": null, "pattern": [{"lower": "Bali: Mount Agung volcano monitored after second eruption"}]}
{"label": null, "pattern": [{"lower": "Asp \u2013 or ash? Climate historians link Cleopatra's demise to volcanic eruption"}]}
{"label": null, "pattern": [{"lower": "Bali: Mount Agung volcano monitored after second eruption"}]}
{"label": null, "pattern": [{"lower": "Asp \u2013 or ash? Climate historians link Cleopatra's demise to volcanic eruption"}]}
{"label": null, "pattern": [{"lower": "Bali volcano eruption could be hours away after unprecedented seismic activity"}]}
I would have expected the label to be DISASTER and the pattern to contain only the part of the text that I had marked, for example ‘volcanic eruption’. I have the feeling that I did not really understand how to use ner.manual correctly.
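To make concrete what I expected, here is a rough sketch of the conversion I had in mind. It assumes the saved annotations (for example, exported with prodigy db-out) are JSONL tasks with a "text" field, an "answer" field, and a "spans" list whose entries carry "start", "end" and "label". The helpers mark and spans_to_patterns and the two example tasks are hypothetical and only illustrate the idea. They are not my real annotations.

```python
import json


def mark(text, phrase, label):
    """Hypothetical helper: build an annotated task with one highlighted span."""
    start = text.index(phrase)
    span = {"start": start, "end": start + len(phrase), "label": label}
    return {"text": text, "spans": [span], "answer": "accept"}


def spans_to_patterns(tasks):
    """Turn highlighted spans into token-based pattern entries, skipping duplicates."""
    seen = set()
    patterns = []
    for task in tasks:
        # Only accepted annotations should contribute patterns
        if task.get("answer", "accept") != "accept":
            continue
        for span in task.get("spans", []):
            marked = task["text"][span["start"]:span["end"]].lower()
            key = (span["label"], marked)
            if not marked.strip() or key in seen:
                continue
            seen.add(key)
            # Assumption: whitespace splitting approximates the tokenizer
            tokens = [{"lower": word} for word in marked.split()]
            patterns.append({"label": span["label"], "pattern": tokens})
    return patterns


# Made-up annotations on two headlines, with one repeated span
example_tasks = [
    mark("Bali volcano eruption could be hours away", "volcano eruption", "DISASTER"),
    mark("Climate historians link demise to volcanic eruption", "volcanic eruption", "DISASTER"),
    mark("Bali volcano eruption could be hours away", "volcano eruption", "DISASTER"),
]

patterns = spans_to_patterns(example_tasks)
for entry in patterns:
    print(json.dumps(entry))
```

With these made-up tasks, each distinct marked phrase would yield one entry labelled DISASTER that contains only the highlighted words, which is the kind of output I was hoping for.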
Can you please point me in the right direction?
Best, Stephan