I'm building a new NER model.
is this sane?
(also, a question about PhraseMatcher is coming in a minute)
Some entities exist in the base model (e.g. GPE, LOC, DATE, TIME); I'd like to use those when creating a silver training set.
Some entities are new.
Of the new ones, several are found in custom databases (tens of thousands of named entities, each belonging to one of several categories).
Other new entities are domain-specific terms.
I would like to bootstrap the annotations with patterns to speed things up for training the new model.
Here's my plan (and things I haven't been able to do):
-
For the entities that exist in the base model, I can run the base model over a few hundred sample texts and annotate them with it, then review those entities with ner.manual and e.g. --label GPE,LOC,DATE,TIME.
(I guess at this point I could also be using teach with that same base model that generated them, though I don't expect it to be quicker, given that it was a trained model to begin with and en_core_web_trf was trained on a lot more data than I can provide.)
This works.
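For reference, here is a minimal sketch of that pre-annotation step. The file names, the label set, and the span fields are my assumptions about what the review step expects (Prodigy tasks accept pre-set "spans" with character offsets):

```python
import json
import spacy

# Base-model labels to keep for the silver set.
KEEP = {"GPE", "LOC", "DATE", "TIME"}

def pre_annotate(nlp, texts, keep=KEEP):
    """Yield Prodigy-style tasks with entity spans predicted by nlp."""
    for doc in nlp.pipe(texts):
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
            if ent.label_ in keep
        ]
        yield {"text": doc.text, "spans": spans}

def run(model="en_core_web_trf", src="samples_s.jsonl", dst="silver_base.jsonl"):
    # Read the sample texts, pre-annotate, and write tasks for review.
    nlp = spacy.load(model)
    with open(src) as f:
        texts = [json.loads(line)["text"] for line in f]
    with open(dst, "w") as out:
        for task in pre_annotate(nlp, texts):
            out.write(json.dumps(task) + "\n")
```

The output file could then be fed into the review step instead of re-running the model inside the recipe.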
For the DB-based new entities, my plan was to run ner.manual with --patterns ./patt.jsonl, after generating a patt.jsonl file that looks like this:
{"label": "CUSTOM_L", "pattern": [{"LOWER": "token1"}, {"LOWER": "token2"}]}
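For completeness, a sketch of how that patt.jsonl could be generated from the database terms (the label and file name are placeholders; a blank pipeline is enough since only tokenization is needed):

```python
import json
import spacy

# Blank English pipeline: we only need token boundaries, not predictions.
nlp = spacy.blank("en")

def term_to_pattern(term, label):
    """Turn a multi-word term into a case-insensitive token pattern."""
    tokens = [t.text for t in nlp.make_doc(term)]
    return {"label": label, "pattern": [{"LOWER": tok.lower()} for tok in tokens]}

def write_patterns(terms, label, path):
    # One JSON object per line, as ner.manual expects.
    with open(path, "w") as out:
        for term in terms:
            out.write(json.dumps(term_to_pattern(term, label)) + "\n")

# write_patterns(db_terms, "CUSTOM_L", "patt.jsonl")  # db_terms from your DB
```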
This works, but... I saw some comments that the PhraseMatcher is faster, so I tried to create a file that looks like this:
{"label": "CUSTOM_L", "pattern": "token1 token2"}
{"label": "CUSTOM_L", "pattern": "Token1 Token2"}
{"label": "CUSTOM_L", "pattern": "TOKEN1 TOKEN2"}
but prodigy crapped out and wouldn't start even after 15 minutes (it would start when I truncated the file to 1000 lines).
i.e. this:
prodigy ner.manual single_silver_ENT_A en_core_web_trf ./samples_s.jsonl --patterns ENT_A_PHRASES.jsonl --label ENT_A
results in this:
Using 1 label(s): ENT_A
and then nothing happens.
- I thought of creating a pipeline inside spaCy and adding the PhraseMatcher as a step in the pipeline, similar to the entity_ruler, but did not find a way to do it. Am I missing something???
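On the pipeline question, my understanding is that spaCy's entity_ruler already uses a PhraseMatcher internally whenever a pattern is a plain string (token-dict patterns go to the slower token Matcher), so a separate component may not be needed. A sketch, assuming spaCy v3; the label and phrases are placeholders, and phrase_matcher_attr="LOWER" would also remove the need for the three case variants above:

```python
import spacy

nlp = spacy.blank("en")
# String patterns are routed to the ruler's internal PhraseMatcher;
# matching on the LOWER attribute makes it case-insensitive, so one
# lowercase pattern covers all case variants.
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns([{"label": "CUSTOM_L", "pattern": "token1 token2"}])

doc = nlp("Saw Token1 Token2 here")
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Token1 Token2', 'CUSTOM_L')]
```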
-
For the others I can use the LLM labeling bootstrapping method:
create a few silver datasets, each with different entities, all on the same sample texts,
then use silver-to-gold, then train, then correct.
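To see that loop in one place, a hypothetical command sequence; the dataset names are placeholders, and to my knowledge ner.silver-to-gold is a recipe script from the prodigy-recipes repo loaded with -F rather than a built-in, so check the recipe file for the exact arguments:

```shell
# Merge the per-entity silver sets, upgrade them to gold, then train and correct.
prodigy db-merge silver_ENT_A,silver_ENT_B silver_all
prodigy ner.silver-to-gold gold_all silver_all en_core_web_trf -F ner_silver_to_gold.py
prodigy train ./model_out --ner gold_all
prodigy ner.correct gold_v2 ./model_out/model-best ./samples_s.jsonl --label ENT_A,ENT_B
```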
(the nice diagram doesn't have silver-to-gold or teach...)