Roadmap for a unified model for tokenizing, NER and dependency parsing using Prodigy

Hello,
I want to build an NER system to extract information from texts. My texts are in German and contain some special cases that I want the tokenizer to handle correctly; for example, I have unique identifiers like "F01-RS2:2" that must be recognized as ONE token and classified as IDENTIFIER (after training with Prodigy).

I followed these steps:
1- I created a blank spaCy model in German
2- I added special cases to the tokenizer
3- I saved the model:

nlp=spacy.load('blank:de') 
nlp.tokenizer.add_special_case('°C', [{"ORTH": '°C'}])
nlp.tokenizer.add_special_case('Xdd-XXd:d', [{"SHAPE": 'Xdd-XXd:d'}])

nlp.to_disk('./my-model')

After that I changed the model name in ./my-model/meta.json to "name": "extended_de_model".

4- I created a custom model package using spacy package in the terminal:

spacy package ./my-model/ ./model_packages
#then
pip install de_extended_de_model

5- I started Prodigy with:

prodigy ner.manual all_annotation3 de_extended_de_model ./input_prodigy.jsonl --label CLASS,UI,VALUE --patterns ./patterns.jsonl

and annotated all texts...

6- I trained the model:

prodigy train --ner all_annotation3 tmp/trained-model --eval-split 0.25


FIRST PROBLEM:
The saved models model-last and model-best are English models! I checked meta.json and it says "lang": "en".

SECOND PROBLEM:
The tokenizer of that model is not the same as the one I created in steps 3 and 4.

I tried to add_pipe the ner component from model-best into de_extended_de_model, but that also fails at labeling.

MY GOAL:
In the end I want one unified model that tokenizes the way I predefined (using the special cases) and then labels the tokens correctly, because I want to do further work with rule-based pattern matching, dependency parsing/matching and so on.

Note: I also tried taking a model like de_core_news_lg and adding the ner component from model-best via add_pipe, but then it recognizes the entities in a wrong way... :frowning_face:

Thanks :slight_smile:

hi @Rashid!

Thanks for your example!

A few things:

Make sure you add --lang de to either prodigy train (see docs) or (if you want to train directly in spaCy) to prodigy data-to-spacy. Also, you'll likely want to use the --base-model option to pass in your custom tokenizer model. Without it, Prodigy will just use the default tokenizer for the language you provided.
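For example, something along these lines (a sketch using the dataset and package names from your post; the output path is just a placeholder):

prodigy train ./tmp/trained-model --ner all_annotation3 --lang de --base-model de_extended_de_model --eval-split 0.25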

In general, I'd recommend using data-to-spacy, as it produces a config file you can inspect to verify that the parts of your pipeline are correct. prodigy train is really just a wrapper around spacy train but hides the config file (hence the confusion when you're training more advanced setups, like ones with modified tokenization).
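A rough sketch of that route (the ./corpus folder name is just my placeholder):

prodigy data-to-spacy ./corpus --ner all_annotation3 --lang de --base-model de_extended_de_model --eval-split 0.25
# inspect ./corpus/config.cfg, then train directly with spaCy
spacy train ./corpus/config.cfg --output ./tmp/trained-model --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy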

For this, I had to recreate everything from scratch (fyi, I'm not an expert in spaCy tokenizer, nor do I speak German, but I tried :slight_smile: ).

The key is what happens to this text: "Die Temperatur heute beträgt 25 °C."

The default tokenizer will tokenize it as:

[('TOKEN', 'Die'),
 ('TOKEN', 'Temperatur'),
 ('TOKEN', 'heute'),
 ('TOKEN', 'beträgt'),
 ('TOKEN', '25'),
 ('SPECIAL-1', '°'),
 ('SPECIAL-2', 'C'),
 ('SPECIAL-3', '.')]

But you want to make sure your special rule applies like this (see nlp2 in the gist):

[('TOKEN', 'Die'),
 ('TOKEN', 'Temperatur'),
 ('TOKEN', 'heute'),
 ('TOKEN', 'beträgt'),
 ('TOKEN', '25'),
 ('SPECIAL-1', '°C'), # makes sure this is kept
 ('SPECIAL-3', '.')]
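If it helps, here's a minimal sketch of how those two listings can be reproduced with spaCy's tokenizer.explain (nlp1/nlp2 are just the names used in the gist for the default vs. extended pipelines; it assumes your package is installed):

import spacy

text = "Die Temperatur heute beträgt 25 °C."

nlp1 = spacy.blank("de")                   # default German tokenizer
nlp2 = spacy.load("de_extended_de_model")  # your packaged pipeline with the °C special case

print(nlp1.tokenizer.explain(text))
print(nlp2.tokenizer.explain(text))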

One small difference on my end was that I had to provide the path to the packaged pipeline in order to pip install it.

So not:

pip install de_extended_de_model

But:

pip install model_packages/de_extended_de_model-0.0.0/

Also, I had errors running this step (maybe missing a character?):

nlp.tokenizer.add_special_case('Xdd-XXd:d', [{"SHAPE": 'Xdd-XXd:d'}])

so I skipped it.
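My guess (not something I verified in depth) is that special-case rules only accept exact-string attributes like ORTH/NORM, so a SHAPE-based rule won't do what you want there. If the goal is to keep identifiers like "F01-RS2:2" as single tokens, a token_match regex might be closer; here's a rough, untested sketch with a made-up pattern:

import re
import spacy

nlp = spacy.blank("de")
nlp.tokenizer.add_special_case('°C', [{"ORTH": '°C'}])

# hypothetical regex for identifiers shaped like F01-RS2:2 -- adjust to your real format
id_regex = re.compile(r"^[A-Z]\d{2}-[A-Z]{2}\d:\d$")

# note: this overrides any existing token_match for the language
nlp.tokenizer.token_match = id_regex.match

print([t.text for t in nlp("Siehe F01-RS2:2 für Details.")])
# should give something like: ['Siehe', 'F01-RS2:2', 'für', 'Details', '.']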

Can you check my gist and see if you can confirm? You really want to make sure this works, because otherwise Prodigy won't work either, as it simply reads in the spaCy pipeline.

When I then ran:

prodigy ner.manual de-token de_extended_de_model input.jsonl --label TEMPERATURE

My annotations/Prodigy seemed to tokenize correctly. For example, here's one of the annotations:

{
  "text": "Gestern war es sehr heiß, die Temperatur erreichte 35 °C.",
  "_input_hash": 1883056005,
  "_task_hash": -2135525849,
  "_is_binary": false,
  "tokens": [
    {
      "text": "Gestern",
      "start": 0,
      "end": 7,
      "id": 0,
      "ws": true
    },
   ...
    {
      "text": "°C", # the custom tokenization is used
      "start": 54,
      "end": 56,
      "id": 10,
      "ws": false
    },
    {
      "text": ".",
      "start": 56,
      "end": 57,
      "id": 11,
      "ws": false
    }
  ],
  "_view_id": "ner_manual",
  "spans": [
    {
      "start": 51,
      "end": 56,
      "token_start": 9,
      "token_end": 10,
      "label": "TEMPERATURE"
    }
  ],
  "answer": "accept",
  "_timestamp": 1688759766,
  "_annotator_id": "2023-07-07_15-55-52",
  "_session_id": "2023-07-07_15-55-52"
}

Perhaps also try removing the existing ner component first. For example:

import spacy

# start from the full German pipeline, add the custom tokenizer rule,
# then drop its pretrained ner so a fresh one can be trained
nlp = spacy.load('de_core_news_lg')
nlp.tokenizer.add_special_case('°C', [{"ORTH": '°C'}])
nlp.remove_pipe("ner")
nlp.to_disk('./my-model')

Without this, you'll get an error that you're missing a tok2vec component when you try to train with ner.
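As a quick sanity check (just a sketch), you can reload the saved pipeline and confirm that the custom rule survived and that ner is gone:

import spacy

nlp_check = spacy.load('./my-model')
print(nlp_check.pipe_names)  # 'ner' should no longer be listed
print([t.text for t in nlp_check('Die Temperatur beträgt 25 °C.')])  # '°C' should stay one token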

See if you can try these steps.

If you're still having issues, I'd recommend posting on the spaCy discussions forum. Your questions are much more spaCy than Prodigy -- but more importantly, the spaCy core team responds on that forum, so you'll get the best spaCy developers :slight_smile:

Hope this helps!