Use Case Feasibility


  • I am currently building an NER-based system that detects SSNs and account IDs in free-form (unstructured) text fields.
  • I have manually annotated around 1,000 sentences with tags, and I plan to use this data set as the starting point for annotating the remaining chunk of 50,000 sentences with Prodigy.
    q1) Is it possible to create two new entity types using the data set of 1,000 sentences?
    q2) If yes, what would be the ideal workflow?
    q3) Is the Prodigy model capable of picking up the context from the sentences (training data)?


Yes, you can definitely try to train that. You can train an initial model on the 1,000 examples you already have, and then improve it with the remaining unlabelled examples. For instance, you could use a recipe like ner.make-gold to see the model's predictions and correct them by hand.
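To train on the 1,000 existing annotations, they first need to be in a form with character offsets. Here is a minimal sketch of that conversion, assuming each annotated sentence is available as a text plus a list of (value, label) pairs. The input layout, the helper names and the label names SSN and ACCOUNT_ID are assumptions for illustration; the output follows the "text" plus "spans" (start, end, label) JSONL layout that Prodigy works with.

```python
import json


def to_record(text, annotations):
    """Build one task dict with character-offset spans.

    `annotations` is a list of (value, label) pairs. This is an assumed
    input layout; adapt it to however the 1,000 sentences were tagged.
    Only the first occurrence of each value is marked.
    """
    spans = []
    for value, label in annotations:
        start = text.find(value)
        if start == -1:
            raise ValueError(f"{value!r} not found in {text!r}")
        spans.append({"start": start, "end": start + len(value), "label": label})
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


def to_jsonl(records):
    """Serialize the records as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(record) for record in records)


# Made-up example sentence; the account ID format is hypothetical.
examples = [
    (
        "My SSN is 123-45-6789 and my account is AC-0042.",
        [("123-45-6789", "SSN"), ("AC-0042", "ACCOUNT_ID")],
    ),
]
records = [to_record(text, annotations) for text, annotations in examples]
jsonl = to_jsonl(records)
```

The resulting JSONL could then be imported into a Prodigy dataset (for example with db-in) and used to train the first model before correcting its predictions on the unlabelled sentences.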

You probably want to train a new entity recognizer from scratch instead of updating a pre-trained model, because you'll likely see a lot of conflicts between the pre-trained entity types for numbers etc. and the types you want to add. Teaching a pre-trained model a completely new definition with so few examples is really tricky, and you'd always be fighting the existing weights.

You might also want to try augmenting your model with rules to improve the runtime accuracy.
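As a rough sketch of what rule-based augmentation could look like, the snippet below finds candidate spans with regular expressions and adds them to a model's predictions when they don't overlap. The SSN pattern assumes the common 3-2-4 digit layout, and the account ID pattern is purely hypothetical, since the real format isn't known here. The function names are also made up for illustration.

```python
import re

# Assumed patterns: SSNs written as 123-45-6789, and a hypothetical
# account ID format "AC-" followed by at least four digits.
RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ACCOUNT_ID": re.compile(r"\bAC-\d{4,}\b"),
}


def rule_spans(text):
    """Return all rule matches as span dicts, sorted by start offset."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    return sorted(spans, key=lambda span: span["start"])


def merge_spans(model_spans, rule_based):
    """Keep every model span and add rule spans that overlap none of them."""
    merged = list(model_spans)
    for rule in rule_based:
        overlaps = any(
            rule["start"] < kept["end"] and kept["start"] < rule["end"]
            for kept in merged
        )
        if not overlaps:
            merged.append(rule)
    return sorted(merged, key=lambda span: span["start"])


text = "SSN 123-45-6789, account AC-00421."
# Pretend the statistical model only found the SSN.
model_spans = [{"start": 4, "end": 15, "label": "SSN"}]
combined = merge_spans(model_spans, rule_spans(text))
```

Here the model's predictions take priority and the rules only fill gaps; the opposite priority is just as reasonable for very rigid formats. Inside a spaCy pipeline, a rule-based component such as the EntityRuler can play a similar role.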

In general, predicting spans of tokens based on the context is one of the key benefits of NER, yes :)

Prodigy itself only ships with the annotation models, not the actual models you're training. The built-in recipes use spaCy for that, but you can also bring your own. spaCy's NER models (like many other similar implementations) are sensitive to the very local context, i.e. the surrounding words.