Passing additional information to the NER model


(Algis Dumbris) #1

Semi-structured data usually has several fields, for instance name, description, comment, etc.
I would like to pass this meta information to the NER model.
I can format the input string like “<NAME> name text <DESC> description text <COMM> comment text”. In this case, as I understand it, I need to add <NAME>, <DESC>, and <COMM> to the vocabulary as special words and teach the tokenizer to keep each one as a single token.

Does it make sense to incorporate this information about the field divisions into the input for the NER model? I mean for short texts of 1-10 words.
Could you suggest the best way to do it with minimal customization of the default ner.teach / ner.batch-train recipes?

P.S.: Thanks for the great Prodigy tool. :slight_smile:

(Matthew Honnibal) #2

That’s a reasonable idea. I’d like it to be easier to add features to the NER model, but currently we don’t have a good solution for that, so what you’re suggesting makes sense. I think you can just craft your marker tokens so that the tokenizer naturally keeps them together: something like μCOMMμ should stay a single token and be unambiguous enough.
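
As a rough sketch of that marker idea, the snippet below joins the fields of one record into a single string, prefixing each non-empty field with its marker. Only μCOMMμ comes from the suggestion above; μNAMEμ and μDESCμ, the field names, the helper add_field_markers and the sample record are assumptions made for illustration. It is still worth checking on your own data that the tokenizer really keeps each marker as one token.

```python
# Hypothetical mapping from field name to marker token.
# Only "μCOMMμ" appears in the thread; the other two follow the same pattern.
MARKERS = {
    "name": "μNAMEμ",
    "description": "μDESCμ",
    "comment": "μCOMMμ",
}


def add_field_markers(record, markers=MARKERS):
    """Join the fields of one record into a single marked-up string."""
    parts = []
    for field, marker in markers.items():
        # Treat missing or None values as empty so they are skipped.
        value = (record.get(field) or "").strip()
        if value:
            parts.append(f"{marker} {value}")
    return " ".join(parts)


# Made-up record, for illustration only.
example = {"name": "Acme drill", "description": "cordless 18V", "comment": ""}
text = add_field_markers(example)
# text == "μNAMEμ Acme drill μDESCμ cordless 18V"
```

Empty fields are skipped here so that the model never sees a marker with nothing after it; emitting the marker anyway would be an equally reasonable choice.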

If you don’t want to change the recipes, you could put the data generation or manipulation code into a separate script that writes to stdout. Most of the scripts accept input from stdin, so you can just pipe the output of your generator script into the recipe.
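
A minimal sketch of such a generator script, assuming the recipe reads newline-delimited JSON tasks with a "text" key. The marker names other than μCOMMμ, the field names, the helpers to_task and write_tasks, and the sample records are all hypothetical; a real script would load records from your own data source instead.

```python
import json
import sys

# Hypothetical (field, marker) pairs; only "μCOMMμ" is taken from the thread.
FIELD_MARKERS = [
    ("name", "μNAMEμ"),
    ("description", "μDESCμ"),
    ("comment", "μCOMMμ"),
]

# Made-up records standing in for a real data source.
SAMPLE_RECORDS = [
    {"name": "Acme drill", "description": "cordless 18V", "comment": "out of stock"},
    {"name": "Bolt set", "description": "", "comment": None},
]


def to_task(record):
    """Build one task dict whose text carries the field markers."""
    parts = []
    for field, marker in FIELD_MARKERS:
        value = (record.get(field) or "").strip()
        if value:
            parts.append(f"{marker} {value}")
    return {"text": " ".join(parts)}


def write_tasks(records, out=sys.stdout):
    """Write one JSON object per line, skipping records with no text."""
    for record in records:
        task = to_task(record)
        if task["text"]:
            # json.dumps escapes non-ASCII by default (μ becomes \u03bc),
            # which keeps the output safe for any terminal encoding and
            # decodes back to the same string on the reading side.
            out.write(json.dumps(task) + "\n")


write_tasks(SAMPLE_RECORDS)
```

Saved under a name of your choosing, the script's output can be piped into a recipe in place of a source file; how stdin is selected as the source may differ between versions, so check the recipe's documentation.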