Store the annotations obtained from ner.manual with --patterns all at once

I am using the ner.manual recipe with patterns to annotate a given text

prodigy ner.manual dataset spacy_model source --label --patterns

However, I don't want to go through all annotations and confirm them one by one. Instead, I want to store all entities/labels matched by the patterns file in the database at once. I will then use this initial annotation to build the model in an active learning scenario.

Hi! In that case, you could just load the patterns with spaCy directly to label all matches automatically and then use that data to pretrain your model. My comment here explains how to do this:

Using the EntityRuler has the advantage that it takes patterns in the same format as Prodigy and takes care of filtering out overlaps (which can theoretically occur with multiple patterns).
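For reference, here's a minimal sketch of that approach using the current spaCy v3 API (in spaCy v2, which the code further down in this thread uses, you'd construct `EntityRuler(nlp)` and pass the component object to `nlp.add_pipe()`). The inline pattern and text are just placeholders:

```python
import spacy

# Blank English pipeline with only an entity ruler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Patterns use the same format as Prodigy's --patterns file
ruler.add_patterns([{"label": "ORG", "pattern": [{"lower": "acme"}]}])

doc = nlp("Acme released a new product.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Instead of `add_patterns`, you can also load a Prodigy-style JSONL patterns file from disk with `ruler.from_disk(...)`.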

@ines Thanks a lot!
It works now.

Here I share my experience:
My source data is in jsonl format and looks like:

{"text":"abcd","meta":{"source":"doc1"}}
...

I wrote a script (compatible with spaCy 2.x) based on your explanation to read a set of documents and annotate them based on the patterns file:

import json
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

# path of the jsonl file that will contain the annotations to be loaded into the db
db_jsonl_path = 'db_jsonl.jsonl'
nlp = English()
ruler = EntityRuler(nlp)
# the patterns file
ruler.from_disk('patterns.jsonl')
nlp.add_pipe(ruler)

# source data in jsonl format
source_path = 'soure_data.jsonl'

with open(source_path, 'r') as source_file, open(db_jsonl_path, 'w') as out_file:
    for line in source_file:
        data = json.loads(line.strip())
        doc = nlp(data['text'])
        spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                 for ent in doc.ents]
        example = {"text": doc.text, "spans": spans}
        out_file.write(json.dumps(example) + '\n')

When done, load the performed annotation, stored in db_jsonl_path, into a Prodigy db:

prodigy db-in db_name path/db_jsonl.jsonl

I still have a simple question: how can I add the metadata ("meta": {"source": "doc1"}) to each example so it is stored in the db later along with the other information, like entities, positions, labels, etc.?

You can add all of that to the dict that you create as example in your code :slightly_smiling_face: The "text" and "spans" are what's required to annotate named entities, but you can also include a key "meta" with custom properties – for example, the index of the current line (you can just increment a counter variable or use Python's enumerate()).

Everything in "meta" will be displayed in the bottom right corner of the annotation card. You can also include any other custom properties in the example that will be saved with the annotations in the database (e.g. for meta info that you don't want to display in the UI).
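To illustrate, here's a minimal stdlib-only sketch of building such an example dict inside the loop from the script above. The input line and the `line` counter key are just assumptions for the example:

```python
import json

# One source line, as read from the jsonl file
lines = ['{"text": "abcd", "meta": {"source": "doc1"}}']

for i, line in enumerate(lines):
    data = json.loads(line)
    example = {
        "text": data["text"],
        "spans": [],  # spans produced by the EntityRuler would go here
        # carry over the source meta and add a line counter;
        # everything in "meta" shows up in the UI's bottom right corner
        "meta": dict(data.get("meta", {}), line=i),
    }
    print(json.dumps(example))
```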

Thanks a lot, I added the metadata and it is displayed in the bottom right corner of the annotation card:

import json
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

# path of the jsonl file that will contain the annotations to be loaded into the db
db_jsonl_path = 'db_jsonl.jsonl'
nlp = English()
ruler = EntityRuler(nlp)
# the patterns file
ruler.from_disk('patterns.jsonl')
nlp.add_pipe(ruler)

# source data in jsonl format
source_path = 'soure_data.jsonl'

with open(source_path, 'r') as source_file, open(db_jsonl_path, 'w') as out_file:
    for line in source_file:
        data = json.loads(line.strip())
        meta = data['meta']
        doc = nlp(data['text'])
        spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                 for ent in doc.ents]
        example = {"text": doc.text, "spans": spans, "meta": meta}
        out_file.write(json.dumps(example) + '\n')