@ines Thanks a lot!
It works now.
Here is what I did, in case it helps others:
My source data is in JSONL format and looks like this:
{"text":"abcd","meta":{"source":"doc1"}}
.
.
Based on your explanation, I wrote some code (compatible with spaCy 2.5) that reads a set of documents and annotates them using a patterns file:
import json

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

# path of the jsonl file that will hold the annotations to be loaded into the db
db_jsonl_path = 'db_jsonl.jsonl'
nlp = English()
ruler = EntityRuler(nlp)
# the patterns file
ruler.from_disk('patterns.jsonl')
nlp.add_pipe(ruler)
# source data in jsonl format
source_path = 'soure_data.jsonl'
# open the output file once, so each example is written on its own line
# instead of overwriting the file on every iteration
with open(source_path, 'r') as source_file, open(db_jsonl_path, 'w') as f:
    for line in source_file:
        data = json.loads(line.strip())
        doc = nlp(data['text'])
        spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
        example = {"text": doc.text, "spans": spans}
        f.write(json.dumps(example) + '\n')
When that is done, load the annotations stored in db_jsonl_path into a Prodigy db:
prodigy db-in db_name path/db_jsonl.jsonl
I still have one simple question: how do I add the metadata ("meta":{"source":"doc1"}) to the spans, so that it is stored in the db along with the other information (entities, positions, labels, etc.)?
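One idea I have not confirmed yet: instead of putting the metadata inside each span, copy the "meta" dict from the source record to the top level of every example, next to "text" and "spans". The sketch below assumes that db-in keeps such top-level keys when it imports the file. build_example is a hypothetical helper of my own (not part of spaCy or Prodigy), and the "DEMO" label and offsets are made up for illustration.

```python
import json


def build_example(record, ents):
    """Build one Prodigy-style example from a parsed source record.

    record: one parsed line of the source JSONL (a dict with "text" and,
            optionally, "meta").
    ents:   list of (start_char, end_char, label) tuples, e.g. taken from doc.ents.
    """
    spans = [{"start": start, "end": end, "label": label} for start, end, label in ents]
    example = {"text": record["text"], "spans": spans}
    # Assumption: metadata lives at the top level of the example, not inside the spans.
    if "meta" in record:
        example["meta"] = record["meta"]
    return example


record = json.loads('{"text":"abcd","meta":{"source":"doc1"}}')
example = build_example(record, [(0, 2, "DEMO")])
line = json.dumps(example)  # one line of the output JSONL file
```

In the loop from my code, the second argument would be [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].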