Sure, this is definitely possible! Just to make sure I understand correctly – do you want to extract all `DATE` entities that the model already recognises and use those for a new entity type, or improve the label `DATE` by passing in patterns?
Here are some ideas and suggestions for both workflows, since I'm sure this will be helpful to others as well!
## Improving the `DATE` entity with patterns
Here, we're using token match patterns to suggest more examples of dates, to improve the model's `DATE` category.
### 1. Collect examples and test tokenization
Collect some examples of the dates you’re expecting and check how your model tokenizes them. This is important, because the patterns refer to the individual tokens – and ideally, you want to create generalised match patterns instead of exact patterns for each possible date in history. For example:
```python
import spacy

nlp = spacy.load('en_core_web_sm')  # or any other model

doc = nlp(u"Dec 20, 2017")
[token.text for token in doc]  # ['Dec', '20', ',', '2017']

doc = nlp(u"20.12.2017")
[token.text for token in doc]  # ['20.12.2017']
```
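Since the patterns in the next step build on token shapes, it can also help to print each token's `.shape_` while you're at it (a quick sketch, continuing with the `nlp` object from above):

```python
doc = nlp(u"Dec 20, 2017")
[token.shape_ for token in doc]  # ['Xxx', 'dd', ',', 'dddd']
```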
### 2. Write token match patterns
Check out the spaCy documentation on match patterns and create patterns for your dates that capture the tokens. A pattern consists of one dictionary per token, and each dictionary can include the available token attributes and flags. A nice trick here is to work with the token's `.shape_` attribute, which gives you a generalised representation of how the text looks – e.g. whether it contains digits, alphabetic characters or punctuation. For example, the shape of `20.12.2017` is `'dd.dd.dddd'`. Your `patterns.jsonl` file could then look something like this:
{"label": "DATE", "pattern": [{"shape": "Xxx"}, {"is_digit": true}, {"orth": ","}, {"shape": "dddd"}]}
{"label": "DATE", "pattern": [{"shape": "dd.dd.dddd"}]}
You might have to experiment a little here to capture different variations. To test your patterns, you can always convert them to Python, add them to spaCy's `Matcher` and try them out on a few examples.
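For instance, here's a minimal sketch of that test loop – it assumes spaCy v2's `Matcher.add` signature and uses the same patterns as above, with the attribute names uppercased for the `Matcher`:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# same patterns as in patterns.jsonl, with uppercase attribute names
matcher.add('DATE', None, [{'SHAPE': 'Xxx'}, {'IS_DIGIT': True}, {'ORTH': ','}, {'SHAPE': 'dddd'}])
matcher.add('DATE', None, [{'SHAPE': 'dd.dd.dddd'}])

doc = nlp(u"The meeting is on Dec 20, 2017 or on 20.12.2017.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # 'Dec 20, 2017', '20.12.2017'
```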
### 3. Start teaching with the patterns file
You can load in the patterns file by setting the `--patterns` argument on `ner.teach`. Don't forget to also set the label to `DATE` to make sure you're only annotating date entities. For example:
```bash
prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl --label DATE --patterns patterns.jsonl
```
Prodigy will now stream in your text and show you both suggestions from your patterns and suggestions made by the model. You can see where each suggestion is coming from in the annotation task meta displayed in the bottom right corner of the card.
If you only want to label examples from the patterns file without a model in the loop, you can also use the `ner.match` recipe (see here for details). You'll still have to pass in a model, but it's only used for tokenization and sentence boundary detection:
```bash
prodigy ner.match my_dataset en_core_web_sm my_data.jsonl --patterns patterns.jsonl
```
I’d suggest collecting a few hundred annotations before starting your first training experiments.
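Once you've collected those annotations, you can run a training experiment – for example with the `ner.batch-train` recipe. The exact arguments can vary between Prodigy versions, so treat the following as a sketch (the output path is just a placeholder) and check the recipe docs:

```bash
prodigy ner.batch-train my_dataset en_core_web_sm --output /path/to/model
```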
## Extracting existing entities as patterns
If you want to extract all `DATE` spans that are already recognised in the corpus and convert them to patterns, you'll probably want to process your corpus first, get all dates, convert them to patterns, save out the file and then use it with `ner.teach` or `ner.match`. For example, you could do something like this:
```python
import json
from pathlib import Path

import spacy

texts = ['text one...', 'text two...', 'text three...']  # your corpus
label = 'NEW_LABEL'  # the label you want to assign to the patterns
patterns = []  # collect patterns here

nlp = spacy.load('en_core_web_sm')  # or any other model
for doc in nlp.pipe(texts):  # use nlp.pipe for efficiency
    for ent in doc.ents:
        if ent.label_ == 'DATE':  # if a DATE entity is found
            # one dict per token, so multi-token dates are matched correctly
            pattern = [{'lower': token.lower_} for token in ent]
            patterns.append({'label': label, 'pattern': pattern})

# dump JSON and write patterns to file, one pattern per line
jsonl = '\n'.join(json.dumps(pattern) for pattern in patterns)
Path('patterns.jsonl').write_text(jsonl, encoding='utf-8')
```
This will give you a JSONL file with one entry per `DATE` entity found in your corpus, e.g. `{"label": "NEW_LABEL", "pattern": [{"lower": "dec"}, {"lower": "2017"}]}`. You'll probably also want to tweak the script to make sure your patterns don't contain any duplicates – see the sketch after the command below. You can then load in the file in `ner.teach`, just like in the example above:
```bash
prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl --label NEW_LABEL --patterns patterns.jsonl
```
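As for the de-duplication tweak mentioned above, one simple approach (a sketch that assumes the `patterns` list from the script) is to serialise each pattern and keep only its first occurrence:

```python
seen = set()
unique_patterns = []
for pattern in patterns:
    key = json.dumps(pattern, sort_keys=True)  # hashable representation of the pattern
    if key not in seen:
        seen.add(key)
        unique_patterns.append(pattern)
```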
I hope this answered your questions – let me know if I forgot something or an aspect of the workflow is still unclear!