Seeding named entity detection annotation with a pattern


(W.P. McNeill) #1

I’m doing named entity detection to pick out particular dates in a document. It takes human judgement to distinguish the desired dates, but every entity I’m interested in is going to be a date.

This is like the named entity detection task you describe in the Prodigy video tutorials, except instead of a list of seed words, I want the seeds to just be all the DATE spans in my corpus. I think this should be easy, but I haven’t figured out the exact commands yet.

Highlighting spans during text classification annotation
(Ines Montani) #2

Sure, this is definitely possible! Just to make sure I understand correctly – do you want to extract all DATE entities that the model already recognises and use those for a new entity type, or improve the label DATE by passing in patterns?

Here are some ideas and suggestions for both workflows, since this might be helpful to others as well :blush:

Improving DATE entity with patterns

Here, we’re using token match patterns to suggest more examples of dates, to improve the model’s DATE category.

1. Collect examples and test tokenization

Collect some examples of the dates you’re expecting and check how your model tokenizes them. This is important, because the patterns refer to the individual tokens – and ideally, you want to create generalised match patterns instead of exact patterns for each possible date in history. For example:

doc = nlp(u"Dec 20, 2017")
[token.text for token in doc]  # ['Dec', '20', ',', '2017']

doc = nlp(u"20.12.2017")
[token.text for token in doc]  # ['20.12.2017']

2. Write token match patterns

Check out the spaCy documentation on match patterns and create patterns for your dates that capture the tokens. A pattern consists of one dictionary per token and can include any of the available token attributes and flags. A nice trick here is to work with the token’s .shape_ attribute, which gives you a generalised representation of how the text looks – e.g. whether it contains digits, alphabetic characters or punctuation. For example, the shape of 20.12.2017 is 'dd.dd.dddd'. Your patterns.jsonl file could then look something like this:

{"label": "DATE", "pattern": [{"shape": "Xxx"}, {"is_digit": true}, {"orth": ","}, {"shape": "dddd"}]}
{"label": "DATE", "pattern": [{"shape": "dd.dd.dddd"}]}

You might have to experiment a little here to capture different variations. To test your patterns, you can always convert them to Python, add them to spaCy’s Matcher and try them out on a few examples.
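For instance, the two patterns above could be tried out directly in spaCy’s Matcher. This is just a sketch and assumes a recent spaCy version, where Matcher.add takes a list of patterns; a blank pipeline is enough here, because SHAPE, IS_DIGIT and ORTH are lexical attributes that don’t need a trained model:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only – enough for lexical patterns
matcher = Matcher(nlp.vocab)
# the same two patterns as in patterns.jsonl, as Python dicts
matcher.add("DATE", [
    [{"SHAPE": "Xxx"}, {"IS_DIGIT": True}, {"ORTH": ","}, {"SHAPE": "dddd"}],
    [{"SHAPE": "dd.dd.dddd"}],
])

for text in ["Dec 20, 2017", "20.12.2017"]:
    doc = nlp(text)
    matches = matcher(doc)
    print(text, "->", [doc[start:end].text for _, start, end in matches])
```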

3. Start teaching with the patterns file

You can load in the patterns file by setting the --patterns argument on ner.teach. Don’t forget to also set the label to DATE to make sure you’re only annotating date entities. For example:

prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl --label DATE --patterns patterns.jsonl 

Prodigy will now stream in your text and show you both suggestions from your patterns, as well as suggestions made by the model. You can see where the suggestion is coming from in the annotation task meta displayed in the bottom right corner of the card.

If you only want to label examples from the patterns file without a model in the loop, you can also use the ner.match recipe (see here for details). You’ll still have to use a model, but it’s only needed for tokenization and sentence boundary detection:

prodigy ner.match my_dataset en_core_web_sm my_data.jsonl --patterns patterns.jsonl

I’d suggest collecting a few hundred annotations before starting your first training experiments.

Extracting existing entities as patterns

If you want to extract all DATE spans that are already recognised in the corpus and convert them to patterns, you’ll probably want to process your corpus first, get all dates, convert them to patterns, save out the file and then use it with ner.teach or ner.match. For example, you could do something like this:

import json
import spacy
from pathlib import Path

texts = ['text one...', 'text two...', 'text three...']  # your corpus
label = 'NEW_LABEL'  # the label you want to assign to the patterns
patterns = []  # collect patterns here

nlp = spacy.load('en_core_web_sm')  # or any other model
docs = nlp.pipe(texts)  # use nlp.pipe for efficiency
for doc in docs:
    for ent in doc.ents:
        if ent.label_ == 'DATE':   # if a DATE entity is found
            # one dict per token, since patterns describe individual tokens
            pattern = [{'lower': token.lower_} for token in ent]
            patterns.append({'label': label, 'pattern': pattern})

# dump JSON and write patterns to file
jsonl = [json.dumps(pattern) for pattern in patterns]
Path('patterns.jsonl').write_text('\n'.join(jsonl), encoding='utf-8')

This will give you a JSONL file with one entry per DATE entity found in your corpus, e.g. {"label": "NEW_LABEL", "pattern": [{"lower": "dec"}, {"lower": "2017"}]}. You’ll probably also want to tweak the script to make sure your patterns don’t contain any duplicates. You can then load the file into ner.teach, just like in the example above:

prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl --label NEW_LABEL --patterns patterns.jsonl 
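One way to handle the duplicate issue mentioned above is to serialise each pattern entry and keep only the first occurrence. A minimal sketch, assuming the patterns have been collected as a list of dicts like in the script above (the entries here are just example data):

```python
import json

patterns = [
    {"label": "NEW_LABEL", "pattern": [{"lower": "dec"}, {"lower": "2017"}]},
    {"label": "NEW_LABEL", "pattern": [{"lower": "dec"}, {"lower": "2017"}]},  # duplicate
    {"label": "NEW_LABEL", "pattern": [{"lower": "january"}, {"lower": "2"}]},
]

seen = set()
unique = []
for entry in patterns:
    key = json.dumps(entry, sort_keys=True)  # dicts aren't hashable, strings are
    if key not in seen:
        seen.add(key)
        unique.append(entry)

print('\n'.join(json.dumps(entry) for entry in unique))
```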

I hope this answered your questions – let me know if I forgot something or an aspect of the workflow is still unclear!

(W.P. McNeill) #3

I have a slightly different workflow. I have building specs that mention several dates, and one of those dates is the effective date on which the work starts. You can tell effective dates because they occur in a sentence that says something like “We will start working on January 2, 2010.”

I want to label “January 2, 2010” in the above context an EFFECTIVE_DATE entity. If there were another sentence in the document that said “We will finish work no later than March 3, 2012”, that would not be an EFFECTIVE_DATE. The built-in DATE entity detector is fine for identifying the candidate dates: I’m training a model to distinguish the different contexts.

I created a date-patterns.jsonl file with the following line

{"label": "EFFECTIVE_DATE", "pattern": [{"ENT_TYPE": "DATE"}]}

I then run the following command

prodigy ner.teach my_database en_core_web_lg corpus.jsonl --label EFFECTIVE_DATE --patterns date-patterns.jsonl

This almost does what I want. Prodigy suggests EFFECTIVE_DATE candidates for me to accept or reject, but they are always single tokens. In my first example it would ask whether “January” was an EFFECTIVE_DATE, then “2”, then the comma, then “2010”. This surprises me because DATE entities can be multi-token, for example:

>>> d = nlp("We will start working on January 2, 2010.").ents[0]
>>> d.text, d.label_
('January 2, 2010', 'DATE')

What I want Prodigy to do is propose the entire “January 2, 2010” span as a single candidate EFFECTIVE_DATE. Is this possible?

(Ines Montani) #4

That’s a nice solution, actually – I didn’t even think of the ENT_TYPE attribute! I also like the use case, so thanks for sharing that :blush:

I think what’s going on here is that your pattern [{"ENT_TYPE": "DATE"}] will match one token with the entity type DATE and suggest it as EFFECTIVE_DATE. So out of all the possible analyses, it’s showing you only single-token entity candidates.

If you know the rough date formats you’re looking for – e.g. that they’re usually 3 or 4 tokens long – you should be able to just create one or more patterns with several tokens:

[{"ENT_TYPE": "DATE"}, {"ENT_TYPE": "DATE"}, {"ENT_TYPE": "DATE"}, {"ENT_TYPE": "DATE"}]

If you have one pattern with 3 tokens and one with 4, you might see some overlap and will have to reject more suggestions based on the patterns. But it’ll also mean that it’ll capture more dates, like “January 2, 2010” and “January 2 2010”.

(W.P. McNeill) #5

Thanks. After some experimentation, I found the best approach was to combine ENT_TYPE = DATE with more surface-like features such as ORTH = "," and the IS_ALPHA and IS_DIGIT flags to distinguish between months, days and years.

(Bhanu Sharma) #6

Hey @ines, Prodigy won’t catch 3-token dates if I create a pattern such as:

[{"ENT_TYPE": "DATE"}, {"ENT_TYPE": "DATE"}, {"ENT_TYPE": "DATE"}, {"ENT_TYPE": "DATE"}]

i.e. it will catch “July 2nd, 2017” but not “July 2nd 2017”?
Is there any way to catch all such instances?

(Ines Montani) #7

You might want to try using more explicit patterns based on the lexical attributes, rather than relying on the entity types predicted by the model. If the model you’re using doesn’t actually predict the entity type DATE for your text (which is always context-sensitive), your pattern also won’t match.

Dates are pretty nice here, because they usually follow at least some kind of scheme. So you could, for example, use the token’s SHAPE attribute, optional tokens (for the comma) or the IS_DIGIT flag. The following pattern will match one token (anything), a token of the shape dxx (digit and two lowercase letters, e.g. “2nd”), an optional comma token and one token consisting of 4 digits:

[{}, {"SHAPE": "dxx"}, {"ORTH": ",", "OP": "?"}, {"SHAPE": "dddd"}]

Alternatively, this will match one token, one digit token, an optional comma and a 4-digit token:

[{}, {"IS_DIGIT": true}, {"ORTH": ",", "OP": "?"}, {"SHAPE": "dddd"}]
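Either pattern can be tried out directly in spaCy’s Matcher to confirm that the optional comma ("OP": "?") covers both variants. A minimal sketch, assuming a recent spaCy version (where Matcher.add takes a list of patterns); a blank pipeline suffices, since SHAPE and ORTH are lexical attributes:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# one token (anything), a "2nd"-shaped token, an optional comma, a 4-digit token
matcher.add("DATE", [
    [{}, {"SHAPE": "dxx"}, {"ORTH": ",", "OP": "?"}, {"SHAPE": "dddd"}],
])

for text in ["July 2nd, 2017", "July 2nd 2017"]:
    doc = nlp(text)
    matches = matcher(doc)
    print(text, "->", [doc[start:end].text for _, start, end in matches])
```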

Since there are only 12 months, you could also create copies of those patterns for each month, and make the first token {"LOWER": "january"} etc. However, it’s also totally fine if your patterns are a little ambiguous and produce false positives. In fact, this can actually be very good, because it ensures that your model sees both positive and negative examples of very similar spans of text.
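The per-month copies mentioned above don’t have to be written out by hand. A quick sketch of generating one pattern entry per month name, ready to be dumped to a patterns.jsonl file (the label and the digit-based tail tokens here are just example choices):

```python
import json

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# shared tail: a day number, an optional comma and a 4-digit year
tail = [{"IS_DIGIT": True}, {"ORTH": ",", "OP": "?"}, {"SHAPE": "dddd"}]
patterns = [{"label": "DATE", "pattern": [{"LOWER": month}] + tail}
            for month in MONTHS]

# each entry is one line of the JSONL patterns file
for entry in patterns[:2]:
    print(json.dumps(entry))
```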

You can find more details and examples of possible patterns in spaCy’s Matcher docs. It can also help to tokenize some of your examples with spaCy and inspect the tokens to find the best generalisable patterns and make sure spaCy’s tokenization matches your patterns.

(W.P. McNeill) #8

The following has been an effective set of date patterns for me

{"example": ["September 30, 1971", "September 30 1971"], "pattern": [{"ENT_TYPE": "DATE", "IS_ALPHA": true}, {"ENT_TYPE": "DATE", "IS_DIGIT": true}, {"ENT_TYPE": "DATE", "ORTH": ",", "OP": "*"}, {"ENT_TYPE": "DATE", "IS_DIGIT": true}], "label": "MY_DATE"}
{"example": ["30 September, 1971", "30 September 1971"], "pattern": [{"ENT_TYPE": "DATE", "IS_DIGIT": true}, {"ENT_TYPE": "DATE", "IS_ALPHA": true}, {"ENT_TYPE": "DATE", "ORTH": ",", "OP": "*"}, {"ENT_TYPE": "DATE", "IS_DIGIT": true}], "label": "MY_DATE"}
{"example": ["1st day of September, 1971"], "pattern": [{"SHAPE": "dxx"}, {"LOWER": "day"}, {"LOWER": "of"}, {"ENT_TYPE": "DATE", "IS_ALPHA": true}, {"ENT_TYPE": "DATE", "ORTH": ",", "OP": "*"}, {"ENT_TYPE": "DATE", "IS_DIGIT": true}], "label": "MY_DATE"}
{"example": ["30th day of September, 1971"], "pattern": [{"SHAPE": "ddxx"}, {"LOWER": "day"}, {"LOWER": "of"}, {"ENT_TYPE": "DATE", "IS_ALPHA": true}, {"ENT_TYPE": "DATE", "ORTH": ",", "OP": "*"}, {"ENT_TYPE": "DATE", "IS_DIGIT": true}], "label": "MY_DATE"}
{"example": ["1/1/1971"], "pattern": [{"SHAPE": "d/d/dddd"}], "label": "MY_DATE"}
{"example": ["10/1/1971"], "pattern": [{"SHAPE": "dd/d/dddd"}], "label": "MY_DATE"}
{"example": ["1/10/1971"], "pattern": [{"SHAPE": "d/dd/dddd"}], "label": "MY_DATE"}
{"example": ["10/10/1971"], "pattern": [{"SHAPE": "dd/dd/dddd"}], "label": "MY_DATE"}
{"example": ["1/1/71"], "pattern": [{"SHAPE": "d/d/dd"}], "label": "MY_DATE"}
{"example": ["10/1/71"], "pattern": [{"SHAPE": "dd/d/dd"}], "label": "MY_DATE"}
{"example": ["1/10/71"], "pattern": [{"SHAPE": "d/dd/dd"}], "label": "MY_DATE"}
{"example": ["10/10/71"], "pattern": [{"SHAPE": "dd/dd/dd"}], "label": "MY_DATE"}

(Ines Montani) #9

Ah, I love the idea of adding the examples to the pattern entries! :+1:

We should probably recommend this as a “best practice” when working with patterns – it makes it so much easier to debug them later on. Prodigy could even include an optional pattern validation that tests the pattern against the examples, and warns the user if it doesn’t match. Something like this, just potentially built into the PatternMatcher:

import spacy
from spacy.matcher import Matcher
from prodigy.util import read_jsonl

def validate_patterns(patterns_path, spacy_model):
    nlp = spacy.load(spacy_model)
    patterns = read_jsonl(patterns_path)
    for pattern in patterns:
        if 'example' in pattern:
            matcher = Matcher(nlp.vocab)
            # add the token pattern, not the whole JSONL entry
            matcher.add('PATTERN', None, pattern['pattern'])
            for text in pattern['example']:
                doc = nlp(text)
                if not matcher(doc):
                    print("WARNING: No match", pattern, text,
                          [token.text for token in doc])

(Piyush Raj) #10

@wpm Hello, I’m following this thread to extract dates from a text corpus. I’m using spaCy’s NER model to get dates and have created some custom rules using the Matcher class to identify some missed date formats. Here is my code. I need to test my approach and see its coverage for different kinds of date formats, but I don’t have the data to do so. If I could get such data for this coverage analysis, it would be of great help.