Hi! Thanks for sharing your process and workflow. The problems you describe are definitely non-trivial, so I hope you're not too discouraged by the results so far. I think the general approach you chose made sense, but there are a few potential issues:
There are a few problems I see with your patterns here. First, keep in mind that those are exact match patterns. So the first pattern will only match exact occurrences of tokens whose lowercase text is identical to "la120229.4311". Unless this is a super common example in your data, you likely won't see any matches here. Instead, it makes more sense to work with more abstract token attributes, like the shape, e.g. `token.shape_`, which for this token would be something like "xxdddd.dddd" (lowercase letters become "x", digits become "d", and runs of the same character class are capped at four).
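Rather than guessing the shape, you can print it and copy the value into your pattern. A quick sketch, assuming the small English model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("la120229.4311")

# Print each token's text and shape, so the pattern can use the real
# SHAPE value instead of a guessed one
print([(token.text, token.shape_) for token in doc])
```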
You also want to make sure to verify that spaCy's tokenization matches the tokens defined in the patterns. Patterns are token-based, so each entry in the list should represent one single token and its attributes. This is also the reason why your second pattern will never match: there won't be a token whose lowercase text matches the string "Golden Fantastic Airlines Public Co. Ltd.", because spaCy will split this into several tokens:
```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Golden Fantastic Airlines Public Co. Ltd.")
print([token.text for token in doc])
# ['Golden', 'Fantastic', 'Airlines', 'Public', 'Co.', 'Ltd.']
```
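A token-based version of that pattern would then need one entry per token. As a rough sketch of what one line of a patterns file could look like (the `ORG` label is just an example, adjust it to your data):

```python
import json

# One dict per token, matching the tokenization shown above
pattern = {
    "label": "ORG",
    "pattern": [
        {"lower": "golden"},
        {"lower": "fantastic"},
        {"lower": "airlines"},
        {"lower": "public"},
        {"lower": "co."},
        {"lower": "ltd."},
    ],
}

# One line of a JSONL patterns file
print(json.dumps(pattern))
```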
You might find our Matcher demo useful, which lets you construct match patterns interactively and test them against your text.
That's difficult to say and really depends on the data and results, especially since some of your problems were likely caused by the suboptimal patterns. Based on your descriptions, I'd say there are three main possible solutions and annotation strategies:
- Use `ner.teach` with better patterns that produce more matches. This will make it easier to move the model towards the desired definitions.
- Try `ner.make-gold` with all labels that you need (e.g. `MONEY`, `DATE`, `ORG` and your new types like `CONTRACT_NUMBER`). This way, your training data will include both your new definitions and the entities that the model previously got right. This can prevent the so-called "catastrophic forgetting", and it lets you train with `ner.batch-train` and the `--no-missing` flag, telling spaCy that the annotations cover all entities. This can produce better accuracy, because non-annotated tokens are considered "not an entity", instead of "maybe an entity, maybe not, we don't have data for it".
- Start with a blank model instead of a pre-trained model and teach it about your categories from scratch. This might require slightly more data, but it also means that the pre-trained weights won't interfere with your new definitions. If you write enough descriptive patterns for entity candidates, you can still use the `ner.teach` recipe to collect training data. Alternatively, you could use `ner.manual` to create a gold-standard set from scratch.
The best way to find out which approach works best is to start trying them. This is often what it comes down to, and allowing for fast iteration and experiments was one of the key motivations for us to develop Prodigy.
I'd also recommend holding back some of your data and using `ner.manual` to create a gold-standard evaluation set. This will make it easier to reliably compare the different approaches you try and to figure out which training dataset produces the best accuracy. By default, Prodigy's `batch-train` recipes hold back a certain percentage of your training data, which is fine as an approximation for quick experiments. But once you're getting more serious about finding the best training approach, you usually also want a dedicated evaluation set.
In theory, yes – assuming that the conclusion can be drawn from the local context. This is where statistical NER can be very powerful, because it lets you generalise based on similar examples.
However, if you're updating a pre-trained model, it's not always the best approach to try and teach it a completely new definition of an entity type. For example, it might not be very efficient to try and teach the pre-trained model that the tokens "MSN 1298" should not be analysed as `["O", "U-CARDINAL"]` (outside an entity, single-token "unit" entity) but instead as `["B-SERIAL_NUMBER", "L-SERIAL_NUMBER"]` (beginning and last token of an entity). Instead, it might make more sense to solve this by writing token-based rules that check for "MSN" followed by one or more number tokens.
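For example, a rule like that could be written with spaCy's `Matcher`. This is just a sketch with a made-up example sentence and label; note that in spaCy v3, the call would be `matcher.add("SERIAL_NUMBER", [pattern])` instead:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# "MSN" followed by one or more number tokens
pattern = [{"ORTH": "MSN"}, {"IS_DIGIT": True, "OP": "+"}]
matcher.add("SERIAL_NUMBER", None, pattern)  # spaCy v2 signature

doc = nlp("The aircraft with MSN 1298 was delivered in March.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "MSN 1298"
```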
Similarly, your contractual date example might also be a better fit for a combination of predicting the `DATE` entity type and then using rules or a separate statistical process to determine whether it's a contractual date or not. So instead of annotating an entirely new category `CONTRACT_DATE` that conflicts with the existing `DATE` label, you probably want to try improving the existing `DATE` label on your data so it makes as few errors as possible, and then add a second process on top to label it as the subtype `CONTRACT_DATE`. See this thread on nested labels for an example.
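To illustrate the idea, a very simplified version of such a second process could look like the following. The trigger words and example sentence are made up, so treat this as a sketch rather than a full solution:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This agreement shall commence on 1 January 2019.")

# Made-up trigger words that hint at a contractual context
CONTRACT_TRIGGERS = {"agreement", "contract", "commence", "terminate"}

for ent in doc.ents:
    if ent.label_ != "DATE":
        continue
    # Check the surrounding sentence for contract-related words
    context = {token.lower_ for token in ent.sent}
    label = "CONTRACT_DATE" if context & CONTRACT_TRIGGERS else "DATE"
    print(ent.text, label)
```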
The following threads discuss similar approaches and ideas for combining statistical models with rule-based systems:
If you haven't seen it already, you might also like @honnibal's talk about the iterative development approach and how to find the right approach for data collection. The crime / victim / crime location example shown at 11:38 is actually kinda similar to your "contractual date" type.
Shorter units of text are usually easier to work with, and also make annotation more efficient, since you can focus on smaller chunks at a time. So if you have control over the incoming data, and you can decide what the model should see at runtime, you might as well use single sentences. Just make sure that the training data is similar to what the model will see at runtime and vice versa.
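If you need to split longer documents into sentences first, spaCy's sentence segmentation can do that. A minimal sketch with a made-up example text:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The agreement was signed on 1 January 2019. It covers MSN 1298."
doc = nlp(text)

# One unit of text per sentence, e.g. to feed into annotation or training
for sent in doc.sents:
    print(sent.text)
```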