Training dependency parser for multi-word entities

Hi All,

I would like to train the parser to recognize a new dependency (called MONTHLY_SALES) between the DATE and MONEY entity types, such that the training data would look like:

        "Sales were $1 million in the first quarter of 2018.",
            "heads": [1, 1, 4, 4, 8, 1, 8, 8, 5, 8, 9, 1],
            "deps": ["nsubj", "ROOT", "quantmod", "compound", "MONTHLY_SALES", "prep", "det", "amod", "pobj", "prep", "pobj", "punct"],

It seems more logical, however, to treat the DATE and MONEY entities as spans whereby the merged entity tokens would be used instead of the individual tokens as the indices for the dependency arcs, as in:

(Note: in the example below, “$1 million” is a MONEY entity and “the first quarter of 2018” is a DATE entity.)

        "Sales were $1 million in the first quarter of 2018.",
            "heads": [1, 1, 4, 1, 3, 1],
            "deps": ["nsubj", "ROOT", "MONTHLY_SALES", "prep", "pobj", "punct"],

Which one of the training data examples would work, or do I need to use Prodigy?

Thanks for your assistance.

It’s probably a bad idea to try to predict the semantic relationship you’re interested in at the same time as you’re predicting the normal syntactic tree.

The parser has to predict a connected tree over the whole sentence. This means that the annotation scheme has to make sense over the language as a whole, or it will be really difficult to learn. There’s no syntactic relationship between “$1 million” and “the first quarter of 2018”, and in fact, you couldn’t easily teach the parser to attach “quarter” to any word but “in”: this type of construction (a prepositional phrase) is very common in English, so it’s important that it has a consistent annotation. If your scheme tries to teach the parser to annotate a minority of prepositional phrases differently, it will be very difficult to learn.

The example of the preposition attachment might seem a bit trivial (after all, I’m sure you’d be happy to have the date phrase headed by “in”), but it’s a good illustration of the deeper point: because of the tree constraint, in the syntactic parser all the parts of the annotation scheme interact. You can really only change the annotation scheme if you’re careful to follow a linguistically precise definition of what you’re changing, so that you can ensure the annotation scheme remains consistent.
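To make the tree constraint concrete, here is a minimal sketch of a well-formedness check over a `heads` array. It assumes the convention used in the training examples above (the root token points to itself), and `is_single_rooted_tree` is a hypothetical helper, not part of spaCy. Both arrays from the question pass the check, which suggests the difficulty lies in keeping the scheme consistent rather than in producing a valid tree.

```python
def is_single_rooted_tree(heads):
    """Return True if `heads` encodes one connected tree with exactly one root.

    heads[i] is the index of token i's head; the root points to itself.
    """
    roots = [i for i, h in enumerate(heads) if h == i]
    if len(roots) != 1:
        return False
    for i in range(len(heads)):
        seen = set()
        node = i
        # Walk up towards the root; revisiting a node means there is a cycle.
        while heads[node] != node:
            if node in seen or not 0 <= heads[node] < len(heads):
                return False
            seen.add(node)
            node = heads[node]
    return True


# Token-level annotation proposed in the question (12 tokens).
proposed = [1, 1, 4, 4, 8, 1, 8, 8, 5, 8, 9, 1]
# Merged-entity annotation proposed in the question (6 tokens).
merged = [1, 1, 4, 1, 3, 1]

print(is_single_rooted_tree(proposed))   # True
print(is_single_rooted_tree(merged))     # True
print(is_single_rooted_tree([1, 0, 2]))  # False: tokens 0 and 1 form a cycle
```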

Instead of moving the relationship you want to learn into the syntactic annotation scheme, I would suggest using rules or a machine learning model that takes the entity and parse annotations and predicts the extra information you’re interested in. For instance, consider the normal tree structure for your example sentence.

The currency and date phrase both connect to the root verb “were”, which is the correct syntactic structure for the sentence. It should be pretty easy to write rules to find sentences like this, based on the dependency parse.
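A rule of this kind could be sketched as follows, using plain lists so that it runs without a trained model. The `heads` and `deps` values are an assumption about what the conventional parse looks like (the MONEY phrase as `attr` of “were”, the DATE phrase under the preposition “in”), and `span_root` and `governor` are hypothetical helper names. With spaCy, the same information would come from `token.head` and `token.dep_`.

```python
def span_root(heads, start, end):
    """Index of the token in [start, end) whose head lies outside the span."""
    for i in range(start, end):
        head = heads[i]
        if head == i or not start <= head < end:
            return i
    raise ValueError("span has no root token")


def governor(heads, deps, start, end):
    """Head of the span's root, skipping over a preposition if there is one."""
    root = span_root(heads, start, end)
    head = heads[root]
    if deps[root] == "pobj" and deps[head] == "prep":
        head = heads[head]
    return head


words = ["Sales", "were", "$", "1", "million", "in",
         "the", "first", "quarter", "of", "2018", "."]
# Assumed conventional parse: "million" is the attr of "were",
# and "in" is a prep attached to "were".
heads = [1, 1, 4, 4, 1, 1, 8, 8, 5, 8, 9, 1]
deps = ["nsubj", "ROOT", "quantmod", "compound", "attr", "prep",
        "det", "amod", "pobj", "prep", "pobj", "punct"]
money = (2, 5)   # "$1 million"
date = (6, 11)   # "the first quarter of 2018"

money_gov = governor(heads, deps, *money)
date_gov = governor(heads, deps, *date)
if money_gov == date_gov:
    # Both phrases hang off the same verb, so treat them as related.
    print(words[money_gov])  # were
```

Skipping the preposition is a deliberate simplification: it covers the “in <date>” pattern from this sentence only, and other constructions would need their own rules.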

I would use Prodigy’s textcat.teach recipe to try to tag sentences that express the relationship you’re interested in. Once you have the sentences, you can write rules that try to cover the sorts of constructions used to express the relationship. If rules don’t work well, you could also use a custom machine learning model.

Thanks for the detailed and informative reply @honnibal. I am not sure, however, how to use the syntactic dependencies between the verb(s) and currency/date pairs in a more realistic scenario. For example, the following sentence:

“Net income for the three months ended March 31, 2018, was $6.9 million, and represented earnings per share, basic and diluted, of $0.23 and $0.22, respectively.”

has a much more convoluted relationship between those entities; how to construct rules that connect these objects is not apparent when reviewing the sentence in displaCy.

You mention using textcat.teach to tag sentences as an alternative strategy - can you point me to some examples where this recipe is being used?

Thanks for your help.

Well, I’m actually not sure how you’d like the relationship to look there. I think your main challenge will be to define a semantics that strikes a good balance between the level of detail you want, and internal consistency. Without the consistency, you’ll have trouble applying the annotation scheme accurately, and also have trouble recovering the relationships.

In general, you can view “net income” as a predicate in the construction above, and there are several ways of attaching attributes to it. One way is via the prepositional phrase (“for the three months ended March 31, 2018”). Another is via attributes mediated by verbs like “was”. There’s a limited number of attributive verbs like that, and the dependency parse gives you clues via the attr relation, so you can fetch these verb-mediated attributes that way. Here’s an example:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Net income for the three months ended March 31, 2018, was $6.9 million, and represented earnings per share, basic and diluted, of $0.23 and $0.22, respectively.")
>>> net_income = doc[:2]
>>> print(net_income)
Net income
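>>> attributes = []
>>> # Attributes attached via a prepositional phrase: take the object (pobj) of each preposition to the right of "Net income".
>>> for pp in net_income.rights:
...   if pp.dep_ == "prep":
...     attributes.append([pobj for pobj in pp.rights if pobj.dep_ == "pobj"][0])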
>>> attributes
>>> attributes[-1].subtree
<generator object at 0x7f219ab0c1f8>
>>> list(attributes[-1].subtree)
[the, three, months, ended, March, 31, ,, 2018]
>>> net_income[-1].head
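>>> # Attributes mediated by the verb: children of the head of "income" that carry the attr relation.
>>> for child in net_income[-1].head.rights:
...   if child.dep_ == "attr":
...     attributes.append(child)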
>>> print(["".join(w.text_with_ws for w in attr.subtree) for attr in attributes])
['the three months ended March 31, 2018', '$6.9 million']

Thanks for the reply and code sample. This works great for the first money entity ($6.9 million) but the attributive verb’s dependencies do not seem to extend to the other currencies ("$0.23 and $0.22") in the sentence. Is there a way to link these two currencies to the date using the information from the parser?

Thanks again for your explanations.

I think a heuristic that says “if there’s only one date in the sentence, it’s probably the date for the currency amounts” is likely to perform well. If you have cases where this doesn’t apply, you could try to use the dependency parse to detect those cases and filter them out. You might also be able to train a text classifier to learn that distinction.
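As a rough sketch, the single-date heuristic could look like the following. It works on plain `(text, label)` tuples so that it runs without a model; with spaCy these would come from something like `(ent.text, ent.label_)` for each entity in the sentence. The function name `link_amounts_to_date` and the entity list are assumptions made for illustration, and a real NER model may split the entities differently.

```python
def link_amounts_to_date(entities):
    """Pair each MONEY entity with the sentence's DATE if there is exactly one.

    `entities` is a list of (text, label) tuples for a single sentence.
    Returns [] when the heuristic does not apply (zero or several dates).
    """
    dates = [text for text, label in entities if label == "DATE"]
    if len(dates) != 1:
        return []
    return [(text, dates[0]) for text, label in entities if label == "MONEY"]


# Entities assumed for the "Net income ..." sentence from this thread.
entities = [
    ("the three months ended March 31, 2018", "DATE"),
    ("$6.9 million", "MONEY"),
    ("$0.23", "MONEY"),
    ("$0.22", "MONEY"),
]

for amount, date in link_amounts_to_date(entities):
    print(amount, "->", date)
```

Returning an empty list for sentences with several dates keeps the ambiguous cases separate, so they can be handled by parse-based rules or a classifier instead.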

@honnibal, I was hoping to get away from a rules-based approach, which could quickly mushroom into a complex nightmare of brittle logic. If not through a direct relationship between currencies and dates, is there a possibility of linking them through a third entity, say, the nominal subject? I personally could not find a consistent relationship there.

You mention that I could use the dependency parse to filter out outliers to the heuristic or train a text classifier to make that distinction. I’m not exactly clear on how that could be done - can you provide some further explanations?

Lastly, I find that the NER makes some mistakes with regard to date recognition (e.g. “in the current period”, “Q1”, etc.) which are very specific to the financial domain. What course of action do you recommend for improving this performance?