Multiple dependency arcs for same entity


I am hoping for some advice or creative solutions to our challenge:

We are using Prodigy to train a Classifier and an Entity Recogniser that work on incoming request emails: the Classifier identifies the intent and then, depending on the intent, the Entity Recogniser extracts the entities we need to fulfil the request.

Emails obviously have their challenges, but we have managed to train a Classifier to identify intents that correspond to the services we offer. In this case I will be talking about a request to send an asset (e.g. a video file) to a destination.

We have also trained a NER model that works well to then identify the asset ID, TITLE, DURATION, etc.

The problem we now have is where people ask for multiple assets to be sent to multiple destinations, such as:

“Hi,\n\nPlease send ASSET1 and ASSET2 to DESTINATION 1, DESTINATION 2”

“Good morning,\n\nPlease distribute ASSET1 to DESTINATION1 and ASSET2 to DESTINATION2”

“Hello,\n\nPlease send to DESTINATION1 the following videos:\n\nASSET1 TITLE1 DURATION1, ASSET2 TITLE2 DURATION2”

We are thinking that we should train the relationships between Asset and Destination. If there were just one asset going to one or more destinations, we can see how we could create the training data and then use dep.batch-train. But this is obviously not the case in the examples above.

My plan of action is to create new entities “ASSET_SET” and “DESTINATION_SET” that would look for neighbouring entities of the same types and then to train dependencies between them instead.

We are thinking that we could use a patterns file and our existing NER entity types to build the sets, such as:

pattern = [{'ENT_TYPE': 'ASSET'},{'ORTH': ','},{'ENT_TYPE': 'ASSET'}]
pattern = [{'ENT_TYPE': 'ASSET'},{'IS_SPACE': True},{'ENT_TYPE': 'ASSET'}]

  1. Does this strategy make sense? Or is there a better approach here?
  2. How can we create the patterns such that there could be any number of assets or destinations in each set?

Thanks in advance for any help you can provide! Will provide more detail should you need it.

This type of problem is really quite challenging, and we don’t have direct support for it in spaCy and Prodigy yet. You could try to use dependency parse rules to get some of the simplest cases, but I think you’ll have other situations where the dependency path is quite unclear.
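On the second question, rather than writing one pattern per set size, one option is to group neighbouring entities in plain Python after the NER model has run. The following is only a minimal sketch: the function name `group_entity_runs`, the `(label, start, end)` span format and the separator list are all assumptions made for illustration, not part of spaCy or Prodigy.

```python
def group_entity_runs(tokens, entities, separators=(",", "and", "\n")):
    """Merge neighbouring same-label entities into sets.

    tokens:   list of token strings for one email.
    entities: list of (label, start, end) token spans, end exclusive,
              sorted by start (hypothetical format, adapt to your NER output).
    Returns a list of (label, start, end, members) tuples, one per set.
    """
    groups = []
    for label, start, end in entities:
        if groups:
            prev_label, prev_start, prev_end, members = groups[-1]
            gap = tokens[prev_end:start]
            # Extend the current set only if the label matches and everything
            # between the two entities is a separator (or nothing at all).
            if prev_label == label and all(tok in separators for tok in gap):
                members.append((start, end))
                groups[-1] = (label, prev_start, end, members)
                continue
        groups.append((label, start, end, [(start, end)]))
    return groups


tokens = ["Please", "send", "ASSET1", "and", "ASSET2", "to",
          "DESTINATION1", ",", "DESTINATION2"]
entities = [("ASSET", 2, 3), ("ASSET", 4, 5),
            ("DESTINATION", 6, 7), ("DESTINATION", 8, 9)]
sets = group_entity_runs(tokens, entities)
for label, start, end, members in sets:
    print(label + "_SET", tokens[start:end], members)
```

Because the loop simply keeps extending the current set, it handles any number of assets or destinations without a separate rule for each length.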

In general I would think of this as a problem of trying to predict as few additional bits of information as are required for the task. Perhaps you could try this:

  1. Recognise the assets
  2. Recognise the destinations
  3. If you have exactly one asset and one destination, no problem!
  4. If you have exactly two assets and two destinations, there are four possible combinations: both destinations pair to asset one, both destinations pair to asset two, the first asset pairs to the first destination (and the second asset to the second), or the first asset pairs to the second destination (and the second asset to the first). Maybe a text classifier can predict this?
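
The four cases in step 4 are exactly the ways of giving each destination one asset, so they can be listed mechanically. This is a rough sketch under that assumption; `enumerate_pairings` is a hypothetical helper name, and emails where every asset goes to every destination would need an extra label that this enumeration does not produce.

```python
from itertools import product


def enumerate_pairings(assets, destinations):
    """Return every way of giving each destination exactly one asset.

    Each pairing is a tuple of (asset, destination) pairs. With a assets and
    d destinations there are a ** d pairings.
    """
    pairings = []
    for choice in product(assets, repeat=len(destinations)):
        # choice[i] is the asset assigned to destinations[i].
        pairings.append(tuple(zip(choice, destinations)))
    return pairings


candidates = enumerate_pairings(
    ["ASSET1", "ASSET2"], ["DESTINATION1", "DESTINATION2"]
)
# Each index can serve as one class label for a text classifier.
for label, pairing in enumerate(candidates):
    print(label, pairing)
```

With one asset and one destination the same function returns a single pairing, which is the "no problem" case from step 3.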

I’m not sure whether the above would work, but it actually makes good theoretical sense, even though it sounds like a terrible hack.

Generally in any structured prediction problem (like predicting a tree or a graph), what we’d love to do is estimate a probability distribution over the space of all possible structures. The problem is, usually the space is very large, e.g. the space of possible binary trees over a sentence. We can’t enumerate all those trees, so we have to approximate, which means running a model over partial structures.

In your case, the structures you need to predict are very simple, so they’re actually trivial to enumerate. So we should just predict over the possible structures directly, using the whole text as the input features.
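
To make "predict over the possible structures directly" concrete, here is a small sketch of the decoding step. It assumes some classifier has already produced one raw score per candidate structure from the whole email text; the scores below are made up, and `best_structure` is a hypothetical helper, not a spaCy or Prodigy function.

```python
import math
from itertools import product


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_structure(assets, destinations, scores):
    """Pick the most probable asset/destination structure.

    Candidates are every way of giving each destination one asset
    (an assumption that matches the four cases for two assets and
    two destinations). scores must have one entry per candidate.
    """
    candidates = [
        tuple(zip(choice, destinations))
        for choice in product(assets, repeat=len(destinations))
    ]
    if len(scores) != len(candidates):
        raise ValueError("expected one score per candidate structure")
    probs = softmax(scores)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs[best]


# Hypothetical classifier scores for the four candidate structures.
scores = [0.1, 2.3, 0.4, -1.0]
structure, prob = best_structure(
    ["ASSET1", "ASSET2"], ["DESTINATION1", "DESTINATION2"], scores
)
print(structure, round(prob, 2))
```

Since the whole space is enumerated, the probabilities sum to one over complete structures and no search over partial structures is needed.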