I have a project where I've done a lot of custom entity tagging, but I also want to train a model for custom semantic relations similar to what's described here: https://spacy.io/usage/training#intent-parser
I don't see how the dep.teach recipe can be used here.
I've looked at tools like Brat and Webanno, but I'm quite confused by all the formats... has anyone figured out the best workflow for this?
We have a solution in mind for this, but we haven't had a chance to implement it yet. Currently most people are interested in span identification (like NER) or text classification problems, so annotating trees and graphs hasn't been as much of a priority.
The other problem is that tree- and graph-problems are often quite different from each other, suggesting different solutions. The main question is: how complicated are your trees? Do they only have a few relations per sentence, or are they quite dense?
If most of the information is in identifying the anchors of the relations (e.g. entity spans), and the relation only provides something like a directional relationship (such as which company bought the other), you may find that you can do the relation classification as text classification. For instance, you'd have a text classification label that said A buys B and a second class for B buys A. I would suggest this as a good approach so long as you will only have a few dozen labels. If you can do it this way, you'll also probably get great accuracy: the model will perform so much better if it gets to predict the whole structure like that, instead of having to compromise and predict the edges individually.
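As a rough sketch of what that framing could look like: each ordered pair of anchors becomes one classification example, and the label carries the direction. The label name `B_BUYS_A`, the `[A]`/`[B]` marker convention, the helper `make_textcat_example` and the example sentence are all made up for illustration; none of them are part of Prodigy or spaCy.

```python
def make_textcat_example(text, span_a, span_b, label):
    """Build one classification example for an ordered pair of anchors.

    span_a and span_b are (start, end) character offsets, with span_a
    occurring before span_b and no overlap. The anchors are wrapped in
    [A]...[/A] and [B]...[/B] markers (a hypothetical convention) so a
    text classifier can see which two spans the label refers to.
    """
    (a_start, a_end), (b_start, b_end) = span_a, span_b
    if not (a_start < a_end <= b_start < b_end):
        raise ValueError("expected span_a before span_b, without overlap")
    marked = (
        text[:a_start]
        + "[A]" + text[a_start:a_end] + "[/A]"
        + text[a_end:b_start]
        + "[B]" + text[b_start:b_end] + "[/B]"
        + text[b_end:]
    )
    return {"text": marked, "label": label}


# Invented sentence: the second company is the buyer, so the label is B_BUYS_A.
text = "Acme was acquired by Globex last year."
example = make_textcat_example(text, (0, 4), (21, 27), "B_BUYS_A")
print(example)
```

The whole structure (which anchor is the buyer) is predicted in one decision per pair, which is the point of treating it as classification rather than edge prediction.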
If you do have complicated relations to predict, then for now I would probably suggest doing the annotations in a tool like WebAnno or BRAT. As I said, we hope to support this in the future, but at least there are currently free tools that should be quite productive.
Thanks @honnibal, good to know that it's on the to-do list for Prodigy! And of course, I assume that if I first create a custom spaCy parser model with a little bit of manually created seed training data, I could technically use dep.teach after that?
Good point about using text classification in the simple cases... I think mine requires some more structure though. Not dense like dependency parsing, but more than a single relation per sentence. Here's an example:
I want to exchange black t-shirts for white jeans... 10 of the former, 5 of the latter
I.e., a sentence can contain multiple products, each with accompanying attributes, and then I can have disjointly placed entities that reference specific products mentioned. Seems to me this would be best handled by a relationship parser, do you agree? Or is there a simpler approach?
Bonus question: let's say you have an ontology where your objects are uniquely defined by a set of attributes, but you're only selling one type of object so it's not necessary to specify the root:
Instead of
"I want two black t-shirts in medium and three white t-shirts, small",
you'd have
"I want two black in medium and three white, small"
How would you define the relations here? I know that there are two "hidden" t-shirt entities, which have directed relations to the attributes. But the grouping black+medium and white+small isn't really directional. Is there any non-directed relationship in spaCy?
Well, I think you might have luck with text classification here. It might not work, but it's worth a try.
I would code the above with something like IN_ORDER: the attributes match up to the products such that the first attribute matches the first product, and the second attribute matches the second product.
Here's a simple and general coding scheme you could try. Assign the products numbers according to their order of occurrence. So in your example above, you'll have 1. black t-shirts and 2. white jeans. Now assign each attribute the number of the product it applies to. In your example, the first attribute (10) applies to product 1 and the second (5) applies to product 2, so the label to assign to the input is 1,2.
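To make the numbering concrete, here is a minimal sketch of how such a label could be computed from annotated data. The function name `encode_alignment` and the input shapes are assumptions for illustration, and it assumes each product mention string is unique within the sentence.

```python
def encode_alignment(products, attribute_to_product):
    """Turn an attribute-to-product mapping into a single label string.

    products: product mentions in order of occurrence.
    attribute_to_product: (attribute, product) pairs, ordered by where the
    attribute occurs in the text.
    """
    # Number the products 1, 2, ... by order of occurrence.
    number = {product: i for i, product in enumerate(products, start=1)}
    # The label lists, for each attribute in turn, the number of its product.
    return ",".join(str(number[product]) for _, product in attribute_to_product)


products = ["black t-shirts", "white jeans"]

# "... 10 of the former, 5 of the latter"
in_order = encode_alignment(
    products, [("10", "black t-shirts"), ("5", "white jeans")]
)
# The same products with the attributes mentioned the other way round:
# "... 5 of the latter, 10 of the former"
reversed_order = encode_alignment(
    products, [("5", "white jeans"), ("10", "black t-shirts")]
)
print(in_order, reversed_order)
```

The first label comes out as `1,2` and the second as `2,1`, so each distinct alignment gets its own class for the text classifier.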
You're probably not going to have that many products per sentence, which means the range of possible structures to predict might be pretty limited. Maybe this won't work, but maybe it will.
Another option is to use rules to get all the really easy cases, like cases where the attribute is directly attached to the product in the dependency tree. That might do really well, and then you can think of how to use models to get the more difficult cases only.
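A minimal sketch of such a rule, written against plain dicts so it runs without a trained parser. The token structure, the function name `attach_by_rule` and the head indices below are hand-written assumptions, not real parser output; with spaCy you would read the head from each token instead.

```python
def attach_by_rule(tokens, product_ids, attribute_ids):
    """Pair each attribute with a product when the product is its direct head.

    tokens: list of dicts with "text" and "head" (index of the syntactic head).
    Returns (pairs, leftover): pairs holds (attribute_idx, product_idx) tuples,
    leftover holds the attribute indices the rule could not resolve.
    """
    products = set(product_ids)
    pairs, leftover = [], []
    for attr in attribute_ids:
        head = tokens[attr]["head"]
        if head in products:
            pairs.append((attr, head))   # easy case: attached directly
        else:
            leftover.append(attr)        # hard case: leave for a model
    return pairs, leftover


# "I want black shirts , medium" with a hand-written (assumed) parse.
tokens = [
    {"text": "I", "head": 1},
    {"text": "want", "head": 1},      # root points at itself
    {"text": "black", "head": 3},     # modifies "shirts"
    {"text": "shirts", "head": 1},
    {"text": ",", "head": 1},
    {"text": "medium", "head": 1},    # not attached to the product
]
pairs, leftover = attach_by_rule(tokens, product_ids=[3], attribute_ids=[2, 5])
print(pairs, leftover)
```

Here "black" is resolved by the rule, while "medium" ends up in `leftover`, which is the subset you would then hand to a model.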
It could be that these approaches really don't work and you do need to use relation parsing, but predicting trees is hard work, and the methods for it are really designed for hard cases like syntax, where the space of trees is completely enormous. If the space of trees is actually very small, you might benefit from just predicting over the possible tree shapes directly.
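To get a feel for how small that space can be, here is a quick sketch that enumerates every possible label under the numbering scheme, assuming each attribute belongs to exactly one product. The function name `candidate_labels` is made up for illustration.

```python
from itertools import product as cartesian


def candidate_labels(n_products, n_attributes):
    """List every way to assign each attribute to exactly one product."""
    choices = range(1, n_products + 1)
    return [
        ",".join(str(c) for c in combo)
        for combo in cartesian(choices, repeat=n_attributes)
    ]


two_by_two = candidate_labels(2, 2)
three_by_three = candidate_labels(3, 3)
print(two_by_two)
print(len(three_by_three))
```

With two products and two attributes there are only four candidate labels, and with three of each there are 27, so a classifier over whole structures stays well within "a few dozen labels" for short sentences.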
Thanks @honnibal, that's very helpful. I see how this approach might make sense, and it also makes it a lot easier to use Prodigy for. I think I'll give it a go!