First of all congratulations with v1.9. The docs and the changelog looks amazing
I have a few questions. Lets say I have input stating
Revenue grew 10% to EUR 5 billion a rise from previous year where the revenue amounted to EUR 4 billion.
and first of all I want to locate this years revenue, i.e. EUR 5 billion. I could try to define an entity THIS_YEAR_REVENUE but I think it would perform poorly since it would probably also catch the previous years revenue. Quoting the docs (by default I think spaCy looks on four preceding tokens and four following!?):
A good way to think about how easy the model will find the task is to imagine you had to look at only the first word of the entity, with no context. How accurately would you be able to tell how that word should be labelled? Now imagine you had one word of context on either side, and ask yourself the same question.
Instead I could frame this task as a text classification problem.
If you find that annotators can’t agree on exactly which words should be tagged as an entity, that’s probably a sign that you’re trying to mark something that’s a bit too semantic, in which case text classification would be a better approach.
That is what you show here, right? I.e. the example with CORRECT label. So my task would look like
{
"text": "Revenue grew 10% to EUR 5 billion a rise from previous year where the revenue amounted to EUR 4 billion.",
"label": "CORRECT",
"spans": [{ "start": 5, "end": 7, "label": "THIS_YEAR_REVENUE" }]
}
But my question is then, how do you actually train such a classifier? I could use patterns to extract all candidates, i.e. EUR 5 billion and previous years EUR 4 billion but how do I then classify each of those? I mean my input is not just text but also a span.
my small contribute to your problem. I would try to proceed like this:
train a TEXTCAT to detect correctly sentences like "Revenue grew 10% to EUR 5 billion a rise from previous year where the revenue amounted to EUR 4 billion." where you know there is the information you need.
train a NER to detect revenue enties like: "EUR 5 billion" , "EUR 4 billion".
Use dependecy parsing to frame the meaning of your revenue entities, may be chaining it with other NER models to detect class of entities that are important (e.g. "previuos year")
It sounds a lot like my approach indeed and my main question is regarding point 3 really. I have considered using the dependency parser. However I'd like to experiment with a "smart" ML approach that simply learns to choose the correct span out of list of candidates instead of depending on logic based off the dependency parser. But I suppose I should come up with a custom classifier with a different feature space than "just text". But it could be there was a way to do it using spaCy.
@nix411 We don't really have a solution for arbitrary structure prediction in spaCy at the moment, especially if you want arbitrary features. spaCy's really set up for reuseable components --- for task-specific models, we can't really offer anything better than PyTorch for defining the model.
So I would suggest using PyTorch to define the model, which you could wrap as a spaCy pipeline component, setting extension attributes with the outputs.
Cool - I'll try to do that. Do you have any suggestions how to model it though? My initial thought was simply to merge doc.vector with ent.vector as the feature space and have CORRECT and INCORRECT as labels and then run a simple classifier (e.g. neural network or even xgboost) but maybe you have a better deep learning approach/suggestion (if its not too much to ask)?
I would probably try looking for a PyTorch tutorial or code-base that solves a problem with a similar type of output. It's probably going to be easier to work from that and get the code running without spaCy. You could also have a look at the Allen NLP library --- they have some semantic parsing components.