It’s fine to ask this here :). One of the reasons we wanted to build on our own technologies is that it’s not very satisfying to ask people to raise the issue upstream — we want to be able to just answer the question.
As far as actually computing the position feature, you might want to base it off the token.idx attribute, and then discretise this somehow. Bins like 0-10, 10-30, 30-60, 60+ would probably be good, and additional bins for regions near the end might also be useful. If you hate arbitrary things like this, you can use information theory to find more principled cut points: you want the cuts that maximise the information gain with respect to the entity labels. I wouldn’t bother doing that, though.
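For example, here’s roughly what that discretisation looks like on a Doc, using numpy.digitize with the cut points above (whether you bin the character offset token.idx or the token index token.i is up to you):

import numpy
import spacy

nlp = spacy.blank('en')
doc = nlp(u'The first few tokens of a document often look a bit different.')

# Cut points for the bins 0-10, 10-30, 30-60, 60+
cuts = numpy.asarray([10, 30, 60])
positions = numpy.asarray([token.idx for token in doc])
bin_ids = numpy.digitize(positions, cuts)  # values in 0..3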
I can think of four pretty different ways you could get the model to take advantage of the position features. Unfortunately they’re all somewhat complicated, and I’m in Sydney until early January, so I can only offer limited assistance. The four options are:
1. Add a feature to the word representations.
2. Add a feature to the state representations.
3. Create a linear model, and make the scores linear(input) + neural(input). This way you can add whatever features you want to the linear model.
4. Create a multi-task objective, so the model is trained to predict the position given the input.
Number 4 is the fancy new neural way of going about things. The idea is a bit counter-intuitive. Instead of incorporating the position information as a feature, you encourage the model to learn that the start of the document looks different from the rest of the document. The knowledge required to make that distinction gets put into the word representations, which are shared between the position prediction model and the entity prediction model. At run-time, you throw away the position prediction model, and just use the word representations.
Have a look at the init_multitask_objectives method of EntityRecognizer in https://github.com/explosion/spaCy/blob/master/spacy/pipeline.pyx. If you subclass this class, you can have it create a subclass of MultitaskObjective that tries to predict a position feature.
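To make that more concrete, here’s a rough sketch of what the subclass might look like. The names position_label and PositionEntityRecognizer are made up, and I’m assuming MultitaskObjective accepts a callable target with the signature (i, words, tags, heads, deps, ents) and that the component keeps its side-objectives in self._multitasks, so double-check the current pipeline.pyx before copying this:

from spacy.pipeline import EntityRecognizer, MultitaskObjective

def position_label(i, words, tags, heads, deps, ents):
    # Hypothetical label function: bucket the token index into a coarse
    # position category for the side-objective to predict.
    if i < 10:
        return 'start'
    elif i < 30:
        return 'early'
    elif i < 60:
        return 'middle'
    else:
        return 'late'

class PositionEntityRecognizer(EntityRecognizer):
    def init_multitask_objectives(self, gold_tuples, pipeline, **cfg):
        # Create the side-objective with our own target, and register it so
        # it gets trained alongside the entity model.
        labeller = MultitaskObjective(self.vocab, target=position_label)
        pipeline.append(labeller)
        self._multitasks.append(labeller)
        # You’ll also want to initialise the labeller so that it shares the
        # tok2vec weights with the entity model; see how begin_training is
        # called on the built-in objectives in pipeline.pyx.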
I’m planning to build hooks to make option 3 much easier, because I think this will be useful for the syntactic parser. This is also my plan for addressing the current speed problems: make the model a linear and neural ensemble, and let people switch off the neural component if they want the parser to be faster. So I’ll defer discussion of option 3 for now.
Options 1 and 2 will be easiest to do if you build from source while you experiment. Once you’ve finished experimenting, you can work out the neatest way to subclass everything to make your changes.
The part of the code to modify for option 1 is the Tok2Vec function in spacy/_ml.py. You need to write a function along these lines:
from spacy.attrs import IDX   # IDX: attribute ID for the token's character offset
from thinc.api import layerize, chain, with_flatten, FeatureExtracter
from thinc.i2v import Embed
from thinc.neural import Model
import numpy

def discretize(bins):
    def discretize_fwd(positions, drop=0.):
        # Take a single array of integer positions, and return a single
        # array of uint64 feature IDs (one bin ID per token).
        discretized = numpy.digitize(positions.ravel(), bins).astype('uint64')
        # The binning isn't differentiable, so there's no backprop callback.
        return discretized, None
    return layerize(discretize_fwd)

def EmbedPositions(width, bins):
    with Model.define_operators({'>>': chain}):
        model = (
            FeatureExtracter([IDX])
            >> with_flatten(
                discretize(bins)
                # digitize produces IDs 0..len(bins), so the table needs
                # len(bins) + 1 rows
                >> Embed(width, nV=len(bins) + 1)
            )
        )
    return model
The idea is to create a feature extracter that takes Doc objects, and returns a list of arrays with the position vectors. This list of vectors can then be used as part of the Tok2Vec function. You could put the position vectors before or after the convolutions; I’m not sure which would work best.
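If you want to try the “after the convolutions” version, one way to combine the two is a small wrapper that runs both models over the same batch of Doc objects and adds their per-document outputs. This is only a sketch: it assumes both models follow thinc’s begin_update convention and produce one (n_tokens, width) array per Doc, and add_position_vectors is a name I’ve made up, not an existing spaCy or thinc function.

from thinc.api import layerize

def add_position_vectors(tok2vec, positions):
    # Wrap two doc-to-vectors models, summing their outputs token by token.
    # Both models must be built with the same output width.
    def forward(docs, drop=0.):
        words, bp_words = tok2vec.begin_update(docs, drop=drop)
        pos, bp_pos = positions.begin_update(docs, drop=drop)
        summed = [w + p for w, p in zip(words, pos)]

        def backward(d_summed, sgd=None):
            # The gradient of a sum passes unchanged into both branches.
            if bp_pos is not None:
                bp_pos(d_summed, sgd=sgd)
            return bp_words(d_summed, sgd=sgd) if bp_words is not None else None

        return summed, backward

    return layerize(forward)

You’d then build the combined model with something like add_position_vectors(Tok2Vec(width, embed_size), EmbedPositions(width, bins)); putting the position vectors before the convolutions would instead mean editing Tok2Vec itself so the position embedding is mixed in with the other token features.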
Strategy 2 — extending the state representation — is currently difficult, because the state representation assumes that all of the values refer to tokens.