Best practice for post-processing

I am implementing a custom component to set some attributes on my Doc, Span and Token, following the same structure as in this example. I will probably put the custom component in its own package and expose it via entry points, as explained here.

My question is: how do I add methods to my Doc, like the post-processing done in main() in the example referenced above? It could be a method like get_countries(). I suppose I could add it as an attribute as well, but is this the best way to do it?

Yes, that’s pretty much exactly what the custom attributes with getters and custom pipeline components were designed for :+1:

But when you create the getter functions, you don’t have access to the Doc, only the Tokens, right? At least based on has_tech_org(self, tokens) in this example.

I guess I could just add self.doc as an attribute in the __call__() method? Is that the way to do it?

The getter function always receives the object it’s called on as its argument – so if you add a getter to the Doc, that function will receive the doc object:

from spacy.tokens import Doc

def get_something(doc):
    # doc is the Doc object the attribute is accessed on
    return  # something

Doc.set_extension('something', getter=get_something)
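To make this concrete for the get_countries() case from the question, here is a minimal sketch. It assumes a recent spaCy version (string labels on Span and the force argument), sets the entities by hand instead of running a trained pipeline, and treats every GPE entity as a country, which is a simplification. The names countries and countries_starting_with are hypothetical.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def get_countries(doc):
    # Getter: receives the Doc it is accessed on and returns the text
    # of every entity labelled GPE (assumed here to mean "country").
    return [ent.text for ent in doc.ents if ent.label_ == "GPE"]


def countries_starting_with(doc, prefix):
    # Method: receives the Doc first, then the caller's own arguments.
    return [name for name in get_countries(doc) if name.startswith(prefix)]


# force=True only avoids an error if this is re-run in the same session.
Doc.set_extension("countries", getter=get_countries, force=True)
Doc.set_extension("countries_starting_with", method=countries_starting_with, force=True)

# Build a Doc by hand so that no trained pipeline is needed.
doc = Doc(Vocab(), words=["She", "moved", "from", "Peru", "to", "Chile"])
doc.ents = [Span(doc, 3, 4, label="GPE"), Span(doc, 5, 6, label="GPE")]

print(doc._.countries)                     # ['Peru', 'Chile']
print(doc._.countries_starting_with("C"))  # ['Chile']
```

A getter is probably the better fit when the value needs no arguments, since it reads like a plain attribute (doc._.countries). A method extension is only needed if the caller has to pass something in.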

In the example you linked, I called the object tokens because the has_tech_org method is used for the Doc and the Span. Both have tokens you can iterate over, so we can reuse the method – but maybe the code would have been clearer here if I had called that argument obj for object.
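As a rough illustration of that reuse, the sketch below registers one function as the getter for both Doc and Span. It is a simplified stand-in for the linked example, not a copy of it: the TECH_ORGS set and the plain text comparison are assumptions made to keep it self-contained.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hypothetical lookup set, used here instead of a matcher or token flag.
TECH_ORGS = {"Google", "Apple"}


def has_tech_org(obj):
    # obj may be a Doc or a Span; iterating either one yields Token objects.
    return any(token.text in TECH_ORGS for token in obj)


# The same function serves as the getter on both classes.
Doc.set_extension("has_tech_org", getter=has_tech_org, force=True)
Span.set_extension("has_tech_org", getter=has_tech_org, force=True)

doc = Doc(Vocab(), words=["I", "like", "Apple", "pie", "a", "lot"])

print(doc._.has_tech_org)       # True: the Doc contains "Apple"
print(doc[3:6]._.has_tech_org)  # False: the Span "pie a lot" does not
```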

Btw, if you’re adding extension attributes on tokens and spans and you need the parent Doc, it’s available as the token.doc and span.doc attributes :slightly_smiling_face:
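For instance, a Token or Span getter can reach the parent Doc like this. This is only a sketch, and the extension names relative_position and covers_whole_doc are made up for illustration.

```python
from spacy.tokens import Doc, Span, Token
from spacy.vocab import Vocab


def get_relative_position(token):
    # token.doc is the parent Doc, so its length is available here.
    return token.i / len(token.doc)


def covers_whole_doc(span):
    # span.doc is the parent Doc of the Span.
    return span.start == 0 and span.end == len(span.doc)


Token.set_extension("relative_position", getter=get_relative_position, force=True)
Span.set_extension("covers_whole_doc", getter=covers_whole_doc, force=True)

doc = Doc(Vocab(), words=["one", "two", "three", "four"])

print([token._.relative_position for token in doc])  # [0.0, 0.25, 0.5, 0.75]
print(doc[0:4]._.covers_whole_doc)                   # True
print(doc[1:3]._.covers_whole_doc)                   # False
```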
