How to incorporate document metadata?


(Hugo Duncan) #1


We have a spaCy pipeline that processes messages from chats. We use a document per message and have Doc extensions to pass in extra information that is used in our pipeline components, e.g. an identifier for each chat, the names of the participants, who sent the chat message, etc. We then have to separately call Language.make_doc to create a Doc instance, add the required data, and then thread the document through a loop over the pipeline components (is there a better way of handling this in spaCy?).

I’m wondering how we can use Prodigy with this pipeline, as I see no way to pass in the information required. I see the Prodigy meta field, but don’t think that does what we need.

Any suggestions?

(Ines Montani) #2

You could use Language.pipe and set as_tuples=True to process the documents as (text, context) tuples. For example:

from spacy.tokens import Doc

# register the custom attribute once, before assigning to it
Doc.set_extension('meta_data', default=None)

messages = [('This is a message', {'user_id': 123})]
for doc, meta in nlp.pipe(messages, as_tuples=True):
    # do something with the docs and meta, e.g. add it to a custom attribute
    doc._.meta_data = meta

Annotation tasks are simple dictionaries that can be structured however you like. Depending on the recipe and annotation interface you want to use, Prodigy expects some keys to be present (for example, "text", "spans" or "label") – but aside from that, you can freely add your own fields and data that should be stored with the task. For example:

{"text": "This is a message", "user_id": 123}

When you annotate the task, the annotations are added and the task is saved to the database, together with your custom fields.

The "meta" field is mostly intended for metadata that should be displayed to the annotator (in the bottom right corner of the annotation card). You could consider adding some of your metadata here as well, especially in the early development / exploration phase. If you come across strange or interesting examples, it’s often good to know their context.
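To illustrate, a task for one of the chat messages described above might look like this (the field names user_id, chat and sender are hypothetical, not fields Prodigy requires):

```python
# a hypothetical chat-message task: custom fields are stored with the
# annotation, while the "meta" dict is rendered on the annotation card
task = {
    "text": "This is a message",
    "user_id": 123,  # custom field, saved to the database with the task
    "meta": {"chat": "chat-42", "sender": "alice"},  # shown to the annotator
}
```

Anything outside the keys a given interface needs ("text" here) simply travels along with the example and comes back out when you export your annotations.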

(Hugo Duncan) #3

Thank you for your quick reply, Ines. I hadn’t seen as_tuples, so thank you for pointing it out. However, I don’t think it does quite what we need, as the context isn’t available to the pipeline components when processing a document, if I understand correctly.

Similarly, the custom fields in the task are not available to the pipeline components executed by Prodigy as part of the spaCy model, as far as I can tell.

The use case here is to get references to chat participants annotated with PARTICIPANT labels. Knowing the names of the participants obviously makes this task much easier. We could resort to using a global data structure, but that would break spaCy’s nice model packaging, and wouldn’t work with Prodigy.

(Ines Montani) #4

Ah, sorry – I didn’t realise you needed the metadata to be available for other pipeline components. In that case, a solution could be to use your own tokenizer and make_doc function that takes both the text and the metadata, for example as a JSON-serializable object:

from spacy.tokens import Doc

# register the custom attribute once, before assigning to it
Doc.set_extension('metadata', default=None)

def custom_tokenizer(tokenizer):
    def make_doc(data):
        doc = tokenizer(data['text'])   # tokenize the text
        doc._.metadata = data           # attach the metadata to the Doc
        return doc
    return make_doc

nlp.tokenizer = custom_tokenizer(nlp.tokenizer)

This would let you call nlp on a dictionary instead of a string. The tokenizer then takes care of extracting the text, tokenizing it and setting the metadata as an attribute on the newly constructed Doc.

doc = nlp({'text': 'This is a message', 'user_id': 123})
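Putting the pieces together for the PARTICIPANT use case, here’s a minimal, self-contained sketch. The mark_participants component and the participants field are hypothetical, and it uses a blank pipeline so nothing needs to be downloaded. Note that newer spaCy versions (v3+) only accept strings or Doc objects in Language.__call__, so the sketch goes through nlp.make_doc instead of calling nlp on the dict directly:

```python
import spacy
from spacy.tokens import Doc, Span

# hypothetical attribute name; register once
Doc.set_extension("metadata", default=None)

def custom_tokenizer(tokenizer):
    def make_doc(data):
        doc = tokenizer(data["text"])  # tokenize just the text field
        doc._.metadata = data          # keep the whole payload around
        return doc
    return make_doc

nlp = spacy.blank("en")
nlp.tokenizer = custom_tokenizer(nlp.tokenizer)

def mark_participants(doc):
    # hypothetical component: label tokens that match participant names
    names = set(doc._.metadata.get("participants", []))
    doc.ents = [Span(doc, token.i, token.i + 1, label="PARTICIPANT")
                for token in doc if token.text in names]
    return doc

doc = nlp.make_doc({"text": "Alice pinged Bob",
                    "participants": ["Alice", "Bob"]})
doc = mark_participants(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

A real component would likely use the Matcher or PhraseMatcher instead of exact token comparison, but the key point is that every component can read the metadata off doc._.metadata.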

You can also package your tokenizer and your other custom pipeline components with your model by including them in the model package’s __init__.py. If your model’s load() method returns an initialised Language class, you’ll be able to load it like any other spaCy model.

If you do end up modifying the tokenizer to allow your Language object to take a dictionary, you will need to customise a few things in your Prodigy recipes – ultimately, this depends on what exactly you’re trying to annotate and how you want to present the annotation tasks.

(Hugo Duncan) #5

Sounds like a solution. Thank you Ines!