How to incorporate document metadata?

Hi,

We have a spaCy pipeline that processes messages from chats. We use one document per message and have Doc extensions to pass in extra information that is used in our pipeline components, e.g. an identifier for each chat, the names of the participants, who sent the chat message, etc. We then have to separately call Language.make_doc to create a Doc instance, add the required data, and then thread the document through a loop over the pipeline components (is there a better way of handling this in spaCy?).

I’m wondering how we can use Prodigy with this pipeline, as I see no way to pass in the information required. I see the Prodigy meta field, but don’t think that does what we need.

Any suggestions?

You could use Language.pipe with as_tuples=True to process the documents as (text, context) tuples. For example:

from spacy.tokens import Doc

# register the custom attribute before assigning to it
Doc.set_extension('meta_data', default=None)

messages = [('This is a message', {'user_id': 123})]
for doc, meta in nlp.pipe(messages, as_tuples=True):
    # do something with the docs and meta, e.g. add it to a custom attribute
    doc._.meta_data = meta

Annotation tasks are simple dictionaries that can be structured however you like. Depending on the recipe and annotation interface you want to use, Prodigy expects some keys to be present (for example, "text", "spans" or "label") – but aside from that, you can freely add your own fields and data that should be stored with the task. For example:

{"text": "This is a message", "user_id": 123}

When you annotate the task, the annotations are added and it's saved to the database, together with your custom fields.

The "meta" field is mostly intended for metadata that should be displayed to the annotator (in the bottom right corner of the annotation card). You could consider adding some of your metadata here as well, especially in the early development / exploration phase. If you come across strange or interesting examples, it's often good to know their context.
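As a concrete sketch (the field names here are made up for illustration), a single task can combine your own custom fields with a "meta" block that's shown to the annotator:

```python
task = {
    "text": "This is a message",
    "user_id": 123,  # custom field: stored with the annotation, not displayed
    "meta": {"chat_id": "room-42", "sender": "alice"},  # shown on the annotation card
}
```

Everything in the dictionary is saved back to the database with the annotation, so the custom fields survive the round trip.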

Thank you for your quick reply, Ines. I hadn’t seen as_tuples, so thank you for pointing it out. However, I don’t think it does quite what we need: if I understand correctly, the context isn’t available to the pipeline components while they process a document.

Similarly, the custom fields in the task are not available to the pipeline components that Prodigy executes as part of the spaCy model, as far as I can tell.

The use case here is to get references to chat participants annotated with PARTICIPANT labels. Knowing the names of the participants obviously makes this task much easier. We could resort to using a global data structure, but that would break spaCy’s nice model packaging and wouldn’t work with Prodigy.

Ah, sorry – I didn’t realise you needed the metadata to be available for other pipeline components. In that case, a solution could be to use your own tokenizer and make_doc function that takes both the text and the metadata, for example as a JSON-serializable object:

from spacy.tokens import Doc

# register the custom attribute before assigning to it
Doc.set_extension('metadata', default=None)

def custom_tokenizer(tokenizer):
    def make_doc(data):
        doc = tokenizer(data['text'])  # tokenize the text
        doc._.metadata = data          # store the metadata on the Doc
        return doc
    return make_doc

nlp.tokenizer = custom_tokenizer(nlp.tokenizer)

This would let you call nlp on a dictionary instead of a string. The tokenizer then takes care of extracting the text, tokenizing it and setting the metadata as an attribute on the newly constructed Doc.

doc = nlp({'text': 'This is a message', 'user_id': 123})

You can also package your tokenizer and your other custom pipeline components with your model by including them in the model package’s __init__.py. If your model’s load() method returns an initialised Language class, you’ll be able to load it like any other spaCy model.

If you do end up modifying the tokenizer to allow your Language object to take a dictionary, you will need to customise a few things in your Prodigy recipes – ultimately, this depends on what exactly you’re trying to annotate and how you want to present the annotation tasks.
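To make the idea above concrete, here's a self-contained sketch of how a recipe-side stream wrapper could feed each full task dict through the pipeline before it reaches the annotator. This uses a blank English pipeline as a stand-in for a real model, and the names `preprocess` and `n_tokens` are illustrative, not Prodigy conventions:

```python
import spacy
from spacy.tokens import Doc

# register the custom attribute before assigning to it
Doc.set_extension("metadata", default=None)

def custom_tokenizer(tokenizer):
    def make_doc(data):
        doc = tokenizer(data["text"])  # tokenize just the text
        doc._.metadata = data          # keep the whole task dict on the Doc
        return doc
    return make_doc

nlp = spacy.blank("en")  # stand-in for your packaged model
nlp.tokenizer = custom_tokenizer(nlp.tokenizer)

def preprocess(stream):
    # each Prodigy task dict is passed to the pipeline as-is
    for eg in stream:
        doc = nlp.make_doc(eg)      # runs the custom tokenizer
        eg["n_tokens"] = len(doc)   # example: derive extra info from the Doc
        yield eg

tasks = list(preprocess([{"text": "This is a message", "user_id": 123}]))
```

The wrapper leaves the original task fields untouched, so the custom metadata still ends up in the database alongside the annotations.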

Sounds like a solution. Thank you Ines!

Hi @ines, I would like to apply a NER model to (unannotated) examples in JSONL format. So far, I can load the examples from the database and tag them like this:

import spacy
from prodigy.components.db import connect

DB = connect()
examples = DB.get_dataset("t4")
examples = [eg for eg in examples if eg['answer'] == 'accept']
nlp = spacy.load('model_T_2_1')

texts = [eg['text'] for eg in examples]
examples_tag = []
for doc in nlp.pipe(texts):
    # get all existing entity spans with start, end and label
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_, 'text': ent.text} for ent in doc.ents]
    examples_tag.append({'text': doc.text, 'spans': spans})

Q1: Is there a Prodigy utility to apply a model to a JSONL file?
Q2: I would also like to carry the meta from the JSONL over into the tagged examples, but it’s not clear to me how to do that.

Thanks in advance for any suggestions

All the best

C.

Hi @Cristiano74,

You might find the logic in this recipe helpful: it shows how to queue up examples and feed them through the NER model for annotation.

For your second question, do you simply want the metadata to be displayed to the annotator? You can do that by adding a meta key to the examples, e.g. {"text": "some text", "meta": {"key": "value"}}. This won’t use the metadata as a feature in the model, though. There’s not really a good way to do that currently.
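If the goal is just to carry the meta along programmatically, one option is the as_tuples approach from earlier in this thread. Here's a sketch using a blank English pipeline as a stand-in for your trained model (a blank pipeline predicts no entities, so spans will be empty here):

```python
import spacy

# stand-in for your trained model, e.g. nlp = spacy.load('model_T_2_1')
nlp = spacy.blank("en")

examples = [{"text": "This is a message", "meta": {"user_id": 123}}]

# pair each text with its full original example
pairs = [(eg["text"], eg) for eg in examples]
tagged = []
for doc, eg in nlp.pipe(pairs, as_tuples=True):
    spans = [{"start": ent.start_char, "end": ent.end_char,
              "label": ent.label_, "text": ent.text} for ent in doc.ents]
    # copy the original meta into the tagged example
    tagged.append({"text": doc.text, "spans": spans, "meta": eg.get("meta", {})})
```

Because the context travels with each text through nlp.pipe, the output keeps the original meta next to the predicted spans.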

Thanks a lot @honnibal, I’ve just added the code for applying a model to a JSONL file here: https://gist.github.com/cristiano74/d62b741351fe9508d209bb4b82faf1d6

I hope it can be useful for other newbies like me.

All the best

C.


Hi @ines

Do you have an example of that implementation somewhere? Sounds like what I'd need, thanks.

If you use the spacy package command, it’ll output all the files you need for the Python package, including the templates for the setup.py and __init__.py. You can also find more info and examples here: Training Pipelines & Models · spaCy Usage Documentation
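For reference, the load() entry point in the generated __init__.py looks roughly like this (a minimal sketch based on spaCy's package template; your generated file may include extra metadata handling):

```python
from spacy.util import load_model_from_init_py

def load(**overrides):
    # load the model data that lives next to this __init__.py
    return load_model_from_init_py(__file__, **overrides)
```

Because load() returns an initialised Language object, the installed package can be loaded with spacy.load just like any built-in model.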
