How to incorporate document metadata?

hugoduncan · February 8, 2018, 1:53pm

Hi,

We have a spaCy pipeline that processes messages from chats. We use one document per message and have Doc extensions to pass in extra information that is used in our pipeline components, e.g. an identifier for each chat, the names of the participants, who sent the chat message, etc. We then have to separately call Language.make_doc to create a Doc instance, add the required data, and then thread the document through a loop over the pipeline components (is there a better way of handling this in spaCy?).

I’m wondering how we can use Prodigy with this pipeline, as I see no way to pass in the information required. I see the Prodigy meta field, but don’t think that does what we need.

Any suggestions?

ines · February 8, 2018, 2:26pm

You could use Language.pipe and set as_tuples=True to process the documents as (text, context) tuples. For example:

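from spacy.tokens import Doc
Doc.set_extension('meta_data', default=None)  # register the custom attribute first

messages = [('This is a message', {'user_id': 123})]
for doc, meta in nlp.pipe(messages, as_tuples=True):
    # do something with the docs and meta, e.g. add it to a custom attribute
    doc._.meta_data = meta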

Annotation tasks are simple dictionaries that can be structured however you like. Depending on the recipe and annotation interface you want to use, Prodigy expects some keys to be present (for example, "text", "spans" or "label") – but aside from that, you can freely add your own fields and data that should be stored with the task. For example:

{"text": "This is a message", "user_id": 123}

When you annotate the task, the annotations are added and it's saved to the database, together with your custom fields.

The "meta" field is mostly intended for metadata that should be displayed to the annotator (in the bottom right corner of the annotation card). You could consider adding some of your metadata here as well, especially in the early development / exploration phase. If you come across strange or interesting examples, it's often good to know their context.
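As a rough illustration of the task structure described above, here is a minimal sketch of building one task dictionary for a chat message. Only "text" and "meta" come from the description above; the make_task helper and the chat_id, sender and participants fields are hypothetical names chosen for this example.

```python
import json


def make_task(text, chat_id, sender, participants):
    """Build one annotation task dict for a chat message.

    Only "text" and "meta" are keys named in the thread; the other
    field names are hypothetical custom fields.
    """
    return {
        "text": text,
        # custom fields stored with the task
        "chat_id": chat_id,
        "sender": sender,
        "participants": participants,
        # subset of the metadata intended for display to the annotator
        "meta": {"chat": chat_id, "sender": sender},
    }


task = make_task("Hi Anna, are you free later?", 42, "Ben", ["Anna", "Ben"])
line = json.dumps(task)  # one line of a JSONL input file
```

Each such line can be written to a JSONL file, so the custom fields travel with the text.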

hugoduncan · February 8, 2018, 3:37pm

Thank you for your quick reply, Ines. I hadn’t seen as_tuples, so thank you for pointing it out. However, I don’t think it does quite what we need, as the context isn’t available to the pipeline components when processing a document, if I understand correctly.

Similarly, the custom fields in the task are not available in the pipeline components executed by Prodigy as part of the spaCy model, as far as I can tell.

The use case here is to get references to chat participants annotated with PARTICIPANT labels. Knowing the names of the participants obviously makes this task much easier. We could resort to using a global data structure, but that would break spaCy’s nice model packaging, and wouldn’t work with Prodigy.

ines · February 8, 2018, 4:16pm

Ah, sorry – I didn’t realise you needed the metadata to be available for other pipeline components. In that case, a solution could be to use your own tokenizer and make_doc function that takes both the text and the metadata, for example as a JSON-serializable object:

from spacy.tokens import Doc
Doc.set_extension('metadata', default=None)  # register the custom attribute

def custom_tokenizer(tokenizer):
    def make_doc(data):
        doc = tokenizer(data['text'])  # tokenize the text
        doc._.metadata = data          # do something with the meta
        return doc
    return make_doc

nlp.tokenizer = custom_tokenizer(nlp.tokenizer)

This would let you call nlp on a dictionary instead of a string. The tokenizer then takes care of extracting the text, tokenizing it and setting the metadata as an attribute on the newly constructed Doc.

doc = nlp({'text': 'This is a message', 'user_id': 123})

You can also package your tokenizer and your other custom pipeline components with your model by including them in the model package’s __init__.py. If your model’s load() method returns an initialised Language class, you’ll be able to load it like any other spaCy model.

If you do end up modifying the tokenizer to allow your Language object to take a dictionary, you will need to customise a few things in your Prodigy recipes – ultimately, this depends on what exactly you’re trying to annotate and how you want to present the annotation tasks.
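An alternative that keeps nlp working on plain strings is the approach from the first post: build the Doc with make_doc, attach the metadata, and run the components in a loop. The following is a minimal sketch of that, assuming spaCy v3 (component registration works differently in v2). The component name participant_matcher, the metadata attribute, the participants key and the process helper are all hypothetical, and only single-token names are matched.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span

# Register the custom attribute once; force=True makes re-running safe.
Doc.set_extension("metadata", default=None, force=True)


@Language.component("participant_matcher")  # hypothetical component name
def participant_matcher(doc):
    """Label tokens that equal a participant name found in doc._.metadata."""
    meta = doc._.metadata or {}
    names = {name.lower() for name in meta.get("participants", [])}
    doc.ents = [
        Span(doc, token.i, token.i + 1, label="PARTICIPANT")
        for token in doc
        if token.lower_ in names
    ]
    return doc


nlp = spacy.blank("en")  # blank pipeline, no trained model needed
nlp.add_pipe("participant_matcher")


def process(nlp, task):
    """Tokenize, attach the task as metadata, then run every component."""
    doc = nlp.make_doc(task["text"])
    doc._.metadata = task
    for _name, component in nlp.pipeline:
        doc = component(doc)
    return doc


doc = process(nlp, {"text": "Thanks Anna, see you at five.",
                    "participants": ["Anna", "Ben"]})
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

Because the metadata is set before any component runs, every component can read it from the Doc, with no global state involved.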

hugoduncan · February 8, 2018, 4:34pm

Sounds like a solution. Thank you Ines!

Cristiano74 · October 30, 2018, 5:02pm

Hi @ines, I would like to apply an NER model to unannotated examples in JSONL format. So far, I can get tagged examples from the database like this:

import spacy
from prodigy.components.db import connect

DB = connect()
examples = DB.get_dataset("t4")
examples = [eg for eg in examples if eg['answer'] == 'accept']
nlp = spacy.load('model_T_2_1')

texts = [eg['text'] for eg in examples]
examples_tag = []
for doc in nlp.pipe(texts):
    # get all existing entity spans with start, end and label
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_, 'text': ent.text} for ent in doc.ents]
    examples_tag.append({'text': doc.text, 'spans': spans})

Q1: Is there a Prodigy utility to apply the model to a JSONL file?
Q2: I would also like to carry the meta from the JSONL file over into the tagged examples, but it is not clear to me how to do that.

Thanks in advance for any suggestions

All the best

C.

honnibal · October 31, 2018, 12:41pm

Hi @Cristiano74,

You might find the logic in this recipe useful:

github.com/explosion/prodigy-recipes/blob/master/ner/ner_make_gold.py

# coding: utf8
from __future__ import unicode_literals

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string, set_hashes
import spacy
import copy


def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    # Process the stream using spaCy's nlp.pipe, which yields doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    texts = ((eg['text'], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:

This file has been truncated.

It shows how to enqueue examples and feed them through the NER model, for annotation.

For your second question, do you simply want the metadata to be displayed to the annotator? You can do that by adding a meta key to the examples, e.g. {"text": "some text", "meta": {"key": "value"}}. This won’t use the metadata as a feature in the model, though. There’s not really a good way to do that currently.
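If the goal is only to keep the meta from the JSONL input in the tagged output, one option is the (text, context) pattern the recipe starts with: pass each example along as context and copy it before adding spans. Below is a minimal sketch of that idea. It is not the truncated remainder of the recipe, and it assumes spaCy v3. The blank pipeline with an entity_ruler is only a stand-in for a trained model, and tag_examples is a hypothetical helper name.

```python
import copy

import spacy


def tag_examples(nlp, examples):
    """Yield copies of the input examples with predicted entity spans added."""
    pairs = ((eg["text"], eg) for eg in examples)
    for doc, eg in nlp.pipe(pairs, as_tuples=True):
        task = copy.deepcopy(eg)  # keeps "meta" and any other custom fields
        task["spans"] = [
            {"start": ent.start_char, "end": ent.end_char,
             "label": ent.label_, "text": ent.text}
            for ent in doc.ents
        ]
        yield task


# Stand-in pipeline so the sketch runs without a trained model (assumption).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Acme"}])

examples = [{"text": "I work at Acme.", "meta": {"source": "chat-7"}}]
tagged = list(tag_examples(nlp, examples))
```

With a real model, replace the stand-in pipeline with spacy.load and the name of your model package.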

Cristiano74 · November 5, 2018, 2:08pm

Thanks a lot @honnibal, I’ve just added the code for applying a model to a JSONL file here: https://gist.github.com/cristiano74/d62b741351fe9508d209bb4b82faf1d6

I hope it will be useful to other newbies like me.
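The gist is not reproduced here. As a minimal sketch of the JSONL reading and writing that such a script needs, using only the Python standard library (the helper names read_jsonl and write_jsonl are hypothetical and not taken from the gist):

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, records):
    """Write an iterable of dicts as one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Round trip through a temporary file; the "meta" field survives unchanged.
records = [{"text": "some text", "meta": {"key": "value"}}]
path = os.path.join(tempfile.mkdtemp(), "examples.jsonl")
write_jsonl(path, records)
loaded = list(read_jsonl(path))
```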

All the best

C.

nix411 · February 24, 2019, 5:38pm

Hi @ines

Do you have an example of that implementation somewhere? Sounds like what I'd need, thanks.

ines · February 24, 2019, 6:31pm

If you use the spacy package command, it'll output all files you need for the Python package, including the templates for the setup.py and __init__.py. You can also find more info and examples here: Training Pipelines & Models · spaCy Usage Documentation
