We have a spaCy pipeline that processes messages from chats. We use one document per message and have Doc extensions to pass in extra information that is used in our pipeline components, e.g. an identifier for each chat, the names of the participants, who sent the chat message, etc. We then have to separately call Language.make_doc to create a Doc instance, add the required data, and then thread the document through a loop over the pipeline components (is there a better way of handling this in spaCy?).
I’m wondering how we can use Prodigy with this pipeline, as I see no way to pass in the information required. I see the Prodigy meta field, but don’t think that does what we need.
You could use Language.pipe and set as_tuples=True to process the documents as (text, context) tuples. For example:
from spacy.tokens import Doc

# register the custom attribute once, before processing
Doc.set_extension('meta_data', default=None)

messages = [('This is a message', {'user_id': 123})]
for doc, meta in nlp.pipe(messages, as_tuples=True):
    # do something with the doc and meta, e.g. add it to a custom attribute
    doc._.meta_data = meta
Annotation tasks are simple dictionaries that can be structured however you like. Depending on the recipe and annotation interface you want to use, Prodigy expects some keys to be present (for example, "text", "spans" or "label") – but aside from that, you can freely add your own fields and data that should be stored with the task. For example:
{"text": "This is a message", "user_id": 123}
When you annotate the task, the annotations are added and it's saved to the database, together with your custom fields.
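For illustration, an annotated example might come back out of the database looking something like this (the text, span and answer values here are made up):

{"text": "Hi Alice, how are you?", "user_id": 123, "spans": [{"start": 3, "end": 8, "label": "PARTICIPANT"}], "answer": "accept"}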
The "meta" field is mostly intended for metadata that should be displayed to the annotator (in the bottom right corner of the annotation card). You could consider adding some of your metadata here as well, especially in the early development / exploration phase. If you come across strange or interesting examples, it's often good to know their context.
Thank you for your quick reply, Ines. I hadn't seen as_tuples, so thank you for pointing it out. However, I don't think it does quite what we need, as the context isn't available to the pipeline components when processing a document, if I understand correctly.
Similarly, the custom fields in the task are not available to the pipeline components executed by Prodigy as part of the spaCy model, as far as I can tell.
The use case here is to get references to chat participants annotated with PARTICIPANT labels. Knowing the names of the participants obviously makes this task much easier. We could resort to using a global data structure, but that would break spaCy's nice model packaging, and wouldn't work with Prodigy.
Ah, sorry – I didn’t realise you needed the metadata to be available for other pipeline components. In that case, a solution could be to use your own tokenizer and make_doc function that takes both the text and the metadata, for example as a JSON-serializable object:
from spacy.tokens import Doc

# register the custom attribute the tokenizer will set
Doc.set_extension('metadata', default=None)

def custom_tokenizer(tokenizer):
    def make_doc(data):
        doc = tokenizer(data['text'])  # tokenize the text
        doc._.metadata = data          # attach the metadata to the new doc
        return doc
    return make_doc

nlp.tokenizer = custom_tokenizer(nlp.tokenizer)
This would let you call nlp on a dictionary instead of a string. The tokenizer then takes care of extracting the text, tokenizing it and setting the metadata as an attribute on the newly constructed Doc.
doc = nlp({'text': 'This is a message', 'user_id': 123})
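The metadata is then available to every component that runs after the tokenizer. As a rough sketch (the component name and the naive exact-match logic are just illustrative, not part of spaCy or Prodigy), a component could use the participant names from the metadata to add PARTICIPANT entities:

from spacy.tokens import Span

def participant_matcher(doc):
    # look up the participant names the tokenizer attached to the doc
    names = set(doc._.metadata.get('participants', []))
    spans = [Span(doc, token.i, token.i + 1, label='PARTICIPANT')
             for token in doc if token.text in names]
    doc.ents = list(doc.ents) + spans  # will raise if spans overlap
    return doc

nlp.add_pipe(participant_matcher, last=True)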
You can also package your tokenizer and your other custom pipeline components with your model by including them in the model package’s __init__.py. If your model’s load() method returns an initialised Language class, you’ll be able to load it like any other spaCy model.
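For example, the package's __init__.py might look roughly like this (a sketch, assuming custom_tokenizer is defined in or imported into the package):

from spacy.util import load_model_from_init_py

def load(**overrides):
    nlp = load_model_from_init_py(__file__, **overrides)
    # wrap the default tokenizer so nlp can be called on a dict
    nlp.tokenizer = custom_tokenizer(nlp.tokenizer)
    return nlp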
If you do end up modifying the tokenizer to allow your Language object to take a dictionary, you will need to customise a few things in your Prodigy recipes – ultimately, this depends on what exactly you’re trying to annotate and how you want to present the annotation tasks.
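As a rough illustration (plain Python, no special Prodigy API – the function name is just a placeholder), a custom recipe could wrap its stream so each whole task dict is passed to the model and the predicted entities are added as spans:

def add_model_spans(stream, nlp):
    for eg in stream:
        doc = nlp(eg)  # the custom tokenizer extracts eg['text'] itself
        eg['spans'] = [{'start': ent.start_char, 'end': ent.end_char,
                        'label': ent.label_} for ent in doc.ents]
        yield eg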
Hi @ines, I would like to apply an NER model to (unannotated) examples in JSONL format. I already have a way to get examples tagged from the DB:
import spacy
from prodigy.components.db import connect

DB = connect()
examples = DB.get_dataset("t4")
examples = [eg for eg in examples if eg['answer'] == 'accept']

nlp = spacy.load('model_T_2_1')

texts = [eg['text'] for eg in examples]
examples_tag = []
for doc in nlp.pipe(texts):
    # get all existing entity spans with start, end and label
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_, 'text': ent.text} for ent in doc.ents]
    examples_tag.append({'text': doc.text, 'spans': spans})
Q1: Is there a Prodigy utility to apply the model to a JSONL file?
Q2: I would also like to get the meta from the JSONL into the tagged examples, but it's not clear to me how to do that.
For your first question: it shows how to enqueue examples and feed them through the NER model for annotation.
For your second question, do you simply want the metadata to be displayed to the annotator? You can do that by adding a meta key to the examples, e.g. {"text": "some text", "meta": {"key": "value"}}. This won’t use the metadata as a feature in the model, though. There’s not really a good way to do that currently.
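If you also want to carry the meta through to the tagged examples programmatically, the loop above could pair each text with its source example via as_tuples (a sketch, assuming each input example may carry a "meta" field):

data = [(eg['text'], eg) for eg in examples]
examples_tag = []
for doc, eg in nlp.pipe(data, as_tuples=True):
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_, 'text': ent.text} for ent in doc.ents]
    examples_tag.append({'text': doc.text, 'spans': spans,
                         'meta': eg.get('meta', {})})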