Convert output of spaCy PhraseMatcher to prodigy JSONL

maxfarrell · April 27, 2021, 5:12pm

I am trying to train an NER model (to detect parasites). I have a good amount of meta-data for my documents, and from these have gathered a list of parasite names. I use the matcher function in spaCy to identify these in the documents:

import spacy
from spacy.matcher import PhraseMatcher

paras = parasite_names_from_meta_data

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = set(paras)
# Skip float entries (e.g. missing values) in the list of names
patterns = [nlp.make_doc(text) for text in terms if not isinstance(text, float)]
matcher.add("TerminologyList", patterns)

train_data = []
for doc in docs:
    matches = matcher(doc, as_spans=True)
    # Drop overlapping matches, keeping the longest span
    matches = spacy.util.filter_spans(matches)
    entity_offsets = {"entities": [[span.start_char, span.end_char, "PARASITE"] for span in matches]}
    train_data.append([
        doc,
        entity_offsets
    ])

I want to use Prodigy to A) validate these matches to make sure the name mentioned is actually a parasite in this context, and B) train / update an NER model with these data.

This is likely a simple hurdle, but I can't seem to find any examples of converting data to the JSONL format expected by Prodigy. It would be great to use the existing training data I have made (via the above script), or I could maybe just use these as patterns directly within Prodigy, but either way I can't find an example of converting data to the proper JSONL format. Is there a function or package I can follow? (This is also the first time I'm working with JSON files.)

ines · April 28, 2021, 1:19am

Hi! The code you have pretty much already contains everything you need. Prodigy expects the spans to be a list of dictionaries with the "start" (start offset), "end" (end offset) and "label" (also see here for an example). So in your case, that would be:

spans = [{"start": span.start_char, "end": span.end_char, "label": "PARASITE"} for span in matches]
train_data.append({"text": doc.text, "spans": spans})
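
If the training data has already been built in the [text, {"entities": [...]}] layout from the original script, it could also be converted after the fact. This is only a minimal sketch: it assumes each item stores the document text as a string (doc.text rather than the Doc object), and the function name to_prodigy_tasks and the sample sentence are made up for illustration.

```python
def to_prodigy_tasks(train_data):
    """Convert [text, {"entities": [[start, end, label], ...]}] items
    into task dictionaries with "text" and "spans" keys."""
    tasks = []
    for text, annotations in train_data:
        spans = [
            {"start": start, "end": end, "label": label}
            for start, end, label in annotations["entities"]
        ]
        tasks.append({"text": text, "spans": spans})
    return tasks


# Hypothetical example item; offsets are character positions in the text.
text = "Samples were positive for Toxoplasma gondii."
start = text.index("Toxoplasma gondii")
end = start + len("Toxoplasma gondii")
train_data = [[text, {"entities": [[start, end, "PARASITE"]]}]]

tasks = to_prodigy_tasks(train_data)
span = tasks[0]["spans"][0]
# Slicing the text with the span offsets should give back the matched name.
print(text[span["start"]:span["end"]])
```

Slicing the text with each span's offsets, as in the last line, is a quick way to check that the offsets still line up before loading the file into Prodigy.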

For exporting JSONL, we have some utility functions in our library srsly. Ultimately, JSONL is just a file with one JSON object per line, so it can be read in line by line. You can also use JSON instead (it might just be slower for larger corpora because the whole file needs to be read into memory and parsed first).

import srsly
examples = [...]  # list of dictionaries here
srsly.write_jsonl("/path/to/file.jsonl", examples)
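
To make the format concrete, here is roughly what writing and reading JSONL amounts to using only the standard library. This is a sketch of the file format, not of how srsly is implemented; the file name and the example records are placeholders.

```python
import json
import os
import tempfile

# Placeholder records in the "text" + "spans" layout described above.
examples = [
    {
        "text": "Toxoplasma gondii was detected.",
        "spans": [{"start": 0, "end": 17, "label": "PARASITE"}],
    },
    {"text": "No parasites were found.", "spans": []},
]

path = os.path.join(tempfile.mkdtemp(), "examples.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Read the file back line by line; each line parses on its own,
# so the whole file never has to be held in memory at once.
with open(path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```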

Yes, if the recipe you're using supports patterns (e.g. ner.manual), that might be an even easier option! The expected format is pretty straightforward and is described in the "Loaders and Input Data" section of the Prodigy docs. Your patterns.jsonl file could then look like this (sorry, I don't know enough about parasites to come up with reasonable examples):

{"label": "PARASITE", "pattern": "some parasite name"}
{"label": "PARASITE", "pattern": "some other parasite name"}

The "pattern" can either be a string (for phrase matches) or a token-based pattern (in the format of spaCy's Matcher). Note that Prodigy currently doesn't have an option for specifying the phrase matcher attr to match on (e.g. attr="LOWER" like in your example). So you might have to add multiple versions of the pattern.
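
One possible alternative to listing several casings by hand is to generate token-based patterns that match on the lowercase form of each token. The sketch below is an assumption-laden illustration rather than a recommended recipe: it assumes whitespace splitting is close enough to spaCy's tokenization for the names in question (names containing hyphens or punctuation would need the real tokenizer), and the helper name and sample names are made up.

```python
import json


def make_lowercase_pattern(name, label="PARASITE"):
    """Build a token-based pattern that matches the name case-insensitively.

    Assumes one token per whitespace-separated word (hypothetical helper)."""
    tokens = [{"lower": token.lower()} for token in name.split()]
    return {"label": label, "pattern": tokens}


# Made-up sample names; the float stands in for a missing value in the metadata.
names = ["Toxoplasma gondii", "Giardia", float("nan"), "Giardia"]

# Keep only strings, drop duplicates, and emit one JSON object per line.
unique_names = sorted({name for name in names if isinstance(name, str)})
lines = [json.dumps(make_lowercase_pattern(name)) for name in unique_names]
print("\n".join(lines))
```

The printed lines could be saved as a patterns.jsonl file in the same one-object-per-line layout shown above.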

maxfarrell · April 29, 2021, 3:04pm

Hi @ines ! This is fantastic ~ thank you for taking the time to spell it out (and for the srsly package).

With this I can easily load my PhraseMatcher annotations into Prodigy and review them using ner.manual.

One issue I've just realized after all this is that prodigy doesn't seem to be compatible with the language models I'm using (scispacy v0.4.0, built on spaCy v3). I guess I'll have to wait until spaCy 3 compatibility is implemented to use these models for things like ner.teach, and the prodigy training pipeline?

SofieVL · May 3, 2021, 8:40pm

Hi Max,

That's right - the current Prodigy version v1.10.x requires spaCy v2.x and is not compatible with spaCy v3.x. We are working on a Prodigy nightly that will be compatible with spaCy v3.x and will be released as v1.11.x when ready.
