Convert output of spaCy PhraseMatcher to Prodigy JSONL

I am trying to train an NER model (to detect parasites). I have a good amount of metadata for my documents, and from it I have gathered a list of parasite names. I use spaCy's PhraseMatcher to identify these in the documents:

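import spacy
from spacy.matcher import PhraseMatcher

# nlp is the loaded pipeline; docs are the already processed documents
paras = parasite_names_from_meta_data

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = set(paras)
# Skip float entries (e.g. NaN values from missing metadata)
patterns = [nlp.make_doc(text) for text in terms if not isinstance(text, float)]
matcher.add("TerminologyList", patterns)

train_data = []
for doc in docs:
    matches = matcher(doc, as_spans=True)
    # Drop overlapping matches, keeping the longest span
    matches = spacy.util.filter_spans(matches)
    entity_offsets = {"entities": [[span.start_char, span.end_char, "PARASITE"] for span in matches]}
    train_data.append((doc.text, entity_offsets))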

I want to use Prodigy to A) validate these matches to make sure the name mentioned is actually a parasite in this context, and B) train / update an NER model with these data.

This is likely a simple hurdle, but I can't seem to find any examples of converting data to the JSONL format expected by Prodigy. It would be great to use the existing training data I have made (via the script above), or I could maybe just use these names as patterns directly within Prodigy. Either way, I can't find an example of converting data to the proper JSONL format. Is there a function or package I can use? (This is also the first time I'm working with JSON files.)

Hi! The code you have pretty much already contains everything you need :blush: Prodigy expects the spans to be a list of dictionaries with the keys "start" (start character offset), "end" (end character offset) and "label" (also see here for an example). So in your case, inside the loop over your docs, that would be:

spans = [{"start": span.start_char, "end": span.end_char, "label": "PARASITE"} for span in matches]
train_data.append({"text": doc.text, "spans": spans})
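
If the training data has already been built in the offset format from the question, it can also be converted after the fact. The following is a minimal sketch, assuming each item is a (text, annotations) pair where annotations looks like {"entities": [[start, end, label], ...]}. The helper name to_prodigy_example and the sample text are made up for illustration.

```python
def to_prodigy_example(text, annotations):
    """Convert one (text, {"entities": [...]}) pair into a Prodigy-style dict."""
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in annotations["entities"]
    ]
    return {"text": text, "spans": spans}


# Hypothetical input, only to show the shape of the data
text = "The sample contained some parasite name."
name = "some parasite name"
start = text.index(name)
end = start + len(name)
annotations = {"entities": [[start, end, "PARASITE"]]}

example = to_prodigy_example(text, annotations)
print(example)
```

Applied to a whole dataset, this would be something like [to_prodigy_example(text, annots) for text, annots in train_data], provided train_data really holds pairs of that shape.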

For exporting JSONL, we have some utility functions in our library srsly. Ultimately, JSONL is just a file with one JSON object per line, so it can be read in line by line. You can also use JSON instead (it might just be slower for larger corpora because the whole file needs to be read into memory and parsed first).

import srsly
examples = [...]  # list of dictionaries here
srsly.write_jsonl("/path/to/file.jsonl", examples)
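
To make the format itself concrete, here is roughly what writing and reading JSONL amounts to using only the standard library's json module. This is a sketch of the file format, not a description of how srsly works internally. The function names, the sample examples and the temporary file path are all made up for illustration.

```python
import json
import os
import tempfile


def write_jsonl(path, examples):
    # Write one JSON object per line
    with open(path, "w", encoding="utf8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")


def read_jsonl(path):
    # Yield one parsed object at a time instead of loading the whole file
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical examples in the {"text": ..., "spans": [...]} shape
examples = [
    {"text": "first text", "spans": []},
    {
        "text": "some parasite name here",
        "spans": [{"start": 0, "end": 18, "label": "PARASITE"}],
    },
]

path = os.path.join(tempfile.mkdtemp(), "examples.jsonl")
write_jsonl(path, examples)
loaded = list(read_jsonl(path))
```

Because each line is a complete JSON object on its own, a reader can process the file one example at a time, which is where the advantage over a single large JSON file comes from.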

And yes, using your terms as patterns would also work: if the recipe you're using supports patterns (e.g. ner.manual), that might be an even easier option! See here for the expected format, which is pretty straightforward. Your patterns.jsonl file could then look like this (sorry, I don't know enough about parasites to come up with reasonable examples :sweat_smile:):

{"label": "PARASITE", "pattern": "some parasite name"}
{"label": "PARASITE", "pattern": "some other parasite name"}

The "pattern" can either be a string (for phrase matches) or a token-based pattern (in the format of spaCy's Matcher). Note that Prodigy currently doesn't have an option for specifying the phrase matcher attr to match on (e.g. attr="LOWER" like in your example). So you might have to add multiple versions of the pattern.
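
One way to generate those multiple versions is to write out a few common casings of each term. This is only a sketch: the helper name pattern_variants and the choice of casings are assumptions for illustration, and unlike attr="LOWER" it will only match the casings that are explicitly listed.

```python
import json


def pattern_variants(term, label="PARASITE"):
    """Build exact-string patterns for a few common casings of one term."""
    variants = []
    candidates = (
        term,
        term.lower(),
        term.upper(),
        term.capitalize(),
        term.title(),
    )
    for variant in candidates:
        if variant not in variants:  # keep order, drop duplicates
            variants.append(variant)
    return [{"label": label, "pattern": variant} for variant in variants]


patterns = pattern_variants("some parasite name")

# One JSON object per line, as in the patterns.jsonl file
lines = [json.dumps(pattern) for pattern in patterns]
print("\n".join(lines))
```

A token-based pattern that matches on the lowercase form of each token may be another route, but it depends on how the tokenizer splits each name, so it would need checking against your data.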

Hi @ines ! This is fantastic ~ thank you for taking the time to spell it out (and for the srsly package).

With this I can easily load my PhraseMatcher annotations into Prodigy and review them using ner.manual :slight_smile:

One issue I've only just realized is that Prodigy doesn't seem to be compatible with the language models I'm using (scispacy v0.4.0, built on spaCy v3). I guess I'll have to wait until spaCy v3 compatibility is implemented to use these models for things like ner.teach and the Prodigy training pipeline?

Hi Max,

That's right - the current Prodigy version v1.10.x requires spaCy v2.x and is not compatible with spaCy v3.x. We are working on a Prodigy nightly that will be compatible with spaCy v3.x and will be released as v1.11.x when ready :slight_smile: