Translating recipe tags to a spaCy custom pipeline component

Is there a simple way to obtain the same tags that a Prodigy recipe provides, but through a custom pipeline component in spaCy? I'm guessing this is a process that happens under the hood in Prodigy.

For example, say I'd like a spaCy custom pipeline component that gives me the same initial tags as the following command, taken from one of your recipe examples:

prodigy coref.manual coref_movies en_core_web_sm ./plot_summaries.jsonl --label COREF

Thank you

Hi! I'm not 100% sure I understand the question correctly, do you have a more specific example? Do you want to train a pipeline component that assigns coreference relationships annotated in Prodigy?

Apologies if I wasn't clear! I'm trying to get exactly the same tags from the spaCy Matcher as those provided by a Prodigy recipe.

For example, in spaCy I'm doing:

 import spacy
 from spacy.matcher import Matcher
 
 nlp = spacy.load("en_core_web_lg")
 matcher = Matcher(nlp.vocab)
 
 # Noun phrase pattern copied from recipes/coref.py
 pattern_np = [
     {"POS": "DET", "TAG": {"NOT_IN": ["PRP$"]}, "OP": "?"},
     {"POS": "ADJ", "OP": "*"},
     {
         "POS": {"IN": ["PROPN", "NOUN"]},
         "OP": "+",
         "ENT_TYPE": {"NOT_IN": ["PERSON", "ORG"]},
     },
 ]
 # Any single pronoun
 pattern_prn = [{"POS": "PRON"}]
 
 matcher.add("NP", [pattern_np])
 matcher.add("PRN", [pattern_prn])

I copied the NP pattern from the recipes/coref.py file as suggested in another post.

Now for example:

 doc = nlp("Eddie fought the wild rabid wolves with Pickles the husky. He was a good boy.")
 
 matches = matcher(doc)
 for match_id, start, end in matches:
     string_id = nlp.vocab.strings[match_id]  # Get string representation
     span = doc[start:end]  # The matched span
     print(string_id, start, end, span.text)
 
 for ent in doc.ents:
     # Use the loop variable (not doc.ents[0]) so every entity is reported
     print('ENT({})'.format(ent.label_), ent.start, ent.end, ent.text)

Which gives me as output:

 NP 2 6 the wild rabid wolves
 NP 3 6 wild rabid wolves
 NP 4 6 rabid wolves
 NP 5 6 wolves
 NP 7 8 Pickles
 PRN 11 12 He
 NP 13 16 a good boy
 NP 14 16 good boy
 NP 15 16 boy
 ENT(PERSON) 0 1 Eddie

Here I notice that when NP spans are nested, the NP that Prodigy gives me is the longest one (e.g. the wild rabid wolves). Also, I might be missing some POS patterns that are not explicit in coref.py.

Ideally I'd like to have exactly the same match patterns as a recipe like:

prodigy coref.manual coref_movies en_core_web_lg ./plot_summaries.jsonl --label COREF

Which gives me:

So I was just wondering if there's some functionality to obtain the Matcher used in a recipe.

Ahhh okay, thanks for the clarification! 👍

The pattern defined in coref.py is the only pattern used by the workflow. By default, the recipe also includes named entities (everything in doc.ents) and, optionally, noun chunks (everything in doc.noun_chunks).

spaCy's filter_spans helper (in spacy.util) lets you filter a list of spans so that, where spans overlap, only the (first) longest match is kept. This is also what Prodigy uses under the hood. Together, this should give you the same results as the coref recipe's suggestions 🙂
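As a rough sketch of how those pieces could be combined in a custom component (this is not the recipe's actual source): the component below collects doc.ents plus the NP matches and passes them through filter_spans. The component name `coref_candidates` and the `doc._.coref_candidates` extension are made up for this example. So that the snippet runs without downloading a model, it annotates a tiny Doc by hand; in a real pipeline the tagger and NER would supply those attributes.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

# NP pattern as quoted earlier in this thread (copied from recipes/coref.py)
PATTERN_NP = [
    {"POS": "DET", "TAG": {"NOT_IN": ["PRP$"]}, "OP": "?"},
    {"POS": "ADJ", "OP": "*"},
    {
        "POS": {"IN": ["PROPN", "NOUN"]},
        "OP": "+",
        "ENT_TYPE": {"NOT_IN": ["PERSON", "ORG"]},
    },
]

# Hypothetical extension attribute, named only for this example
if not Doc.has_extension("coref_candidates"):
    Doc.set_extension("coref_candidates", default=None)


@Language.factory("coref_candidates")  # hypothetical component name
def create_coref_candidates(nlp, name):
    matcher = Matcher(nlp.vocab)
    matcher.add("NP", [PATTERN_NP])

    def component(doc):
        # Named entities first, then every (possibly nested) NP match
        spans = list(doc.ents)
        spans.extend(matcher(doc, as_spans=True))
        # Keep only the (first) longest span wherever spans overlap
        doc._.coref_candidates = filter_spans(spans)
        return doc

    return component


# Demo on a hand-annotated Doc (assumption: these are the tags and
# entities a trained pipeline would predict for this sentence)
nlp = spacy.blank("en")
component = nlp.add_pipe("coref_candidates")
doc = Doc(
    nlp.vocab,
    words=["Eddie", "fought", "the", "wild", "wolves"],
    pos=["PROPN", "VERB", "DET", "ADJ", "NOUN"],
    tags=["NNP", "VBD", "DT", "JJ", "NNS"],
    ents=["B-PERSON", "O", "O", "O", "O"],
)
doc = component(doc)
for span in doc._.coref_candidates:
    print(span.label_, span.start, span.end, span.text)
# PERSON 0 1 Eddie
# NP 2 5 the wild wolves
```

With a trained pipeline you would instead load it with spacy.load and call nlp.add_pipe("coref_candidates", last=True), so the component runs after the tagger and NER. If the pipeline has a parser, noun chunks could be added in the same way with spans.extend(doc.noun_chunks) before filtering.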

Great, thank you!