Translating recipe tags to a Spacy custom pipeline component

lalopey · February 22, 2021, 1:32pm

I was wondering if there is a simple way to obtain the same tags provided by a Prodigy recipe through a custom pipeline component in spaCy? I'm guessing this is a process that happens under the hood in Prodigy.

For example, say I'd like a spaCy custom pipeline component that gives me the same initial tags as this command, taken from one of your recipe examples:

prodigy coref.manual coref_movies en_core_web_sm ./plot_summaries.jsonl --label COREF

Thank you

ines · February 23, 2021, 11:21pm

Hi! I'm not 100% sure I understand the question correctly. Do you have a more specific example? Do you want to train a pipeline component that assigns coreference relationships annotated in Prodigy?

lalopey · February 25, 2021, 6:56am

Apologies if I wasn't clear! I'm trying to use the spaCy Matcher to get exactly the same tags as those provided by a Prodigy recipe.

For example, in SpaCy I'm doing:

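 import spacy
 from spacy.matcher import Matcher
 
 nlp = spacy.load("en_core_web_lg")
 matcher = Matcher(nlp.vocab)
 # Noun phrase: optional determiner (not a possessive pronoun), any adjectives,
 # then one or more nouns/proper nouns that are not PERSON or ORG entities
 pattern_np = [
     {"POS": "DET", "TAG": {"NOT_IN": ["PRP$"]}, "OP": "?"},
     {"POS": "ADJ", "OP": "*"},
     {
         "POS": {"IN": ["PROPN", "NOUN"]},
         "OP": "+",
         "ENT_TYPE": {"NOT_IN": ["PERSON", "ORG"]},
     },
 ]
 pattern_prn = [{"POS": "PRON"}]  # any pronoun
 
 matcher.add("NP", [pattern_np])
 matcher.add("PRN", [pattern_prn])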

I copied the NP pattern from the recipes/coref.py file as suggested in another post.

Now for example:

 doc = nlp("Eddie fought the wild rabid wolves with Pickles the husky. He was a good boy.")
 
 matches = matcher(doc)
 for match_id, start, end in matches:
     string_id = nlp.vocab.strings[match_id]  # Get string representation
     span = doc[start:end]  # The matched span
     print(string_id, start, end, span.text)
 
 for ent in doc.ents:
     # Use the loop variable (not doc.ents[0]) so every entity is reported
     print('ENT({})'.format(ent.label_), ent.start, ent.end, ent.text)

Which gives me as output:

 NP 2 6 the wild rabid wolves
 NP 3 6 wild rabid wolves
 NP 4 6 rabid wolves
 NP 5 6 wolves
 NP 7 8 Pickles
 PRN 11 12 He
 NP 13 16 a good boy
 NP 14 16 good boy
 NP 15 16 boy
 ENT(PERSON) 0 1 Eddie

Here I notice that when NP spans are nested, Prodigy only gives me the longest one (e.g. the wild rabid wolves). Also, I might be missing some POS patterns that are not explicit in coref.py.

Ideally, I'd like to use exactly the same match patterns as a recipe like:

prodigy coref.manual coref_movies en_core_web_lg ./plot_summaries.jsonl --label COREF

Which gives me:

So I was just wondering if there's some functionality to obtain the Matcher used in a recipe.

ines · February 25, 2021, 10:58am

Ahhh okay, thanks for the clarification!

The pattern defined in coref.py is the only pattern used by the workflow. By default, the recipe also includes named entities (so everything in doc.ents) and, optionally, noun chunks (everything in doc.noun_chunks).

The filter_spans helper (spacy.util.filter_spans) lets you filter a list of overlapping spans so that only the (first) longest matches are kept. This is also what Prodigy uses under the hood. Together, this should give you the same results as the coref recipe's suggestions.
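
As a rough sketch of how these pieces could fit together (not Prodigy's actual implementation): the snippet below collects the Matcher spans, adds doc.ents, and passes everything through filter_spans. The helper name candidate_spans is made up for illustration, and the Doc is built by hand with assumed POS tags and one assumed PERSON entity so the example runs without downloading a model.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Patterns as posted in this thread (the NP pattern comes from recipes/coref.py).
PATTERN_NP = [
    {"POS": "DET", "TAG": {"NOT_IN": ["PRP$"]}, "OP": "?"},
    {"POS": "ADJ", "OP": "*"},
    {
        "POS": {"IN": ["PROPN", "NOUN"]},
        "OP": "+",
        "ENT_TYPE": {"NOT_IN": ["PERSON", "ORG"]},
    },
]
PATTERN_PRN = [{"POS": "PRON"}]


def candidate_spans(doc, matcher):
    """Hypothetical helper: Matcher spans plus entities, with overlaps removed."""
    spans = matcher(doc, as_spans=True)  # Span objects labelled "NP" / "PRN"
    spans.extend(doc.ents)               # named entities, as the recipe includes by default
    return filter_spans(spans)           # keep the (first) longest of overlapping spans


vocab = Vocab()
matcher = Matcher(vocab)
matcher.add("NP", [PATTERN_NP])
matcher.add("PRN", [PATTERN_PRN])

# Hand-annotated Doc (assumed tags) standing in for the output of a trained pipeline.
doc = Doc(
    vocab,
    words=["Eddie", "fought", "the", "wild", "rabid", "wolves", "."],
    pos=["PROPN", "VERB", "DET", "ADJ", "ADJ", "NOUN", "PUNCT"],
    tags=["NNP", "VBD", "DT", "JJ", "JJ", "NNS", "."],
    ents=["B-PERSON", "O", "O", "O", "O", "O", "O"],
)

result = [(s.label_, s.start, s.end, s.text) for s in candidate_spans(doc, matcher)]
for row in result:
    print(*row)
# PERSON 0 1 Eddie
# NP 2 6 the wild rabid wolves
```

With a real pipeline you would pass nlp(text) instead of the hand-built Doc and create the Matcher from nlp.vocab. If you also want noun chunks, adding spans.extend(doc.noun_chunks) before the filter_spans call should work, assuming the pipeline has a parser.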

lalopey · February 25, 2021, 11:02am

Great, thank you!
