Apologies if I wasn't clear! I'm trying to get exactly the same tags from the spaCy Matcher as those produced by a Prodigy recipe.
For example, in spaCy I'm doing:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
pattern_np = [
    {"POS": "DET", "TAG": {"NOT_IN": ["PRP$"]}, "OP": "?"},
    {"POS": "ADJ", "OP": "*"},
    {
        "POS": {"IN": ["PROPN", "NOUN"]},
        "OP": "+",
        "ENT_TYPE": {"NOT_IN": ["PERSON", "ORG"]},
    },
]
pattern_prn = [{"POS": "PRON"}]
matcher.add("NP", [pattern_np])
matcher.add("PRN", [pattern_prn])
I copied the NP pattern from the recipes/coref.py file as suggested in another post.
Now for example:
doc = nlp("Eddie fought the wild rabid wolves with Pickles the husky. He was a good boy.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(string_id, start, end, span.text)
for ent in doc.ents:
    # Use the loop variable rather than doc.ents[0] so every entity is reported
    print("ENT({})".format(ent.label_), ent.start, ent.end, ent.text)
This gives me the following output:
NP 2 6 the wild rabid wolves
NP 3 6 wild rabid wolves
NP 4 6 rabid wolves
NP 5 6 wolves
NP 7 8 Pickles
PRN 11 12 He
NP 13 16 a good boy
NP 14 16 good boy
NP 15 16 boy
ENT(PERSON) 0 1 Eddie
Here I notice that when NP spans are nested, the Matcher returns every sub-span, whereas the NP that Prodigy gives me is only the longest span (e.g. "the wild rabid wolves"). Also, I might be missing some POS patterns that are not explicit in coref.py.
Ideally, I'd like to use exactly the same Matcher patterns as a recipe like:
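As a workaround for the nested spans, I could filter the matches myself and keep only the longest non-overlapping ones. Below is a rough sketch in plain Python; `keep_longest` is a hypothetical helper of my own, and I'm only assuming this resembles what the recipe does internally. The tuples are copied from the output above.

```python
def keep_longest(matches):
    """Keep the longest non-overlapping (label, start, end) matches."""
    # Sort longest first; on ties, prefer the span that starts earlier.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept, seen = [], set()
    for label, start, end in ordered:
        # Skip any span that shares a token with one already kept.
        if any(i in seen for i in range(start, end)):
            continue
        kept.append((label, start, end))
        seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# (label, start, end) tuples taken from the Matcher output above.
matches = [
    ("NP", 2, 6), ("NP", 3, 6), ("NP", 4, 6), ("NP", 5, 6),
    ("NP", 7, 8), ("PRN", 11, 12),
    ("NP", 13, 16), ("NP", 14, 16), ("NP", 15, 16),
]
longest = keep_longest(matches)
print(longest)
```

As far as I can tell, spaCy's `spacy.util.filter_spans` does the same kind of filtering on Span objects, and in spaCy v3 `matcher.add` accepts `greedy="LONGEST"`, but I don't know whether either matches the recipe's behaviour exactly.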
prodigy coref.manual coref_movies en_core_web_lg ./plot_summaries.jsonl --label COREF
Which gives me:
So I was just wondering if there's some functionality to obtain the Matcher used in a recipe.