Input pattern file to terms.teach

Hi guys!

Is it possible to add a patterns file as an argument to the terms.teach recipe? I used to see this being available in older versions of Prodigy, but I don't see it anymore in the current documentation.

I think the patterns would be useful to kickstart and narrow the search space.

Thanks! Jason

Hi Jason.

Do you remember which Prodigy version that was? I checked the changelog but couldn't find such a change there.

That said, if you check the terms.teach recipe you should see a --seeds option that accepts a comma-separated list of terms, but also a path to a file with one term per line. Does that suffice? I can imagine that we don't support general patterns here because the terms.teach recipe assumes single-token terms, while patterns can span multiple tokens.
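For illustration, a call with a seeds file might look something like this (the dataset name and file name are just placeholders):

prodigy terms.teach food_terms en_core_web_lg --seeds seeds.txt

where seeds.txt contains one term per line.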

Hi koaning,

Actually, one token doesn't suffice. I would like to have multi-token patterns generated from terms.teach as a pattern file for text classification.

I understand that we can use sense2vec instead; however, can I still use spaCy's vector dictionary? I am training on Chinese text, so I am using zh_core_web_lg, but I encountered issues running it.

Thanks!

The idea behind the sense2vec trick might work, but unfortunately the current pre-trained model only provides support for English.

I'm about to propose another trick that might work, but I want to be careful not to over-promise anything since I don't speak Chinese. One thing you might do is create a spaCy script that fetches "chunks of tokens that might form a noun phrase". For English, this is supported directly via Doc.noun_chunks, but I believe it isn't supported for Chinese. You might, however, try to construct something similar by hand.

Here's how I might construct it for English using noun_chunks.

import spacy 

nlp = spacy.load("en_core_web_md")

doc = nlp("Pepperoni pizzas are an amazing Italian dish.")
for chunk in doc.noun_chunks:
    print(chunk)

# Pepperoni pizzas
# an amazing Italian dish

Here's another way of doing something similar without using the .noun_chunks property. To see how it works, it helps to first render the dependency parse with displaCy.

from spacy import displacy

# Renders the parse inline in a Jupyter notebook; in a plain
# script, use displacy.serve(doc) instead.
displacy.render(doc)

We could find chunks manually by looking for noun tokens that act as a "root". That is to say, we're looking for nouns that have children in the dependency graph.

Here's a little script that can do that.

for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            # Span from the leftmost to the rightmost token among
            # the noun and its direct children.
            token_idx = [tok.i] + [t.i for t in children]
            print(doc[min(token_idx): max(token_idx) + 1])
# Pepperoni pizzas
# an amazing Italian dish

There are variants of this script you might consider, but this is a way to fetch multi-token chunks from your corpus. And this might also work for Chinese.
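One such variant, as a sketch using the same English doc: instead of only the direct children, you could span the noun's entire subtree via the left_edge and right_edge token attributes, which also captures nested modifiers (such as prepositional phrases) that aren't direct children of the noun.

for tok in doc:
    if tok.pos_ == "NOUN" and list(tok.children):
        # left_edge/right_edge cover the token's whole subtree,
        # not just its direct children.
        print(doc[tok.left_edge.i : tok.right_edge.i + 1])

On this example sentence it prints the same two chunks, but the results can differ on sentences with deeper modifier structure.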

Here's an example I made using Google translate.

import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("意大利辣香肠比萨饼是一道很棒的意大利菜")

for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            print(doc[min(token_idx): max(token_idx) + 1])
# 意大利辣香肠
# 是一道很棒的意大利菜

From here, you might even be able to construct phrases with vectors from the spaCy pipeline.

import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("意大利辣香肠比萨饼是一道很棒的意大利菜")

phrases = {}
for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            phrase = doc[min(token_idx): max(token_idx) + 1]
            # Store the full vector; it's only truncated in the printout below.
            phrases[phrase.text] = phrase.vector

This will give a dictionary, phrases, that maps each phrase's text to its vector (truncated below for readability).

{
   '意大利辣香肠': array([-0.217975  , -1.4146296 ,  1.3613806 , -0.09676328, -0.1946054 , ...], dtype=float32),
   '是一道很棒的意大利菜': array([-0.19973822,  0.54612845,  0.09049363, -0.17543283,  0.31415954, ...], dtype=float32)
}

These vectors can then be used to find phrases that are similar. This would involve custom code, but might be worth a try.
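As a minimal sketch of what that custom code could look like (assuming the phrases dictionary and nlp pipeline from above; the query phrase here is just an illustration), you could rank the phrases by cosine similarity against a query vector:

import numpy as np

def cosine_similarity(a, b):
    # Guard against zero vectors, which can occur for out-of-vocabulary text.
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0

# Rank the phrases collected earlier against an example query phrase.
query = nlp("意大利菜").vector
for text, vector in sorted(phrases.items(),
                           key=lambda kv: cosine_similarity(query, kv[1]),
                           reverse=True):
    print(text, round(cosine_similarity(query, vector), 3))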

Again, I really want to stress that I cannot judge whether this will work for Chinese, and a lot of the utility will depend on the spaCy pipeline. But I'm mentioning it because the exercise does seem worth a try. Could you let me know whether or not this direction works for you? I'm very much interested in hearing your reply!