terms.to-patterns with existing data

Hello,

Second problem of the day :wink:

When I do:
python3.6 -m prodigy dataset materials "Banque de materiaux"
=> Successfully added 'materials' to database SQLite.
python3.6 -m prodigy terms.to-patterns materials ./materials.jsonl --label MATERIAL
=> Can't find dataset 'materials'

Why? Any ideas?

Thanks

OK, hmm, maybe I don't really understand how it works.
I can see that I never load the text file containing my materials anywhere in the terms.to-patterns command.
Could that be the problem?
I think I don't understand how Prodigy works.

Don’t worry – Prodigy introduces a lot of new concepts, so it’s totally fine if things are a little confusing at first. We hope we can help you get started, and we’re always working on improving the documentation!

terms.to-patterns is a recipe for converting a Prodigy dataset of terms collected in previous annotations to a patterns JSONL file. For example, you can use terms.teach to create a dataset of similar words from word vectors, and then convert that dataset to a patterns file.
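
If you wanted to go that route, the two steps could look something like this (the dataset name and seed terms are just examples, and this assumes the en_core_web_lg vectors are installed):

python3.6 -m prodigy terms.teach materials_terms en_core_web_lg --seeds "steel,wood,concrete"
python3.6 -m prodigy terms.to-patterns materials_terms ./patterns.jsonl --label MATERIAL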

If you already have the terms – for example, your materials – you can skip this step and just create a patterns file yourself. What does your materials.jsonl look like? Here's an example of a patterns.jsonl file – maybe you can just convert your materials to this format?

{"label": "MATERIAL", "pattern": [{"lower": "steel"}]}
{"label": "MATERIAL", "pattern": [{"lower": "hard"}, {"lower": "wood"}]}

Thanks for your answer.
At the moment, my file looks like this:

bois
béton
ciment
fer
aluminium

That’s all.

I was thinking terms.to-patterns could convert a txt file to a JSONL file, but I was probably mistaken :wink:

Yes, terms.to-patterns only works with a Prodigy dataset. If you already have the terms in a file, you can just write a simple script yourself that converts them to a JSONL file like this:

{"label": "MATERIAL", "pattern": [{"lower": "bois"}]}
{"label": "MATERIAL", "pattern": [{"lower": "béton"}]}

You can then use your patterns.jsonl with the other recipes, like ner.teach :blush:

Hi,

Thanks a lot for your help.

Have a good day.


Hi,
Can we have terms with "-"? E.g.
{"pattern": [{"lower": "coca-cola"}], "label": "Product"}

Obviously the above is different from
{"pattern": [{"lower": "coca"}, {"lower": "cola"}], "label": "Product"}

In theory, yes – but keep in mind that the patterns always need to reflect spaCy's tokenization. If the tokenizer splits a hyphenated string, the pattern will never match.
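
You can check this in spaCy directly – for example (assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
# the default English rules split intra-word hyphens
print([token.text for token in nlp("coca-cola")])
# ['coca', '-', 'cola']

So a pattern matching the default tokenization would need three tokens: [{"lower": "coca"}, {"orth": "-"}, {"lower": "cola"}].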

If you want to test your patterns, you might find our Rule-based Matcher Explorer demo useful:

You can enter token patterns on the left, and your text in the box on the right, and it will show you whether the patterns match and if so, how. You can also toggle "Show tokens" to see the model's tokenization.

Thank you very much. You were correct: coca-cola will not match, as the hyphen is tokenized separately.


Hi Ines,
In my case I need to preserve the hyphen instead of splitting there.
If we initialize the Tokenizer with non-default infix, prefix, suffix etc. rules, then it won't split at "-" and the other symbols where spaCy would normally split – correct?

I was able to use something like this in the recent past:

from spacy.tokenizer import Tokenizer

t = Tokenizer(nlp.vocab,
              prefix_search=prefix_re.search,
              suffix_search=suffix_re.search,
              infix_finditer=infix_re.finditer)

with some punctuation symbols taken out of the regexes (prefix_re, infix_re, suffix_re). That way I was able to keep post-secondary, co-op etc. together as single tokens.

I just purchased Prodigy, so I haven't tried it with the above yet. But it should work, right, as long as I can place it in the processing pipeline properly?
Thanks in advance.

Yes, if you save out a model with your own custom tokenizer, you can use it in Prodigy and the text will be split using your custom rules. If you've added the custom tokenizer to your nlp object, calling nlp.to_disk will save the model including the tokenization rules. You can then use the model directory as the input model in Prodigy.

You'll then also be able to write patterns that match your custom tokenization :slightly_smiling_face:
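
For example, a sketch of the whole workflow could look like this (assuming spaCy v2 and the en_core_web_sm model; the hyphen filter below is an assumption, so double-check it against your spaCy version):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# drop the default infix rule that splits on hyphens between letters;
# "-|–|—" is how the hyphen alternatives appear in the default rules
infixes = [rule for rule in nlp.Defaults.infixes if "-|–|—" not in rule]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("post-secondary co-op")])
# ['post-secondary', 'co-op']

# save the model, including the tokenizer rules
nlp.to_disk("./custom_tokenizer_model")

You can then pass ./custom_tokenizer_model as the model argument to recipes like ner.teach.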
