Ok hmmm, maybe I don't really understand how it works.
I can see I don't load my text file containing my materials in the terms.to-patterns command.
Could that be the problem?
I think I don't understand how Prodigy works.
Don’t worry – Prodigy introduces a lot of new concepts, so it’s totally fine if things are a little confusing at first. We hope we can help you get started, and we’re always working on improving the documentation!
terms.to-patterns is a recipe that converts a Prodigy dataset of previously annotated terms into a patterns JSONL file. For example, you can use terms.teach to create a dataset of similar words from word vectors, and then convert that dataset to a patterns file.
If you already have the terms – for example, your materials – you can skip this step. You can also just create a patterns file yourself. What does your materials.jsonl look like? Here’s an example of a patterns.jsonl file – maybe you can just convert your materials to this format?
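For reference, a patterns file is newline-delimited JSON with one pattern per line. Each entry has a label and either a list of token dicts or a plain string for an exact match (the MATERIAL label and terms here are just made-up examples):

```
{"label": "MATERIAL", "pattern": [{"lower": "steel"}]}
{"label": "MATERIAL", "pattern": [{"lower": "carbon"}, {"lower": "fiber"}]}
{"label": "MATERIAL", "pattern": "aluminium"}
```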
Yes, terms.to-patterns only works with a Prodigy dataset. If you already have the terms in a file, you can just write a simple script yourself that converts them to a patterns JSONL file.
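For instance, if your materials file has one term per line, a conversion script could look like this (just a sketch – the materials.txt filename, the output filename and the MATERIAL label are assumptions, not something from your setup):

```python
import json

def terms_to_patterns(terms, label):
    # One pattern per term; multi-word terms become one token
    # dict per whitespace-separated token (this assumes the
    # whitespace split roughly matches spaCy's tokenization)
    return [
        {"label": label, "pattern": [{"lower": t.lower()} for t in term.split()]}
        for term in terms
    ]

# Hypothetical input: materials.txt with one term per line, e.g.
# terms = [line.strip() for line in open("materials.txt") if line.strip()]
terms = ["steel", "carbon fiber"]
with open("material_patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in terms_to_patterns(terms, "MATERIAL"):
        f.write(json.dumps(pattern) + "\n")
```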
In theory, yes – but keep in mind that the patterns always need to reflect spaCy's tokenization. If the tokenizer splits a hyphenated string into several tokens, a single-token pattern for that string will never match.
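You can see this with a blank English pipeline – the default rules split on the hyphen, so the hyphenated term becomes three tokens and the pattern has to list all three:

```python
import spacy

# Blank English pipeline: default tokenizer rules, no trained model needed
nlp = spacy.blank("en")
doc = nlp("post-secondary education")
print([t.text for t in doc])  # → ['post', '-', 'secondary', 'education']

# So a matching pattern needs three token entries, not one:
# [{"lower": "post"}, {"text": "-"}, {"lower": "secondary"}]
```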
If you want to test your patterns, you might find our Rule-based Matcher Explorer demo useful:
You can enter token patterns on the left, and your text in the box on the right, and it will show you whether the patterns match and if so, how. You can also toggle "Show tokens" to see the model's tokenization.
Hi Ines,
In my case, I need to preserve the hyphen instead of splitting there.
If we initialize the Tokenizer with non-default infix, prefix, suffix etc. rules, it will not split at “-” and the other symbols where spaCy normally splits – correct? I've done this before by taking some punctuation symbols out of the regexes (prefix_re, infix_re, suffix_re). That way I was able to keep post-secondary, co-op etc. together as single tokens.
I just purchased Prodigy, so I haven't tried this with it yet. But the above should work, right, as long as I can place the custom tokenizer in the processing pipeline properly?
Thanks in advance.
Yes, if you save out a model with your own custom tokenizer, you can use that in Prodigy and the text will be split using your custom rules. If you've added a custom tokenizer to your nlp object and call nlp.to_disk, you'll be able to save the model including the tokenization rules. You can then use the model directory as the input model in Prodigy.
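For example, one way to keep intra-word hyphens together is to drop the hyphen rule from the default infixes before saving the pipeline. This is a sketch, not the only way to do it – identifying the rule via the HYPHENS character class and the output directory name are my assumptions:

```python
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
# Drop the infix rule that splits on hyphens between letters
# (assumption: it's the only default infix containing HYPHENS)
infixes = [p for p in nlp.Defaults.infixes if HYPHENS not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("post-secondary co-op")])
# → ['post-secondary', 'co-op']

# to_disk also serializes the tokenizer rules; the saved
# directory can then be passed to Prodigy as the input model
nlp.to_disk("custom_tok_model")
```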
You'll then also be able to write patterns that match your custom tokenization.