Ok hmmm, maybe I don't really understand how it works.
I can see I don't load my text file containing my materials in the terms.to-patterns command.
Could that be the problem?
I think I don't understand how Prodigy works.
Don’t worry – Prodigy introduces a lot of new concepts, so it’s totally fine if things are a little confusing at first. We hope we can help you get started, and we’re always working on improving the documentation!
terms.to-patterns is a recipe that converts a Prodigy dataset of previously annotated terms into a patterns JSONL file. For example, you can use terms.teach to create a dataset of similar words from word vectors, and then convert that dataset to a patterns file.
If you already have the terms – for example, your materials – you can skip this step. You can also just create a patterns file yourself. What does your materials.jsonl look like? Here’s an example of a patterns.jsonl file – maybe you can just convert your materials to this format?
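For reference, a patterns file is newline-delimited JSON with one pattern per line. Each entry has a label and either a list of token dicts or a plain string for an exact match (the MATERIAL label and terms here are just made-up examples):

```
{"label": "MATERIAL", "pattern": [{"lower": "steel"}]}
{"label": "MATERIAL", "pattern": [{"lower": "carbon"}, {"lower": "fiber"}]}
{"label": "MATERIAL", "pattern": "aluminium"}
```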
Yes, terms.to-patterns only works with a Prodigy dataset. If you already have the terms in a file, you can just write a simple script yourself that converts them to a patterns JSONL file.
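For instance, if your materials file has one term per line, a conversion script could look like this (just a sketch – the materials.txt filename, the output filename and the MATERIAL label are assumptions, not something from your setup):

```python
import json

def terms_to_patterns(terms, label):
    # One pattern per term; multi-word terms become one token
    # dict per whitespace-separated token (this assumes the
    # whitespace split roughly matches spaCy's tokenization)
    return [
        {"label": label, "pattern": [{"lower": t.lower()} for t in term.split()]}
        for term in terms
    ]

# Hypothetical input: materials.txt with one term per line, e.g.
# terms = [line.strip() for line in open("materials.txt") if line.strip()]
terms = ["steel", "carbon fiber"]
with open("material_patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in terms_to_patterns(terms, "MATERIAL"):
        f.write(json.dumps(pattern) + "\n")
```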
In theory, yes – but keep in mind that the patterns always need to reflect spaCy's tokenization. If the tokenizer splits a hyphenated string into several tokens, a single-token pattern for that string will never match.
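You can see this with a blank English pipeline – the default rules split on the hyphen, so the hyphenated term becomes three tokens and the pattern has to list all three:

```python
import spacy

# Blank English pipeline: default tokenizer rules, no trained model needed
nlp = spacy.blank("en")
doc = nlp("post-secondary education")
print([t.text for t in doc])  # → ['post', '-', 'secondary', 'education']

# So a matching pattern needs three token entries, not one:
# [{"lower": "post"}, {"text": "-"}, {"lower": "secondary"}]
```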
If you want to test your patterns, you might find our Rule-based Matcher Explorer demo useful:
You can enter token patterns on the left, and your text in the box on the right, and it will show you whether the patterns match and if so, how. You can also toggle "Show tokens" to see the model's tokenization.
Hi Ines,
In my case, I need to preserve the hyphen instead of splitting there.
If we initialize the Tokenizer with non-default infix, prefix, suffix etc. rules, it will not split at “-” and the other symbols where spaCy normally splits – correct? I've done this before by taking some punctuation symbols out of the regexes (prefix_re, infix_re, suffix_re). That way I was able to keep post-secondary, co-op etc. together as single tokens.
I just purchased Prodigy, so I haven't tried this with it yet. But the above should work, right, as long as I can place the custom tokenizer in the processing pipeline properly?
Thanks in advance.
Yes, if you save out a model with your own custom tokenizer, you can use that in Prodigy and the text will be split using your custom rules. If you've added a custom tokenizer to your nlp object and call nlp.to_disk, you'll be able to save the model including the tokenization rules. You can then use the model directory as the input model in Prodigy.
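For example, one way to keep intra-word hyphens together is to drop the hyphen rule from the default infixes before saving the pipeline. This is a sketch, not the only way to do it – identifying the rule via the HYPHENS character class and the output directory name are my assumptions:

```python
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
# Drop the infix rule that splits on hyphens between letters
# (assumption: it's the only default infix containing HYPHENS)
infixes = [p for p in nlp.Defaults.infixes if HYPHENS not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("post-secondary co-op")])
# → ['post-secondary', 'co-op']

# to_disk also serializes the tokenizer rules; the saved
# directory can then be passed to Prodigy as the input model
nlp.to_disk("custom_tok_model")
```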
You'll then also be able to write patterns that match your custom tokenization.