Hi,
When I showed our CIO the capabilities of spaCy and Prodigy he was eager to see us move toward production. One of the questions he asked was about maintainability of the rule sets by end users. He asked how hard it would be to provide a UI for rules maintenance.
My answer was:
- easy for keyword lists and phrases
- hard for token-based rules, as there are so many permutations
Have folks talked about or demo'd a UI for end user maintenance of rules? Thoughts on the topic?
BTW, I'm aware of the lovely demo site: https://explosion.ai/demos/matcher which answers my question "have folks talked about or demo'd a UI". Thank you @ines for the demos you've built - all very helpful as I learn how to leverage spaCy/Prodigy in our domain (mental health care).
I'm interested in customizing/using/buying a utility much like the matcher demo but intended for our use case: end users maintaining their existing rulesets for our custom entities/models.
I don't see the code for the demo available on GitHub; I looked, as it would make a good start toward filling our needs. At the very least, the demo gives us a UI pattern to follow...
@ines, if you are willing/able to share your matcher demo code, I will happily submit pull requests or create a separate version that supports our use case. Even if starting from your demo code doesn't make sense, I'll plan on submitting what we build to spaCy Universe, as I imagine ours isn't the only firm that wants to enable users to manage their rules.
Sorry, I didn't get around to this over the weekend! Glad you like the matcher demo!
I think a big advantage of the token-based patterns over just regular expressions is that they're pretty readable and also pretty easy to visualize and process programmatically – after all, they're just lists of dicts.
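Because a pattern is just plain data, a UI can store, reload and inspect it like any other JSON. A minimal sketch (the pattern and its contents here are made-up examples, not from the demo):

```python
import json

# A token-based pattern: one dict per token, attribute -> value.
# Hypothetical pattern for phrases like "mental health care".
pattern = [{"LOWER": "mental"}, {"LOWER": "health"}, {"LOWER": "care"}]

# Plain data round-trips through JSON, so a UI can persist and reload it.
serialized = json.dumps(pattern)
restored = json.loads(serialized)
assert restored == pattern

# It's also easy to process programmatically, e.g. count tokens:
print(len(restored))  # 3
```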
A lot of it also depends on guidelines and best practices you define early on. For instance, you should probably keep your patterns concise and try not to overcomplicate them by using lots of operators and matching too many cases at once. Whether you have 50 or 500 patterns doesn't really make a difference in terms of performance – but 500 concise, easy-to-read patterns for specific cases could actually be much easier to maintain than 50 super complex patterns with conditional logic that cover multiple purposes at once.
I haven't open-sourced that code because it's somewhat entangled with our website – but maybe I could make a small standalone example based on it. The idea is pretty straightforward, actually: each token block becomes an object and each selected attribute (dropdown of attribute names plus input field) becomes a {key: value} pair. At the end of it, you have a list of objects, which is the pattern.
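That mapping from UI state to pattern could look something like this (a sketch, assuming a hypothetical form representation; the demo's actual code may differ):

```python
# Hypothetical form state coming back from the UI: one entry per
# token block, each a list of (attribute, value) selections from
# the dropdown + input field pairs.
form_state = [
    [("LOWER", "mental")],
    [("LOWER", "health"), ("IS_ALPHA", True)],
]

def form_to_pattern(blocks):
    """Turn UI token blocks into a pattern: a list of dicts,
    where each selected attribute becomes a {key: value} pair."""
    return [dict(attrs) for attrs in blocks]

pattern = form_to_pattern(form_state)
# -> [{"LOWER": "mental"}, {"LOWER": "health", "IS_ALPHA": True}]
```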
Define a "complexity" threshold and show a warning if a user tries to create a pattern that's too complex – for instance, 3+ token attributes on several tokens, or more than one operator, etc. This lets you enforce best practices early on and prevents the rules from getting out of control.
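A complexity check like that can be a small pure function over the pattern data; the thresholds below are arbitrary examples, not recommendations:

```python
def is_too_complex(pattern, max_attrs=3, max_ops=1):
    """Flag patterns that exceed simple complexity limits.
    (Thresholds here are arbitrary examples.)"""
    n_ops = sum(1 for token in pattern if "OP" in token)
    # Tokens carrying max_attrs or more non-operator attributes:
    busy_tokens = sum(1 for token in pattern
                      if len([k for k in token if k != "OP"]) >= max_attrs)
    return n_ops > max_ops or busy_tokens >= 2

simple = [{"LOWER": "anxiety"}]
complex_ = [
    {"LOWER": "severe", "OP": "?"},
    {"POS": "ADJ", "IS_ALPHA": True, "LENGTH": 7, "OP": "*"},
    {"LEMMA": "anxiety", "POS": "NOUN", "IS_TITLE": False},
]
print(is_too_complex(simple))    # False
print(is_too_complex(complex_))  # True
```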
Have an endpoint that can take two rule sets (or one new rule and the existing rule set) and returns overlapping matches. This should be pretty easy to do on the back-end – you just need to compare the starts and ends of the matched spans. This way, you can check for conflicts or duplicate functionality as you're adding new rules or editing existing ones.
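The back-end comparison really is just interval arithmetic on the matched spans. A minimal sketch, assuming matches arrive as `(rule_name, start, end)` tuples of token offsets (the rule names are made up):

```python
def overlapping(matches_a, matches_b):
    """Return pairs of rule names whose matched spans overlap.
    Each match is a (rule_name, start, end) tuple of token offsets."""
    pairs = []
    for name_a, start_a, end_a in matches_a:
        for name_b, start_b, end_b in matches_b:
            # Two half-open spans intersect iff each starts before
            # the other ends.
            if start_a < end_b and start_b < end_a:
                pairs.append((name_a, name_b))
    return pairs

existing = [("RULE_OLD", 2, 5)]
new = [("RULE_NEW", 4, 6), ("RULE_NEW_2", 10, 12)]
print(overlapping(existing, new))  # [('RULE_OLD', 'RULE_NEW')]
```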
Have an extensive test suite and define rules and expected matches. You probably want to run this as often as possible, after each edit. Even if new rules perform as intended, they could, in theory, still cause regressions for other cases.
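Such a suite can be as simple as pairing input texts with the matches you expect, then diffing actual against expected after every edit. A sketch with hypothetical cases and a stand-in match function:

```python
# Hypothetical regression suite: each case pairs an input text with
# the matched strings we expect from the full rule set.
TEST_CASES = [
    ("Patient reports mental health concerns.", ["mental health"]),
    ("Follow-up scheduled next week.", []),
]

def run_suite(match_fn):
    """Run every case and return the failures, so even rules that
    work as intended are re-checked for regressions elsewhere."""
    failures = []
    for text, expected in TEST_CASES:
        got = match_fn(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

# `match_fn` would wrap your real matcher; a fake one for the sketch:
fake_matcher = lambda text: ["mental health"] if "mental health" in text else []
print(run_suite(fake_matcher))  # []
```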
If your rules depend on linguistic attributes predicted by a model (e.g. part-of-speech tags), you also want to be running the tests for each new model you train. And you probably want to evaluate your rules by calculating an overall accuracy (instead of asserting that all rules need to match every time). A model is never going to be 100% accurate, so if you rely on a model's predictions, your rules aren't going to be 100% accurate either (which is fine – instead, they can be much more powerful).
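Scoring by overall accuracy instead of all-or-nothing assertions could look like this (the evaluation results are invented for illustration):

```python
def rule_accuracy(results):
    """Overall accuracy over (expected, got) pairs, instead of
    asserting that every rule must match every time."""
    correct = sum(1 for expected, got in results if expected == got)
    return correct / len(results) if results else 0.0

# Made-up results for rules that depend on model predictions:
results = [
    (["anxiety"], ["anxiety"]),
    (["depression"], []),        # model tagged a token differently
    ([], []),
    (["panic attack"], ["panic attack"]),
]
print(rule_accuracy(results))  # 0.75
```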
Thanks @ines, I have so much to learn in this space and I appreciate the time you take to be thoughtful and thorough in your responses.
Apologies for tagging you in multiple posts over the weekend. I'm excited to have obtained approval to proceed on the project and my eagerness leaked out!