More efficient preprocessing with similarity


We are passing an entity annotated with Prodigy to your similarity function to suggest to a user the most appropriate product code out of 4.5k codes, based on which code description best fits the entity.

Before applying the similarity function, we manipulate the word frequencies of the 4.5k code descriptions to surface the most important words. This gives us a huge column where each row has to be run through nlp, which takes about 14 minutes.

Can you recommend a more efficient way of doing this?

How are you currently doing it? Are you using nlp.pipe and disabling the components you don't need? See here for details on efficient processing:
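As a rough sketch of what that could look like, assuming the spaCy v3 API: the blank pipeline, the sentencizer component, and the sample descriptions below are stand-ins for illustration only, not the setup from this thread. With a real model you would load it with `spacy.load` and list the components your task does not need.

```python
import spacy

# Assumption: a blank English pipeline with a rule-based sentencizer stands in
# for the real model, which would normally come from spacy.load(...).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical stand-ins for the 4.5k code descriptions.
descriptions = [
    "Stainless steel hex bolt. Metric thread.",
    "Cotton work gloves. Size large.",
    "Copper pipe fitting. 15 mm.",
]

# nlp.pipe streams the texts in batches instead of calling nlp(text) per row.
# Components listed in `disable` are skipped for this call only.
docs = list(nlp.pipe(descriptions, batch_size=256, disable=["sentencizer"]))
```

Whether this helps depends on how much of the 14 minutes is spent in components the similarity step never uses.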


Thanks Ines. We do need all the pipes, but we have been able to recode so that they are disabled for the largest task and restored afterwards. Thank you.
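For anyone reading later, here is a minimal sketch of the disable-then-restore pattern, assuming spaCy v3 and its `select_pipes` context manager. This is not our actual code: the blank pipeline, the sentencizer, and the sample texts are placeholders.

```python
import spacy

# Assumption: placeholder pipeline; a real one would come from spacy.load(...).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical stand-ins for the rows of the large column.
texts = ["Hex bolt. Metric.", "Work gloves. Large."]

# Inside the block the listed components are switched off for the heavy task.
with nlp.select_pipes(disable=["sentencizer"]):
    active_during = list(nlp.pipe_names)
    fast_docs = list(nlp.pipe(texts))

# On leaving the block the disabled components are restored automatically.
active_after = list(nlp.pipe_names)
```

The components stay in the pipeline the whole time, so later tasks that need all the pipes run unchanged.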
