Identify Email Signatures

I am interested in using Prodigy to tag and then automatically identify signature blocks in email messages.

This thread indicates that Prodigy and spaCy might not be a great fit for this use case:

Is that correct? Any recommendations on good paths forward for this problem?

It’s quite possible that spaCy and Prodigy won’t provide very good out-of-the-box solutions. It’s also really hard to give any sort of recommendation.

What makes NLP “work” as a field is the fact that natural language basically has a lot of common structure to it. We’re all running some sort of software in our heads that lets me generate messages that you’re able to interpret, and vice versa. Unfortunately we don’t know very much about how those programs work, but we can still try to write programs that have some likelihood of doing useful things on messages generated by them.

When you’re dealing with something like email signatures, you’re dealing with messages that are probably generated by a small number of very specific programs: people are probably using the same email clients, maybe the same settings for how their signature blocks work, etc. Depending on the region you’re working with, these programs may be configured differently, or different programs may be common in your dataset. The more you can exploit the regularities in your data, the more accurately and easily you’ll be able to extract information. I can’t really share any useful insights about your particular data, except to say that the programs generating your email signatures are vastly less complicated than the programs we’re using to generate and interpret normal English sentences. Given this difference in complexity, the statistical machinery built into a library like spaCy is probably a poor fit for what you’re doing.
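To make "exploit the regularities" a bit more concrete, here is a minimal sketch in Python. It assumes your messages are already plain text and that at least some clients emit a recognisable marker line: the conventional "-- " signature separator, or a mobile footer such as "Sent from my iPhone". The marker list and the function name `split_signature` are illustrative assumptions, not anything provided by spaCy or Prodigy; you would replace the patterns with whatever actually recurs in your own data.

```python
import re
from typing import Tuple

# Assumed marker patterns; adjust these to the clients seen in your dataset.
SIGNATURE_MARKERS = [
    re.compile(r"^-- ?$"),                            # conventional signature separator line
    re.compile(r"^Sent from my \w+", re.IGNORECASE),  # common mobile-client footer
]


def split_signature(message: str) -> Tuple[str, str]:
    """Split a plain-text message into (body, signature) at the last marker line.

    Returns (message, "") when no marker line is found.
    """
    lines = message.splitlines()
    # Scan from the bottom, since signatures sit at the end of a message.
    for index in range(len(lines) - 1, -1, -1):
        if any(pattern.match(lines[index]) for pattern in SIGNATURE_MARKERS):
            body = "\n".join(lines[:index]).rstrip()
            signature = "\n".join(lines[index:])
            return body, signature
    return message, ""
```

For example, calling `split_signature` on a message that ends with a "-- " line followed by a name and company returns the text above the separator as the body and everything from the separator down as the signature. Messages without any marker come back unchanged with an empty signature, so you can count how much of your dataset a rule like this actually covers before deciding whether you need anything more elaborate.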

You might find the data structures in spaCy useful for tokenizing the text, or manipulating it in various ways. On the other hand, you might find it easier to work with the raw strings, and just use regular expressions. It depends on the data you’re working with, and the specific goals of your processing.
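As a rough illustration of the raw-strings-and-regular-expressions route, the sketch below looks for contact details (email address, phone number, URL) in the last few lines of a message and treats the surrounding block as a candidate signature. Everything here is an assumption for illustration: the three patterns are deliberately loose, the 10-line window is arbitrary, and `guess_signature_start` is a hypothetical helper rather than part of any library. It only uses Python's standard `re` module.

```python
import re
from typing import Optional

# Deliberately loose, assumed patterns; tune them against your own messages.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def looks_like_signature_line(line: str) -> bool:
    """True if the line contains an email address, phone number, or URL."""
    return bool(EMAIL_RE.search(line) or PHONE_RE.search(line) or URL_RE.search(line))


def guess_signature_start(message: str, max_lines: int = 10) -> Optional[int]:
    """Return the index of the first line of a likely trailing signature block.

    Only the last `max_lines` lines are inspected. Returns None when none of
    them contains contact details.
    """
    lines = message.rstrip().splitlines()
    window_start = max(0, len(lines) - max_lines)
    hits = [
        i for i in range(window_start, len(lines))
        if looks_like_signature_line(lines[i])
    ]
    if not hits:
        return None
    # Walk up to the nearest blank line so that a name or job title sitting
    # directly above the contact details is included in the block.
    start = hits[0]
    while start > window_start and lines[start - 1].strip():
        start -= 1
    return start
```

A heuristic like this will misfire on messages whose body mentions a phone number or link near the end, so it is best treated as a first pass for bootstrapping annotations or measuring how regular your data is, not as a finished detector.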