Spancat or NER for email signatures

I'm new to spaCy and Prodigy and have a dataset of email correspondence. To extract the latest message body from a chain of correspondence, I want to train a model to recognize the end of a message (EOM). An EOM could be a simple phrase like "Thanks," or "Best Regards,", a person's name, or some form of email signature such as "Company Name" or "Company Name + Tel:".

My assumption is that I should break these EOMs into their respective types: train a span categorizer for email signatures, then combine it with a phrase matcher for the simpler EOMs such as "Thanks" or "Best, John".

Does this sound like the best method for my objective or am I making a rookie mistake?

Hi @jordandavis, welcome to Prodigy!

You have two options here:

  1. It is usually worth checking first how well a naive, non-ML solution works. Try implementing a simple rule-based function to detect the end of a message (a sign-off or signature) and measure its accuracy. You can use regular expressions, checks on the tokens in a Doc, hand-built business rules, or the PhraseMatcher.
  2. For a machine learning approach, I'd recommend trying the SpanCategorizer. Whichever labels you choose, make sure your annotation scheme is applied consistently.
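To make option 1 concrete, here is a minimal sketch of a rule-based baseline, assuming spaCy v3. The phrase list, the `Tel:` pattern, and the helper names `find_eom` and `strip_signature` are all made up for illustration; you would replace them with phrases and rules drawn from your own emails.

```python
import re

import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")

# Hypothetical sign-off phrases; extend this list from your own data.
SIGN_OFFS = ["thanks", "best regards", "kind regards", "best"]

# attr="LOWER" makes the phrase matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("EOM", [nlp.make_doc(phrase) for phrase in SIGN_OFFS])

# Hypothetical business rule: a line starting with "Tel:" belongs to a signature.
TEL_RE = re.compile(r"^[ \t]*tel\s*:", re.IGNORECASE | re.MULTILINE)


def _at_line_start(text, idx):
    """True if only spaces or tabs separate idx from the start of its line."""
    prefix = text[:idx].rstrip(" \t")
    return prefix == "" or prefix.endswith("\n")


def find_eom(text):
    """Return the character offset where the EOM starts, or None if no rule fires."""
    doc = nlp.make_doc(text)
    candidates = []
    for _match_id, start, _end in matcher(doc):
        idx = doc[start].idx  # character offset of the first matched token
        # Only count sign-offs that begin a line, so "many thanks for..." is ignored.
        if _at_line_start(text, idx):
            candidates.append(idx)
    tel_match = TEL_RE.search(text)
    if tel_match:
        candidates.append(tel_match.start())
    return min(candidates) if candidates else None


def strip_signature(text):
    """Return the message body with everything from the EOM onward removed."""
    eom = find_eom(text)
    body = text if eom is None else text[:eom]
    return body.rstrip()
```

If you annotate a small sample of emails with the true EOM offset, you can compare it against `find_eom` to get a baseline accuracy before deciding whether the SpanCategorizer is worth the annotation effort.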