NER model to extract addresses from text

Hi folks!

I need to train a custom NER model to extract addresses from text, but after many Google searches I can't find a suitable dataset (text containing addresses).
Can anyone help with this? Where could I find such a dataset? Or do you have any other suggestions for building the model?

Hey Zakaria,

I've been trying to build something similar on my end. I have a large corpus of legal documents that have been OCR'd (with varying scan qualities) via Tesseract and I'm looking to extract names and addresses.

I'm currently trying regular expressions to match street addresses, writing the matches to a JSONL file, and cycling through the following recipes until I get results I'm happy with:

  • ner.teach
  • ner.match
  • ner.batch-train
  • ner.train-curve

Check out the flowchart for more details.
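
For anyone wanting a starting point, here is a rough sketch of the regex-to-JSONL step described above. It is not my actual script: the pattern, the `ADDRESS` label, and the helper names `find_address_spans` and `to_jsonl` are all made up for illustration, and the pattern only covers simple US-style street addresses. The record layout (a `text` field plus a `spans` list with `start`, `end` and `label`) is my assumption about what the annotation tool expects, so check it against the docs before relying on it.

```python
import json
import re

# Hypothetical, deliberately simple pattern for US-style street addresses,
# e.g. "123 Main Street" or "42 N. Oak Ave". Real data will need more rules.
STREET_RE = re.compile(
    r"\b\d{1,5}\s+"                      # house number
    r"(?:[A-Z][a-z]*\.?\s+){1,4}"        # one to four capitalized name words
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b"
)


def find_address_spans(text, label="ADDRESS"):
    """Return character-offset spans for every regex match in `text`."""
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in STREET_RE.finditer(text)
    ]


def to_jsonl(texts, path):
    """Write one JSON record per line, each with the text and its spans."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            record = {"text": text, "spans": find_address_spans(text)}
            f.write(json.dumps(record) + "\n")
```

Calling `to_jsonl(my_texts, "addresses.jsonl")` on a list of strings gives a file with pre-highlighted candidates. The regex will miss a lot and produce false positives on OCR'd text, which is fine: the point is only to bootstrap examples that you then correct by hand.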


Hi Jemmy, thanks for your reply.
My problem is that I haven't found any dataset of texts containing addresses. As for annotation, I'll do it manually; I have no issues with that.

What texts do you want to process with the model later on? Can't you just use those texts? You typically want to be training your model on data that's similar to what you'll be analyzing at runtime. Publicly available datasets can be useful sometimes if they're similar enough, but it's better to use your own data.