I bought a license with the hope of cleaning up some historic deed data (early documents typewritten on vellum, then microfilmed, then JPEG-ed).
I ran the whole thing through Tesseract and PICCL, but I haven't been able to get very clean text yet. Here's an example of what I have to work with:
No. 4 Allof Lot “Numbered one (1) md the Fact ote-half of Let numbered
‘Pwo (Be) in Blook twnbered Five (5) in <+- P.T. Smith's Addition to the Tom of St Johne
jand also the West One-half of Lot Numbered Two (Wy 2) in Blook Vumberee Six (6) in P.T.
Smith's Addi tion to the Town of St Jehns as s'iown and designated on the culy recorded plat
: f£ said P.T.Gmith’s Addition to the Town of St Johns, now of record with the clerk of
Henze osiad Colmty O2egon
I want to find all of the entities like:
West One-half of Lot Numbered Two (Wy 2) in Blook Vumberee Six (6) in P.T. Smith's Addi tion to the Town of St Jehns
I have a list of all of the addition names to use as an ontology (in this case, "P.T. Smith's Addition to The Town of St. John's"). At this point, I assume I need to do some manual tagging and annotation.
I am fine with not correcting the text as long as I can identify the parcels.
All ideas appreciated!
Hi @alankessler, welcome!
Your project sounds like an interesting one. I originally bought a license to try detecting harassment in tweets. Prodigy can be used for so many cool things!
I've never done anything like what you present, but I have a few thoughts based on your description and example text:
- your example text is longish; try setting the "segment sentences" option so that the examples you annotate are only roughly a sentence long. This makes annotation go measurably faster (it's easier and quicker to annotate a sentence than a paragraph)
- check out the video series on training an NLP model from scratch that is currently being released. It hasn't gotten to the Prodigy-specific bits yet, but because Prodigy uses spaCy, the advice offered there is great and seems relevant to what you're doing.
- use the Matcher API to build up rules that identify "some" of your desired targets (e.g. "West One-half of Lot Numbered Two (Wy 2) in Blook Vumberee Six (6) in P.T. Smith's Addi tion to the Town of St Jehns"). Once you have patterns, use them with active learning to bootstrap a NER model.
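To make the Matcher idea concrete, here's a rough sketch of one rule for the "Lot Numbered …" piece. The pattern and example sentence are invented for illustration; real OCR noise (like "Blook" for "Block") will need many more variants:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One illustrative rule: "Lot Numbered <word>", optionally followed by a
# parenthesized shorthand like "(Wy 2)".
pattern = [
    {"LOWER": "lot"},
    {"LOWER": "numbered"},
    {"IS_ALPHA": True},
    {"ORTH": "(", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True, "OP": "?"},
    {"ORTH": ")", "OP": "?"},
]
matcher.add("LOT", [pattern])  # spaCy v3 API; v2 uses matcher.add("LOT", None, pattern)

doc = nlp("the West One-half of Lot Numbered Two (Wy 2) in Block Six")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

Because of the optional tokens, the Matcher will report several overlapping candidates; you'd typically keep only the longest span per rule when converting matches to pattern files for Prodigy.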
Hope it helps!
Also, to add to Justin's suggestions: You might also want to look into preprocessing to clean up some of the artifacts of the OCR that you can detect programmatically – for instance, sequences of punctuation that shouldn't be there, or maybe missing whitespace. Those are all things that'll likely make it much harder to tokenize the text and train a model later on. If you haven't seen it already, you might want to check out
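A cleanup pass like that could be a handful of regex rules run before annotation. The rules below are purely illustrative, tuned to the example excerpt; you'd grow them as you spot recurring OCR artifacts in your own data:

```python
import re

def clean_ocr(text: str) -> str:
    """Rough OCR cleanup sketch -- the rules here are examples, not a recipe."""
    # Collapse runs of stray punctuation (e.g. "<+-", ": f£") into a space
    text = re.sub(r"(?:[<>:;‘’£+*~]\s*){2,}", " ", text)
    # Rejoin a word broken by a spurious internal space, e.g. "Addi tion"
    text = re.sub(r"\bAddi\s+tion\b", "Addition", text)
    # Normalize repeated whitespace
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()

print(clean_ocr("P.T. Smith's Addi tion to the Town"))
```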
It might also make sense to tweak some of the existing tokenization rules to better match your data.
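For example, you could add a tokenizer special case so that a recurring abbreviation in your deeds stays a single token (the specific case here is just a guess at what might help with this data):

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical tweak: keep the surveyor's abbreviated name together as one
# token rather than letting the default rules split off the trailing period.
nlp.tokenizer.add_special_case("P.T.", [{"ORTH": "P.T."}])

doc = nlp("P.T. Smith's Addition to the Town of St Johns")
print([t.text for t in doc])
```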
If you have enough raw text, you could train some word vectors using Gensim and initialize the model with them (see the init-model command). The vectors will be used as features in the model and can help encode at least some knowledge of the very custom terminology you're dealing with here.
Finally, I agree that for this use case, you probably want to start labelling by hand (at least initially) and train a new model from scratch. The phrases you're looking for are quite long and seem to include things like prepositions (which is a bit different from what's typically considered a "named entity"). So it might make more sense to create a label scheme for the individual components of your "entities" – even if the model misses one of them, you'd still be able to resolve it to the full span and extract the information you need.
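To make the component idea concrete, here's one hypothetical label scheme with Prodigy-style span dicts. The labels, example text, and helper function are all invented for illustration:

```python
# Hypothetical component labels (LOT, BLOCK, ADDITION) instead of one
# long PARCEL entity -- the pieces can be stitched back together later.
text = "West One-half of Lot Numbered Two (2) in Block Six (6) in Smith's Addition"

def span(phrase: str, label: str) -> dict:
    """Illustrative helper: locate a phrase and emit a span dict with offsets."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

annotation = {
    "text": text,
    "spans": [
        span("Lot Numbered Two (2)", "LOT"),
        span("Block Six (6)", "BLOCK"),
        span("Smith's Addition", "ADDITION"),
    ],
}
print(annotation["spans"])
```

Even if the model misses, say, the BLOCK span in a given deed, the LOT and ADDITION spans still anchor the parcel, which is what makes the component scheme more forgiving than one long entity.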