Thank you for the awesome free spaCy library!
We are looking for an annotation tool that allows us to train our system to extract information from real estate listings in Singapore.
Apart from using Prodigy to manually tag the entities in the listings, can we import lists of entities, such as lists of real estate developers in Singapore/main roads in Singapore/real estate projects in Singapore, so our system can recognise the entities without being trained using labelled examples?
Unfortunately our dataset isn't large enough for the system to learn to recognise and extract ALL those entities.
Hi! The main goal of training a model from labelled examples is to allow your system to generalise and extract other similar entities in similar contexts, even if it hasn't seen them during training. For example, it could recognise a road name that's not in your list because it's mentioned in a similar context as other road names in the data. If that's your goal, you'll want to train a model on examples.
If you have existing lists, you can use them in Prodigy to pre-label examples for you, so you only need to correct the suggestions and fill in the blanks. That's much faster than doing everything by hand. Check out the examples of annotating named entities with patterns: https://prodi.gy/docs/named-entity-recognition#manual-patterns
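As a rough sketch of how existing lists could be turned into a patterns file: Prodigy's patterns are JSONL, one object per line with a `"label"` and a `"pattern"`. The label names, the example entries, the helper names and the `patterns.jsonl` filename below are all placeholders for illustration, and plain string patterns like these only match the exact text.

```python
import json


def build_patterns(entity_lists):
    """Turn {label: [names]} into a list of pattern dicts.

    Each name becomes an exact-string pattern. Blank entries are skipped.
    """
    patterns = []
    for label, names in entity_lists.items():
        for name in names:
            name = name.strip()
            if name:
                patterns.append({"label": label, "pattern": name})
    return patterns


def write_jsonl(patterns, path):
    """Write one JSON object per line (the JSONL format)."""
    with open(path, "w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Illustrative entries only; replace with your own lists.
    entity_lists = {
        "DEVELOPER": ["CapitaLand"],
        "ROAD": ["Orchard Road", "Bukit Timah Road"],
    }
    write_jsonl(build_patterns(entity_lists), "patterns.jsonl")
```

The resulting file can then be passed to a recipe that accepts patterns (for example via a `--patterns` argument on `ner.manual`, as described in the docs linked above), so matches show up pre-highlighted in the annotation interface.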
If you don't want to train a model and just want to recognise whatever is in your lists, you can use spaCy's EntityRuler and load in your lists. See the spaCy documentation on rule-based matching for details.
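A minimal sketch of the list-only approach, assuming spaCy v3 (where the component is added by its string name `"entity_ruler"`; in v2 the setup differs). The labels, list entries, the `build_list_tagger` helper and the sample listing are made up for illustration. Setting `phrase_matcher_attr` to `"LOWER"` is an optional choice here so that matching ignores case.

```python
import spacy


def build_list_tagger(entity_lists):
    """Create a blank English pipeline that only tags entries from the lists."""
    nlp = spacy.blank("en")
    # Match string patterns on lowercase token text, so "orchard road"
    # and "Orchard Road" are both recognised.
    ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
    patterns = [
        {"label": label, "pattern": name}
        for label, names in entity_lists.items()
        for name in names
    ]
    ruler.add_patterns(patterns)
    return nlp


# Illustrative entries only; replace with your own lists.
nlp = build_list_tagger({
    "DEVELOPER": ["CapitaLand"],
    "ROAD": ["Orchard Road", "Bukit Timah Road"],
})

doc = nlp("New condo by CapitaLand near orchard road, 3 bedrooms.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

No training data is needed for this, but it will only ever find what is in the lists.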
You can also combine this approach with a model later on, so you have a system that generalises but also reliably tags whatever is in your list. Definitely run the rule-based approach first and evaluate it, because that gives you a baseline accuracy. (For instance, if you get to 95% using only your lists, it's very likely that a model won't be able to beat that.)
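One way to get such a baseline number is to compare the rule-based predictions against a small set of hand-checked listings. The sketch below scores exact span matches with precision, recall and F-score. The `score_spans` helper and the `(start, end, label)` span format are assumptions made for this example, not part of spaCy or Prodigy.

```python
def score_spans(predicted, gold):
    """Compare predicted and gold entity spans, document by document.

    Both arguments are lists with one entry per document, each entry being
    a list of (start_char, end_char, label) tuples. A prediction counts as
    correct only if start, end and label all match a gold span.
    """
    tp = fp = fn = 0
    for pred_spans, gold_spans in zip(predicted, gold):
        pred_set, gold_set = set(pred_spans), set(gold_spans)
        tp += len(pred_set & gold_set)  # found and correct
        fp += len(pred_set - gold_set)  # found but not in the gold data
        fn += len(gold_set - pred_set)  # in the gold data but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Made-up spans for two listings, for illustration only.
predicted = [[(13, 23, "DEVELOPER")], [(0, 12, "ROAD"), (20, 25, "ROAD")]]
gold = [[(13, 23, "DEVELOPER"), (29, 41, "ROAD")], [(0, 12, "ROAD")]]
print(score_spans(predicted, gold))
```

Running the same scoring on a trained model's output later makes it easy to see whether the model actually improves on the lists.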
Thanks for your kind answer and advice!