I am fairly new to NLP and have found spaCy really helpful.
So, I have this project where I want to train a blank spaCy model to identify custom entities in a string.
The text doesn’t consist of ‘meaningful’ sentences; it’s made up of labels followed by numbers, alphanumeric strings, or values with special characters like hyphens and slashes.
For this, I want to annotate the data with my custom entities using Prodigy and export the annotated data to spaCy in JSON format. How do I achieve this?
Thanks in advance.
Hi! In general, spaCy is optimised around “real” text – e.g. sentences, paragraphs, real words. So you might find that you need to customise some of the tokenization rules to make sure your texts will actually be split into meaningful tokens. If you haven’t done this yet, I’d recommend running spaCy over some of your texts and checking whether the tokens match up with what you’re trying to label. For example, if the tokenizer produces “ABC-123” as one token, but your entity is “123”, you won’t be able to train this effectively.
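To make the alignment problem concrete, here’s a minimal stdlib-only sketch (no spaCy required) of that check – the whitespace tokenizer is just a stand-in for spaCy’s real tokenization rules, and the example text is invented:

```python
import re

def tokens_with_offsets(text):
    # Naive whitespace tokenizer as a stand-in for spaCy's tokenizer.
    # Returns (token, start_char, end_char) triples.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def span_aligns(text, start, end):
    # An entity span can only be labelled (and learned) if its character
    # boundaries fall exactly on token boundaries.
    starts = {s for _, s, _ in tokens_with_offsets(text)}
    ends = {e for _, _, e in tokens_with_offsets(text)}
    return start in starts and end in ends

text = "Part No: ABC-123"
print(span_aligns(text, 9, 16))   # "ABC-123" aligns with a token -> True
print(span_aligns(text, 13, 16))  # "123" alone does not -> False
```

With spaCy itself, you’d do the same thing by iterating over `doc` and comparing `token.idx` / `token.idx + len(token)` against your entity offsets.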
The ner.manual recipe streams in your text and lets you label the entity tokens by hand. That’s often the safest way to go about annotating new entity types from scratch. But it’s not always the most efficient – so if you’re able to express examples of the entities with abstract token patterns (e.g. the token shape or whether it’s a number), you could also experiment with ner.teach with patterns. This will pre-label examples that you can then accept or reject.
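For illustration, patterns are supplied as a JSONL file of token match patterns. A hypothetical sketch for codes like “ABC-123” (the PART label and the assumption that the tokenizer splits this into “ABC”, “-”, “123” are both mine) might look like:

```jsonl
{"label": "PART", "pattern": [{"SHAPE": "XXX"}, {"ORTH": "-"}, {"SHAPE": "ddd"}]}
{"label": "PART", "pattern": [{"SHAPE": "XXXX"}, {"SHAPE": "dddd"}]}
```

A file like this could then be passed to a recipe via --patterns patterns.jsonl.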
I actually just recorded a video the other day that discusses some of the trade-offs and how to decide which annotation mode to use:
You might also find @honnibal’s video on training a new entity type useful:
Finally, if the entities you’re trying to recognise are mostly combinations of letters/numbers etc., it might turn out that a rule-based approach with regular expressions or token patterns will always beat your statistical model in accuracy. So don’t be too disappointed if things don’t work out. But I hope Prodigy can make it easier to experiment with different approaches.
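As a sketch of what that rule-based route could look like (the label/value format and names here are invented for illustration, not taken from your data):

```python
import re

# Hypothetical format: a label, a colon, then a code made of
# uppercase letters, digits, hyphens and slashes, e.g. "Serial: AB-12/C".
CODE = re.compile(r"(?P<label>[A-Za-z ]+):\s*(?P<value>[A-Z0-9][A-Z0-9/-]*)")

text = "Serial: AB-12/C Batch: 774-X"
for m in CODE.finditer(text):
    # m.span("value") gives the character offsets, which is exactly what
    # spaCy-style entity annotations need.
    print(m.group("label").strip(), "->", m.group("value"))
```

If a handful of regexes like this already capture your entities reliably, a statistical NER model may never outperform them on this kind of text.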
Thanks a lot for the quick reply!
Actually, I don’t want to go for pattern matching.
The evaluation of the model with just 450 data points and 50 iterations is as follows:
p_score = 74.8, r_score = 71, f_score = 72.8, which I consider okay-ish. I hope I can improve the scores
with more data, and I want to use Prodigy for annotation if it can give output in a readily usable spaCy JSON format.
Thanks again for the help.
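For reference, the export step asked about above can be sketched with Prodigy’s CLI – the dataset and file names below are placeholders, so check `prodigy --help` for the exact recipe signatures in your version:

```shell
# Convert an annotated dataset to spaCy's training format
prodigy data-to-spacy ./train.json --ner my_dataset

# Or dump the raw annotations as JSONL
prodigy db-out my_dataset > annotations.jsonl
```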