Hi! In general, spaCy is optimised around “real” text – e.g. sentences, paragraphs, real words. So you might find that you need to customise some of the tokenization rules to make sure your texts will actually be split into meaningful tokens. If you haven’t done this yet, I’d recommend running spaCy over some of your texts and checking whether the tokens match up with what you’re trying to label. For example, if the tokenizer produces "ABC-123" as one token, but your entity is "123", you won’t be able to train this effectively.
The ner.manual recipe streams in your text and lets you label the entity tokens by hand. That’s often the safest way to annotate new entity types from scratch, but it’s not always the most efficient. So if you’re able to express examples of the entities with abstract token patterns (e.g. the token shape or whether it’s a number), you could also experiment with ner.teach with patterns. This will pre-label examples that you can then accept or reject.
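Token patterns use the same syntax as spaCy’s `Matcher`, so you can prototype a pattern in spaCy directly before handing it to ner.teach. A minimal sketch – the `SERIAL` label and the example sentence are made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Abstract token pattern: any token that consists of digits.
# In a patterns file, the equivalent entry would look like:
# {"label": "SERIAL", "pattern": [{"IS_DIGIT": true}]}
matcher.add("SERIAL", [[{"IS_DIGIT": True}]])

doc = nlp("The serial number is 123 and the batch is 45.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

Testing patterns like this first helps you catch cases where the tokenizer splits your entities differently than the pattern expects.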
I actually just recorded a video the other day that discusses some of the trade-offs and how to decide which annotation mode to use:
You might also find @honnibal’s video on training a new entity type useful:
Finally, if the entities you’re trying to recognise are mostly combinations of letters and numbers, it might turn out that a rule-based approach with regular expressions or token patterns will always beat your statistical model in accuracy. So don’t be too disappointed if things don’t work out with the model. But I hope Prodigy can make it easier to experiment with the different approaches!
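For instance, a rule-based baseline for ID-like entities can be as small as a single regular expression. The pattern below is a made-up sketch for "ABC-123"-style codes, not something from Prodigy itself – you’d adjust it to whatever structure your entities actually have:

```python
import re

# Hypothetical pattern: 2-5 uppercase letters, a hyphen, then digits.
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d+\b")

def extract_ids(text):
    """Return all ID-like spans with their character offsets."""
    return [(m.group(), m.start(), m.end()) for m in ID_PATTERN.finditer(text)]

print(extract_ids("Parts ABC-123 and XY-4 were replaced."))
```

A baseline like this is also useful as a yardstick: if your trained model can’t beat it, the rules are probably the better production choice.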