Hi, I'm new to Prodigy and spaCy, but I'm a fast learner.
I need to train a model to recognise entities consisting of clusters of abbreviations (with spaces between them). The text isn't in English, so I trained a basic parser/tagger model from the Universal Dependency Treebank for that language. The model has the NER pipeline component, but it's empty.
So I started with manual tagging and annotated some 1,000 paragraphs. I trained a first temp model and did a second round of manual tagging based on it (another 500 paragraphs).
Then I tried the teach recipe, but the second temp model only catches the first abbreviation of the entity cluster (usually there are 2-3 abbreviations within a tagged entity).
In the process of manual tagging, the full entity (the cluster of abbreviations) is coloured, so I'm sure I'm tagging what I want. However, the model only recognises (and auto-tags) up to the first space within the entity cluster. Is there something specific I need to do so that the model recognises the full phrase (the tagged cluster of abbreviations)?
The entity pattern is roughly like this: "aa. NN, aa. NN Aaaaaa", where "a" is a letter and N is a digit. I'm reading the data from a txt file where I've put each sentence on its own line. Is Prodigy trying to split them again, so that the dot is treated as the end of a sentence, and this is where it goes wrong?
Also, I haven't used any word2vec model as a base. If I did, would that change what I'm seeing above?
Hi! Looking at the sentence boundaries is definitely a good idea and could give you clues as to what might be happening. Maybe just process some of your examples with your model and check the doc.ents?
Even if Prodigy isn't separating the document into sentences, the underlying Doc objects you're predicting over may still have sentence boundaries set, and those can affect the NER results. The entity recognizer considers spans across sentence boundaries invalid, so it won't predict them. This is typically very helpful and can improve your results, but of course it's problematic if the sentences split actual entities.
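For example, something like this would show you whether the parser's boundaries cut through your entities (the model path and sample text are just placeholders):

```python
import spacy

nlp = spacy.load("/path/to/your/model")  # placeholder path
doc = nlp("aa. 12, aa. 34 Aaaaaa and some more text.")

# where does the parser put the sentence boundaries?
for sent in doc.sents:
    print("SENT:", sent.text)

# which entity spans does the model actually predict?
for ent in doc.ents:
    print("ENT:", ent.text, ent.label_)
```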
Thanks, Ines! In that case, if I remove the punctuation from the abbreviations (in the source file), would that solve my problem and improve recognition of those abbreviations? I'm new to NLP, so I'm still learning the concepts.
In theory, maybe. But it would mean that you'd have to perform the same preprocessing at runtime, because otherwise your model, trained on entities without punctuation, may produce bad results if it does come across punctuation at runtime. So that's typically not what you want.
If it turns out that the default sentence segmentation (performed by the parser) splits your entities or produces sentences that aren't ideal, one solution could be to add a simpler rule-based segmentation strategy first in the pipeline: Linguistic Features · spaCy Usage Documentation
That's interesting. The default sentence segmentation is definitely not what I want, since it wasn't trained for my language (to ignore abbreviations). Could I use regex in this custom segmentation, since my entities follow certain patterns?
EDIT:
Looking at the example, maybe something like this (I'll fine-tune the regex)?
```python
import re

def set_custom_boundaries(doc):
    # if a token looks like an abbreviation ("a." or "aa."),
    # don't start a new sentence right after it
    for token in doc[:-1]:
        if re.match(r"[a-zA-Z]+\.$", token.text):
            doc[token.i + 1].is_sent_start = False
    return doc
```
EDIT 2:
Ok, I've done my custom segmentation function; now I just have to add it to the model's pipeline. Let's see if it works with the model. Thanks for the clues!
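For reference, here's how I'm adding and testing it (assuming I can pass the function directly to add_pipe, as in the spaCy v2 docs; the model path is a placeholder):

```python
import spacy

nlp = spacy.load("/path/to/my/model")  # placeholder path
# the component has to run before the parser, so the parser
# respects the pre-set is_sent_start values
nlp.add_pipe(set_custom_boundaries, before="parser")

doc = nlp("aa. 12, aa. 34 Aaaaaa. Another sentence.")
print([sent.text for sent in doc.sents])
```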
Hi @ines, so the learning curve is quite steep here, and I'd appreciate some help. I've got the custom function (the component) working as I want outside Prodigy, but I'm having trouble understanding how to implement the custom component so it's called within Prodigy. As per your explanation here, I should create a model package, but I'm not sure about the following:
Once I create the component .py file in the model package's root dir, do I put my custom function's logic under the __call__ method?
Again, in your example in the link above, there are __init__, __call__ and from_disk methods. Is the from_disk method mandatory for the component to work?
Ok, I think I have everything set up, but how do I find out if my custom component is actually running when using the ner.manual recipe, for example? If I set the log level to "verbose", it doesn't show anything about pipelines.
Also, how do you set up the order of pipeline execution when using Prodigy? I've found info on setting the order in a regular spaCy script, but nothing for running Prodigy recipes. As you mentioned above, I have to execute my component before the tagger/parser, so that my rules affect the sentence boundary segmentation in the Prodigy UI.
That said, there are two different workflows here: annotation and training. During annotation with ner.manual, the sentence boundaries matter less because you're not actually predicting anything, so whether or not your component runs won't really make a difference. Where it does matter is in recipes that predict named entities (like ner.teach or ner.correct), and during training.
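For example, in something like `prodigy ner.teach my_dataset /path/to/your/model ./sentences.txt --label MY_LABEL` (dataset name, model path and label are placeholders), the suggestions you see come from the model's pipeline, custom component included.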
No, from_disk (together with to_disk) can be used to implement custom serialization (like, if your component needs to save out data). But it's not required.
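So a minimal component only needs __init__ and __call__. Just as a sketch of the structure (the name and the boundary logic are from your function above):

```python
import re

class CustomBoundaries:
    name = "custom_boundaries"  # this is what shows up in nlp.pipe_names

    def __init__(self, nlp):
        pass  # nothing to initialise for a purely rule-based component

    def __call__(self, doc):
        # same logic as your set_custom_boundaries function
        for token in doc[:-1]:
            if re.match(r"[a-zA-Z]+\.$", token.text):
                doc[token.i + 1].is_sent_start = False
        return doc
```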
The order of pipeline components is defined in the model's meta.json, which gets saved automatically based on nlp.pipeline / nlp.pipe_names when you save out the model. Prodigy will load your model package by calling spacy.load, so it will use whatever is available in the model's pipeline.
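So to verify that your component is actually in the pipeline (this also answers your earlier question about ner.manual), you could load the package the same way Prodigy does and check the names (the package name is a placeholder):

```python
import spacy

# load the model package just like Prodigy will
nlp = spacy.load("my_model_package")  # placeholder name
print(nlp.pipe_names)
# if everything is set up correctly, your component's name should appear
# before 'parser', e.g. ['tagger', 'custom_boundaries', 'parser', 'ner']
```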