Does anyone have suggestions for using a pattern file to help train on things that do not have an entity label? I’m training an NER task and find I’m often prompted to classify things like spaces, numbers or punctuation that are never entities.
Hi! The pattern files are currently only intended to help with generating positive candidates, but you could always add a filter function that only yields examples whose spans do not match your exclusion pattern. Here’s a simple example that shows the idea (but of course, you can use more sophisticated logic here):
def filter_tasks(stream):
    for eg in stream:
        # Check the highlighted span in the example and only
        # send it out if it doesn't match your exclusion list
        span = eg["spans"][0]
        if span["text"] not in ("\n", ".", ","):  # etc.
            yield eg
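As one possible version of the "more sophisticated logic", here is a minimal sketch that swaps the fixed tuple for a regular expression. The pattern and the `is_excluded` helper are illustrative assumptions, not part of Prodigy; it also assumes each span carries a "text" key, as in the example above, and passes through tasks that have no spans at all.

```python
import re

# Hypothetical exclusion pattern: text made up only of whitespace,
# only of digits and number separators, or only of punctuation.
EXCLUDE = re.compile(r"\s+|[\d.,]+|[^\w\s]+")


def is_excluded(text):
    # Empty strings are excluded too; fullmatch requires the whole
    # span text to match, so "U.S." or "Route 66" are kept.
    return text == "" or EXCLUDE.fullmatch(text) is not None


def filter_tasks(stream):
    for eg in stream:
        spans = eg.get("spans") or []
        # Drop the task only if its first (highlighted) span is excluded;
        # tasks without spans are passed through unchanged.
        if spans and is_excluded(spans[0]["text"]):
            continue
        yield eg
```

You would wrap the stream with it in the same way as the simple version, for example `stream = filter_tasks(stream)` inside a custom recipe.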
Btw, you might not always want to use a filter like this, especially not during development. If you’re using a recipe like ner.teach, Prodigy will stream in suggestions from the model with their scores assigned (see the meta section in the bottom right corner). So it can sometimes be very interesting to see what the model is predicting and what scores it’s assigning. For example, you might see that the model is very uncertain about some type of non-entity spans, which could indicate a problem with the data or pre-trained weights. If you filtered those examples out, you wouldn’t ever get to see those suggestions and scores, and you’d never be able to give the model feedback on those.
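If you want to keep those suggestions during development but still see how often they come up, one option is a pass-through wrapper that records them instead of dropping them. This is only a sketch: `track_excluded` and the `seen` list are hypothetical names, and it assumes the score is stored under eg["meta"]["score"], which you should check against the tasks your recipe actually produces.

```python
def track_excluded(stream, seen, excluded=("\n", ".", ",")):
    """Yield every task unchanged, recording spans a filter would drop."""
    for eg in stream:
        spans = eg.get("spans") or []
        # Assumption: the model's score lives in the task's meta dict.
        score = (eg.get("meta") or {}).get("score")
        if spans and spans[0]["text"] in excluded:
            # Keep the span text and score so they can be inspected later
            seen.append((spans[0]["text"], score))
        yield eg
```

After an annotation session, `seen` holds the text and score of every suggestion that the exclusion list would have removed, which makes it easier to decide whether filtering is safe.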