Help with messy data

Hi - I’m working with a set of JIRA tickets and trying to do NER against them. Specifically, I would like to ‘upgrade’ the PERSON, PRODUCT, and ORG tags for our data. This seems to be Really Hard because we have a BadHabit of Capitalizing Stuff, and also a lot of VariableNames / thread handles / other ghastly stuff in there. Also a lot of random whitespace and line returns. I have Prodigy and have been using the ner.teach recipe with the data, but I’m not really able to get much better than 50% correct. Is there any advice that people can offer?

It sounds like you might be better off training from scratch, instead of starting from the pre-trained model. To do this, you’d run the ner.manual recipe and just click and drag. I would probably do one label at a time, since it saves you from having to select the labels manually, and it’s much faster (and more accurate) to hunt for only one type of entity at a time.
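
For example, a one-label pass could look roughly like this – the dataset name and input file are just placeholders, and the exact arguments can vary a little between Prodigy versions:

```
# One label per pass – repeat with --label PRODUCT and --label ORG
prodigy ner.manual jira_ner en_core_web_sm ./jira_tickets.jsonl --label PERSON
```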

This lets you build gold-standard data, which is good for evaluation and also lets you train with the --no-missing flag. If you use the ner.teach recipe, you produce data that doesn’t have complete annotations – the model still has to guess the correct analysis, based on the hints you’re giving it. If the model’s initial accuracy is too low, this doesn’t work properly, so it ends up better to train from scratch.
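
Once you have a complete gold-standard set, training could then look something like this sketch (the dataset and output names are placeholders, and the exact flags may differ depending on your Prodigy version – swap in a blank or vectors-only base model if you want to avoid the pre-trained NER weights entirely):

```
# Train on the gold-standard annotations, treating unannotated tokens as not-an-entity
prodigy ner.batch-train jira_ner en_core_web_sm --output ./jira_model --no-missing
```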

Thanks for your response! I tried training one label at a time and got 80% accuracy in training for each. I do have a further question though - when I tried to batch train on both labels at once, accuracy went right back to about 50%. Am I doing it wrong? Should I be training two separate models?

Thanks!

Did you annotate the same texts? If you have only one label annotated per text, then you won’t be able to use the --no-missing flag, as the model won’t be able to assume that the absence of an annotation means the absence of an entity.

Yes, I’m moving through the same text file for both labels. The ORG label is much less common though, so I had to bootstrap it using a patterns file, so they’re not necessarily looking at the same examples. That said, a large number of the sentences I’m annotating (and our actual data) contain both labels (e.g. look at something in PRODUCT for this ORG). If I’m training a single label, would it be better to skip these sentences to make sure that I could train with the --no-missing flag? Maybe then I would take care of the multiple annotations in a make-gold recipe against the same texts?

Also, is there somewhere out there I can find a good sense of the workflow for this kind of project? I’m eagerly awaiting your book, but I know it’s not out yet!

Maybe something like this "silver to gold" workflow could be useful for your situation? See here:

The idea here is to stream in annotations from an existing dataset (created by accepting/rejecting) and merge them all to find the best possible analysis of the parse, given the constraints defined in the existing annotations. You can then correct the combined annotations manually. Ideally, you'll only have to fill in some gaps here and there to turn the "silver standard" annotations into a combined gold standard set.
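
In practice, that workflow is usually run as a custom recipe loaded from a file. Just as a hypothetical sketch of the invocation (the recipe name, dataset names, and file below are made up – the actual arguments depend on how the recipe in the linked post is defined):

```
# Load a custom recipe with -F and review the merged "silver" annotations manually
prodigy ner.silver-to-gold jira_gold jira_silver en_core_web_sm -F silver_to_gold.py
```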

Thanks again! This is awesome. I was able to get up to 71% accuracy this way. One final set of questions (I hope) about creating the gold annotations:

  • if I use an out-of-the-box model (e.g. en_core_web_sm), it suggests labels other than the ones I’m interested in (e.g. DATE). Should I remove these labels, or will leaving them in there cause no harm?
  • if I get a ‘bad’ / incorrect label for one of the entities I’m interested in, is it better to mark it as a failure, or correct and save as an ‘accept’?
  • if no entities are detected (because it was a ‘skip’ in the evaluation test), should I skip it here as well?

That's nice to hear – definitely sounds promising :+1:

If you can, leaving them in there is definitely good. One thing to keep in mind about the pre-trained models is that their weights are based on the presence of all labels in the original training data. So if you're trying to add new labels that conflict with the existing ones (e.g. TIME_PERIOD vs. DATE), or trying to teach the model a completely different analysis all of a sudden, this can potentially cause problems and will require a lot more training data.

If you find that you actually only really care about one or two entity types in the original model, it might make more sense to start from scratch, instead of "fighting" the existing predictions. You can still take advantage of the pre-trained model to bootstrap your new annotations – for example, using a workflow similar to the ner.make-gold recipe that pre-labels your examples using the model's predictions. (Even if your model only gets like 60% correct, that still means you only have to put in 40% of the work :wink:)
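
As a rough sketch of that kind of bootstrapping (dataset and file names are placeholders):

```
# Pre-label the stream with the model's predictions and correct them by hand
prodigy ner.make-gold jira_gold en_core_web_sm ./jira_tickets.jsonl --label PERSON,PRODUCT,ORG
```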

If you're creating gold-standard data manually, you probably want to correct it so it'll be included as a correct training example in your data. In this scenario, examples you reject would be examples that are not easily fixable – for example, if the tokenization is bad.

If you're annotating with binary feedback, then yes, an incorrect label should always be rejected. The same goes for "almost correct" suggestions.

Examples of texts with no entities are also super valuable training data. Your model will likely perform much better if it gets to see examples of what an entity looks like, as well as examples of what's not an entity. So if you come across a text without entities, you should always mark it as "accept".

(It can still make sense to skip examples if they're not representative of your data at all or otherwise unsuitable – for example, broken markup or other preprocessing artifacts. If something is marked as "answer": "ignore", it will always be excluded from training by default.)

@ines, thanks very much for your help - this has certainly been educational! It seems like I have a lot more to learn, but this is a great start.
