Help with messy data

Hi - I’m working with a set of JIRA tickets and trying to do NER against them. Specifically, I would like to ‘upgrade’ the PERSON, PRODUCT, and ORG tags for our data. This seems to be Really Hard because we have a BadHabit of Capitalizing Stuff, and also a lot of VariableNames / thread handles / other ghastly stuff in there. Also a lot of random whitespace and line returns. I have Prodigy and have been using the ner.teach recipe with the data, but I’m not really able to get much better than 50% correct. Is there any advice that people can offer?

It sounds like you might be better off training from scratch, instead of starting from the pre-trained model. To do this, you’d run the ner.manual recipe and just click and drag. I would probably do one label at a time, since it saves you from having to select the labels manually, and it’s much faster (and more accurate) to hunt for only one type of entity at a time.
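
For example, a one-label pass could look roughly like this – the dataset name and input file are just placeholders, and the exact arguments can vary a little between Prodigy versions:

```
# One label per pass – repeat with --label PRODUCT and --label ORG
prodigy ner.manual jira_ner en_core_web_sm ./jira_tickets.jsonl --label PERSON
```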

This lets you build gold-standard data, which is good for evaluation and also lets you train with the --no-missing flag. If you use the ner.teach recipe, you produce data that doesn’t have complete annotations – the model still has to guess the correct analysis, based on the hints you’re giving it. If the model’s initial accuracy is too low, this doesn’t work properly, so it ends up better to train from scratch.
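
Once you have a complete gold-standard set, training could then look something like this sketch (the dataset and output names are placeholders, and the exact flags may differ depending on your Prodigy version – swap in a blank or vectors-only base model if you want to avoid the pre-trained NER weights entirely):

```
# Train on the gold-standard annotations, treating unannotated tokens as not-an-entity
prodigy ner.batch-train jira_ner en_core_web_sm --output ./jira_model --no-missing
```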

Thanks for your response! I tried training one label at a time and got 80% accuracy in training for each. I do have a further question though - when I tried to batch train on both labels at once, accuracy went right back to about 50%. Am I doing it wrong? Should I be training two separate models?

Thanks!

Did you annotate the same texts? If you have only one label annotated per text, then you won’t be able to use the --no-missing flag, as the model won’t be able to assume that the absence of an annotation means the absence of an entity.

Yes, I’m moving through the same text file for both labels. The ORG label is much less common though, so I had to bootstrap it using a patterns file, so they’re not necessarily looking at the same examples. That said, a large number of the sentences I’m annotating (and our actual data) contain both labels (e.g. look at something in PRODUCT for this ORG). If I’m training a single label, would it be better to skip these sentences to make sure that I could train with the --no-missing flag? Maybe then I would take care of the multiple annotations in a make-gold recipe against the same texts?

Also, is there somewhere out there I can find a good sense of the workflow for this kind of project? I’m eagerly awaiting your book, but I know it’s not out yet!

Maybe something like this "silver to gold" workflow could be useful for your situation? See here:

The idea here is to stream in annotations from an existing dataset (created by accepting/rejecting) and merge them all to find the best possible analysis of the parse, given the constraints defined in the existing annotations. You can then correct the combined annotations manually. Ideally, you'll only have to fill in some gaps here and there to turn the "silver standard" annotations into a combined gold standard set.
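
In practice, that workflow is usually run as a custom recipe loaded from a file. Just as a hypothetical sketch of the invocation (the recipe name, dataset names, and file below are made up – the actual arguments depend on how the recipe in the linked post is defined):

```
# Load a custom recipe with -F and review the merged "silver" annotations manually
prodigy ner.silver-to-gold jira_gold jira_silver en_core_web_sm -F silver_to_gold.py
```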

Thanks again! This is awesome. I was able to get up to 71% accuracy this way. One final set of questions (I hope) about creating the gold annotations:

  • if I use an out-of-the-box model (e.g. en_core_web_sm), it suggests labels other than the ones I’m interested in (e.g. DATE). Should I remove these labels, or will leaving them in there cause no harm?
  • if I get a ‘bad’ / incorrect label for one of the entities I’m interested in, is it better to mark it as a failure, or correct and save as an ‘accept’?
  • if no entities are detected (because it was a ‘skip’ in the evaluation test), should I skip it here as well?

That's nice to hear – definitely sounds promising :+1:

If you can, leaving them in there is definitely good. One thing to keep in mind about the pre-trained models is that their weights are based on the presence of all labels in the original training data. So if you're trying to add new labels that conflict with the existing ones (e.g. TIME_PERIOD vs. DATE), or trying to teach the model a completely different analysis all of a sudden, this can potentially cause problems and will require a lot more training data.

If you find that you actually only really care about one or two entity types in the original model, it might make more sense to start from scratch, instead of "fighting" the existing predictions. You can still take advantage of the pre-trained model to bootstrap your new annotations – for example, using a workflow similar to the ner.make-gold recipe that pre-labels your examples using the model's predictions. (Even if your model only gets like 60% correct, that still means you only have to put in 40% of the work :wink:)
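
As a rough sketch of that kind of bootstrapping (dataset and file names are placeholders):

```
# Pre-label the stream with the model's predictions and correct them by hand
prodigy ner.make-gold jira_gold en_core_web_sm ./jira_tickets.jsonl --label PERSON,PRODUCT,ORG
```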

If you're creating gold-standard data manually, you probably want to correct it so it'll be included as a correct training example in your data. In this scenario, examples you reject would be examples that are not easily fixable – for example, if the tokenization is bad.

If you're annotating with binary feedback, then yes, an incorrect label should always be rejected. The same goes for "almost correct" suggestions.

Examples of texts with no entities are also super valuable training data. Your model will likely perform much better if it gets to see examples of what an entity looks like, as well as examples of what's not an entity. So if you come across a text without entities, you should always mark it as "accept".

(It can still make sense to skip examples if they're not representative of your data at all or otherwise unsuitable – for example, broken markup or other preprocessing artifacts. If something is marked as "answer": "ignore", it will always be excluded from training by default.)

@ines, thanks very much for your help - this has certainly been educational! It seems like I have a lot more to learn, but this is a great start.
