NER training best practices

Hi! Thank you for making such a great and useful product.

I’m training my first NER and have many questions about what the best practices are.

  1. What response should I give Prodigy if it marks only part of an entity correctly? For example, if the text is “I live in 123 High St., Harrisburg, PA” and it only marks High St., Harrisburg, PA. (I’d want it to mark either 123 High St., Harrisburg, PA in full, or at least 123 High St. and Harrisburg, PA separately, but I wouldn’t want to lose information.)

  2. What’s the best practice to annotate entities that can be divided? Following the above example, I understand that I could either mark the whole thing (123 High St., Harrisburg, PA) as one entity or each entity separately (123 High St./Harrisburg/PA), but what would give me the most accurate results?

  3. How much would the NER improve if I used vectors trained on my own corpus? (my domain would be resumes in Spanish)

  4. I have more entities than the examples because of the domain I’m working on. Would I get better results if I trained several NER models, each for a subdomain of my problem, instead of training all my entities with one classifier?

  5. (and last question, sorry for the long post!) What do you think is an appropriate number of training examples for the training set before I load it into Prodigy?

Thank you and regards.

Thanks a lot :blush: Answers below!

Incorrect boundaries are wrong and should be rejected. The annotation decisions you make will be used as the constraints for the model, so if you want to tell Prodigy to “try again” with different boundaries, it’s important that you reject the incorrect ones. Prodigy will often suggest several different analyses of the same example as separate tasks, so it’s easy to pick out the correct or best one.

After you’re done annotating, you can always go back, extract the examples you’ve rejected from your dataset and reannotate them. Just run db-out with --answer reject and you’ll get a JSONL file with all rejected examples. You can then load the file with the ner.mark recipe (see here for details) and highlight the exact entity spans. You could also look into the ner.make-gold recipe (see here), which allows you to make several passes over the data and use the constraints defined in the annotations to narrow in on the gold-standard parse.
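
For illustration, here is a minimal Python sketch of the same filtering idea applied to an exported JSONL file. It assumes each exported line is a JSON object with an "answer" field, as Prodigy tasks have; the helper name `filter_by_answer` and the sample records are made up for this example.

```python
import json

def filter_by_answer(jsonl_lines, answer="reject"):
    """Yield parsed tasks whose "answer" field equals the given value."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        if task.get("answer") == answer:
            yield task

# Hypothetical export: two annotated tasks for the same text.
exported = [
    json.dumps({
        "text": "I live in 123 High St., Harrisburg, PA",
        "spans": [{"start": 14, "end": 38, "label": "ADDRESS"}],
        "answer": "reject",  # boundaries miss "123"
    }),
    json.dumps({
        "text": "I live in 123 High St., Harrisburg, PA",
        "spans": [{"start": 10, "end": 38, "label": "ADDRESS"}],
        "answer": "accept",
    }),
]

rejected = list(filter_by_answer(exported))
for task in rejected:
    span = task["spans"][0]
    print(task["text"][span["start"]:span["end"]])
```

The rejected tasks keep their original text, so they can be written back out as JSONL and reannotated with exact spans.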

This depends on what you’re trying to achieve, what you’re planning to use the model for later on and what the runtime input will look like. In general, it’ll probably be faster and more efficient to teach your model shorter entities with Prodigy, since this is closer to what the base model will originally suggest. So if your main task is to determine whether a document contains an address, working with shorter entities should be fine. It’ll also give your model more chances to get it right and at least recognise one part of the address.

However, if your application needs to extract full addresses and look them up in a database (or something like that), you might want the model to be able to label the whole thing correctly. This is also possible, but might need a little more annotation work.

To make sure you start off with enough examples of different addresses, you could create a match patterns file to help pre-select them. Assuming your addresses always follow consistent schemes, you could create a bunch of different variations of patterns like this, to capture the most common occurrences:

    {
        "label": "ADDRESS",
        "pattern": [
            {"is_digit": true},  // 123
            {"is_alpha": true},  // High
            {"lower": "st."},    // St. (a tokenizer exception in spaCy)
            // the part below is only relevant for full addresses
            {"orth": ","},       // ,
            {"is_alpha": true},  // Harrisburg
            {"orth": ","},       // ,
            {"shape": "XX"}      // PA
        ]
    }
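
Since the comments in the pattern above are only there for explanation and a patterns file is plain JSONL (one entry per line), one way to generate the variations is from a short script. This is just a sketch: the names `street`, `city_state` and `to_jsonl` are made up for this example, and it assumes the comma between city and state is tokenized as its own token.

```python
import json

# Token patterns using the same attributes as the example above.
street = [
    {"is_digit": True},   # 123
    {"is_alpha": True},   # High
    {"lower": "st."},     # St.
]
city_state = [
    {"orth": ","},        # ,
    {"is_alpha": True},   # Harrisburg
    {"orth": ","},        # , (assumed to be a separate token)
    {"shape": "XX"},      # PA
]

def to_jsonl(label, patterns):
    """Serialize each token pattern as one JSON object per line."""
    return "\n".join(
        json.dumps({"label": label, "pattern": pattern}) for pattern in patterns
    )

# One short variant (street only) and one full-address variant.
patterns_jsonl = to_jsonl("ADDRESS", [street, street + city_state])
print(patterns_jsonl)
```

Writing `patterns_jsonl` to a file gives you something you can extend with more variations (e.g. "Ave." or "Rd.") as you find them in your data.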

I’ve also written some more about best practices and dealing with ambiguous decisions in this thread. When it comes to best practices, I’ve found that a good question to ask yourself is “If my model produced this result, would I be happy about it and would it benefit the rest of my application?” In the end, this is what really matters.

This is hard to guess – depending on your data, you might see a 10-30% error reduction. So if it’s not too much of a hassle to train vectors, definitely try it out and see how you go. The vectors will also be very useful if you want to bootstrap domain-specific terminology lists with terms.teach, which you can then convert to NER match patterns using

If your entity types don’t overlap, you can train each type as a separate label as part of the same model. This could also give you a nice boost in accuracy, because the entity recognizer can take advantage of the shared representations and the mutually exclusive categories. For example, if your entity types are ADDRESS and PHONE_NUMBER, and “12345 High St.” is recognised as an ADDRESS, the model also knows that “12345” can’t be a PHONE_NUMBER.

You could also try making your entity types more specific – for example, ADDRESS, PHONE_NUMBER and EMAIL instead of just CONTACT_DETAILS. Depending on your data, this could help the model learn, as it can draw more conclusions from the surrounding context of the individual types. But this all comes down to experimenting, which is something that Prodigy should hopefully make much easier :blush:

To make the most of the active learning workflow, you should probably have at least a few thousand sentences available – ideally more. Keep in mind that when using ner.teach, Prodigy will only ask you about the examples the model is most uncertain about. This means it’ll skip examples with very confident predictions so you can focus on the examples that will produce a more relevant gradient for training. As the model is updated, the example selection will adjust as well. So having more data is always good.
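
To give a rough intuition for the selection, here is a simplified sketch of uncertainty-based sampling. It is not how Prodigy implements it internally; the function `most_uncertain`, the scores and the texts are all invented for illustration, and it simply assumes each example comes with a model score between 0 and 1.

```python
def most_uncertain(scored_examples, n=2):
    """Return the n (score, text) pairs whose score is closest to 0.5."""
    return sorted(scored_examples, key=lambda pair: abs(pair[0] - 0.5))[:n]

# Hypothetical model scores for candidate entity analyses.
scored = [
    (0.98, "Harrisburg, PA"),       # very confident: likely skipped
    (0.52, "High St."),             # close to 0.5: worth asking about
    (0.03, "I live"),               # very confident it's not an entity
    (0.41, "123 High St."),         # fairly uncertain
]

picked = most_uncertain(scored, n=2)
for score, text in picked:
    print(score, text)
```

With only a handful of sentences there is little to choose from, which is why a larger pool of raw text makes this kind of selection more useful.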

Data streams are implemented as generators, so if you load in files that can be read in line-by-line (e.g. jsonl or txt), you should be able to load in pretty large corpora. Prodigy will simply batch them up and forward them to the app as they come in, and you won’t have to wait for the entire file to process.
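
As a generic sketch of the idea (not Prodigy's actual loader code), a line-by-line generator plus a small batching helper could look like this. The names `stream_jsonl` and `batches` are made up, and an in-memory `io.StringIO` stands in for a large file on disk.

```python
import io
import json
from itertools import islice

def stream_jsonl(file_obj):
    """Lazily yield one parsed record per non-empty line."""
    for line in file_obj:
        line = line.strip()
        if line:
            yield json.loads(line)

def batches(stream, size):
    """Group a (possibly very long) stream into lists of at most `size` items."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return
        yield batch

# Stand-in for a big JSONL file: five one-line records.
fake_file = io.StringIO(
    "\n".join(json.dumps({"text": f"sentence {i}"}) for i in range(5))
)

batch_sizes = [len(batch) for batch in batches(stream_jsonl(fake_file), size=2)]
print(batch_sizes)
```

Because nothing is read until a batch is requested, memory use stays roughly constant no matter how large the file is.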

Thank you so much, your answers were very detailed and clear, they’ve helped me out a lot and were exactly what I was looking for.