NER Complex Entity Web Interface Suggestions

Hi. Thanks for the great work, guys.

I am trying to train an NER model from scratch, but I am stuck. I want to identify complex entities, so I thought I could annotate manually for some time and then use these manual annotations as a kind of seed, similar to the "pattern" list.

How can I stop annotating manually and then use my initial annotations as a seed, so that the web interface shows me suggestions I can accept or reject?

If this is doable, is there some kind of rule of thumb for how much manual annotation I should do before training with the model's suggestions?

Thanks.

Best regards,
Fabio.

Thanks!

Just to confirm: You've annotated a bunch of entities in context and only want to extract the entities as patterns? That's an interesting approach I haven't thought of before, but definitely possible. I can also see how it makes sense in some cases – e.g. if you want to see the entities in different contexts etc.

The data produced by Prodigy follows a simple JSON format – so you can always use Python (or any other language, really), to convert it and extract whatever you need from it. That's pretty important to Prodigy's philosophy – we don't want to lock you in, and you should always have access to your collected data in a format that's easy to work with. To see the format of the annotations, you can use the db-out command:

prodigy db-out your_manual_ner_dataset | less
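Each line in the output is one JSON record. As a rough sketch (real records also include extra metadata such as hashes, and fields can vary slightly by version), a manually annotated example looks something like:

```
{
  "text": "The tree kangaroo lives in rainforests.",
  "spans": [{"start": 4, "end": 17, "label": "ANIMAL"}],
  "answer": "accept"
}
```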

Each manually annotated entity is included as a "span" – so you can extract the entity texts and turn them into patterns like this:

patterns = []

# examples = your exported annotations – e.g. parsed from the db-out
# JSONL, or loaded via the database API:
# from prodigy.components.db import connect
# examples = connect().get_dataset('your_manual_ner_dataset')
for eg in examples:  # iterate over the examples
    if eg['answer'] == 'accept':  # you only want to use accepted answers
        spans = eg.get('spans', [])
        for span in spans:
            start = span['start']  # start offset in original text
            end = span['end']  # end offset in original text
            label = span['label']  # assigned label
            span_text = eg['text'][start:end]  # slice of text
            patterns.append({'label': label, 'pattern': span_text})

The above code will produce pattern entries that look like this:

{"label": "ANIMAL", "pattern": "tree kangaroo"}

If you want to produce token patterns (like [{"lower": "tree"}, {"lower": "kangaroo"}]), you probably want to tokenize the span_text with spaCy's tokenizer (the same model you used during manual annotation) and then create one token pattern for each tokenized string. If your entities are very simple and don't contain punctuation, you could also just split on whitespace.
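As a minimal sketch of the whitespace-split variant (the helper name make_token_pattern is just for illustration – in practice you'd swap in the tokenizer of the spaCy model you used during manual annotation, so the tokens line up):

```python
def make_token_pattern(span_text, tokenize=str.split):
    # tokenize ideally mirrors the tokenizer of the spaCy model used
    # during manual annotation; str.split is a whitespace fallback that
    # only works for simple entities without punctuation
    return [{"lower": token.lower()} for token in tokenize(span_text)]

print(make_token_pattern("tree kangaroo"))
# [{'lower': 'tree'}, {'lower': 'kangaroo'}]
```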

You might also want to try pre-training your model on the already collected annotations, and then use this updated version as the base model for ner.teach – plus the patterns generated from your annotations. This means that the model will already start off with some knowledge of your entity type, and you'll have the terminology list to help you find mentions in different contexts.

This is difficult to answer, because it really depends on your data :wink: But as a rule of thumb, a few thousand annotations are usually a good start – sometimes more, though, depending on the complexity of the categories you're annotating. This is also the reason we've tried to offer different approaches and interfaces in Prodigy to help with this, which you can mix and match to see what works best. (For example, the patterns, terminology list from word vectors, fully manual annotation from scratch or ner.make-gold to correct the model's predictions etc.)

Thanks for your prompt answer!

I think this is the way to go, but I am still a bit confused about how to accomplish it. My task is to identify certain groups of sentences in my corpus that share a specific pattern or characteristic, so that later I can identify this pattern in a new paragraph. So I manually annotate every occurrence of this complex pattern in every paragraph Prodigy shows me, using ner.manual.

I am a bit confused about the next step. Should I save this manual model and use it as the input model for ner.teach? Is that it? Did I miss something?

Oh, it is important to say that every paragraph in my corpus contains the pattern somewhere. My task is to identify where the pattern is inside the paragraph so I can extract and separate it from the rest of the paragraph. I could use a classification model over sentences; the problem is that my pattern is usually a group of a few sentences inside the whole paragraph.

Again, thank you very much.

Best regards,
Fabio.

The NER model isn’t always good at recognising long, complex patterns. Instead I would focus on building models that identify much smaller things. For instance, verbs of a particular type, or noun phrases of a particular type, etc. Once you have those recognisers, you can build a rule-based system to extract the larger spans of text you’re interested in.
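For example, once the smaller pieces are recognised, the rule-based layer can be as simple as merging nearby spans into one larger extraction. A minimal pure-Python sketch (the character-span format and max_gap threshold are assumptions for illustration, not a Prodigy or spaCy API):

```python
def merge_spans(spans, max_gap=2):
    """Merge (start, end) character spans whose gap is at most max_gap,
    so several small recognised pieces become one larger extraction."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            # close enough: extend the previous span instead of starting a new one
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

print(merge_spans([(0, 10), (11, 20), (40, 50)]))
# [(0, 20), (40, 50)]
```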

If you start off with ner.manual, you can then train a model with ner.batch-train, pointing it at the dataset ID you created. If you use the -o flag, you’ll get a spaCy model as output. You can then use this in ner.teach to start getting suggestions for your classes.

Once you have your full system, you could write a custom recipe that marks the spans you’re interested in. If you use the ner_manual view ID, you’ll be able to correct the suggestions, letting you build an evaluation set efficiently.


Ok. Thanks!