Get the start and end position of found named entities

Hi there

I am very new to ML and to spaCy in general. I am trying to extract the named entities from an input text.

This is my method:

from collections import defaultdict
from pprint import pprint

import spacy


def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    # Threshold for the confidence scores.
    threshold = 0.2
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score
   
    # Create a dict to store the output.
    ners = defaultdict(list)
    ners['text'] = str(sentence)

    for key in entity_scores:
        start, end, label = key
        score = entity_scores[key]
        if (score > threshold):
            ners['extractions'].append({
                "label": str(label),
                "text": str(doc[start:end]),
                "confidence": round(score, 2)
            })

    pprint(ners)

The above method works fine, and will print something like:

{'extractions': [{'confidence': 1.0,
                  'label': 'PERSON',
                  'text': 'Oliver'}],
 'text': 'Hi my name is Oliver!'}

So far so good. Now I am trying to get the character position of each named entity that was found, in this case "Oliver".

Looking at the documentation, there are ent.start_char and ent.end_char attributes, but if I use them like this:

"start_position": doc.start_char,
"end_position": doc.end_char

I get the following error:

AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'start_char'

Can someone guide me in the right direction?
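
For context on the error: in spaCy, start_char and end_char are attributes of a Span (each item of doc.ents, or a slice such as doc[start:end]), not of the Doc itself, and they count characters rather than tokens. The sketch below mimics that token-to-character mapping in plain Python so it runs without spaCy. The helper names token_offsets and span_char_range and the hand-written token list are assumptions made for illustration, not part of the spaCy API.

```python
# Hand-written tokenization of the example sentence (an assumption for
# illustration; spaCy computes these offsets itself).
sentence = "Hi my name is Oliver!"
tokens = ["Hi", "my", "name", "is", "Oliver", "!"]


def token_offsets(text, tokens):
    """Return the character offset at which each token starts in text."""
    offsets = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated words
        # resolve to the correct occurrence.
        cursor = text.index(token, cursor)
        offsets.append(cursor)
        cursor += len(token)
    return offsets


def span_char_range(text, tokens, start, end):
    """Map a token span [start, end) to a character range [start_char, end_char)."""
    offsets = token_offsets(text, tokens)
    start_char = offsets[start]
    end_char = offsets[end - 1] + len(tokens[end - 1])
    return start_char, end_char


# The PERSON entity covers token 4 only, i.e. the token span (4, 5).
start_char, end_char = span_char_range(sentence, tokens, 4, 5)
print(start_char, end_char, sentence[start_char:end_char])  # 14 20 Oliver
```

The (start, end) pairs stored as keys in entity_scores are token indices of this kind, which is why they cannot be used directly as character positions.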

So I actually found an answer right after posting this question (typical).

The error came from reading start_char from the Doc; the attribute belongs to each entity (a Span). So instead of saving the information into entity_scores, I iterate over the entities that were actually found with for ent in doc.ents:, which gives me access to all the standard spaCy attributes. See below:

ners = defaultdict(list)
ners['text'] = str(sentence)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for ent in doc.ents:
            if (score > threshold):
                ners['extractions'].append({
                    "label": str(ent.label_),
                    "text": str(ent.text),
                    "confidence": round(score, 2),
                    "start_position": ent.start_char,
                    "end_position": ent.end_char
                })

My entire method ends up looking like this:

def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    threshold = 0.2
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    ners = defaultdict(list)
    ners['text'] = str(sentence)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for ent in doc.ents:
                if (score > threshold):
                    ners['extractions'].append({
                        "label": str(ent.label_),
                        "text": str(ent.text),
                        "confidence": round(score, 2),
                        "start_position": ent.start_char,
                        "end_position": ent.end_char
                    })

    pprint(ners)
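
One caveat with the loop above: score is the probability of a whole beam parse, so every entity in doc.ents receives the same value, and each entity is appended once per parse that clears the threshold. A possible alternative is to keep the per-entity sums from the original question and emit one record per entity. The sketch below shows only that bookkeeping in plain Python. The parses list is made-up data standing in for the beam output, and the function names entity_confidences and confident_entities are hypothetical.

```python
from collections import defaultdict

# Made-up beam output for illustration: (parse probability, entities), where
# each entity is (start_token, end_token, label). Real values would come from
# the beam parses in the method above.
parses = [
    (0.7, [(4, 5, "PERSON")]),
    (0.2, [(4, 5, "PERSON"), (0, 1, "ORG")]),
    (0.1, []),
]
threshold = 0.5


def entity_confidences(parses):
    """Sum the probability of every parse that contains a given entity."""
    scores = defaultdict(float)
    for score, ents in parses:
        for start, end, label in ents:
            scores[(start, end, label)] += score
    return dict(scores)


def confident_entities(parses, threshold):
    """Return one record per entity whose summed score exceeds the threshold."""
    return [
        {"start": start, "end": end, "label": label, "confidence": round(score, 2)}
        for (start, end, label), score in entity_confidences(parses).items()
        if score > threshold
    ]


print(confident_entities(parses, threshold))
```

With real spaCy output, doc[start:end] for each key is a Span, so its start_char and end_char attributes would supply the character positions for the record.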

Hi! This is a forum dedicated to our annotation tool Prodi.gy. While the discussion often touches on spaCy, as spaCy support is built into Prodigy, it's not the right place for general usage questions around spaCy.

Stack Overflow is a better fit, and I see you've already posted your question and solution there :slightly_smiling_face: https://stackoverflow.com/questions/61895995/get-the-start-and-end-position-of-found-named-entities