named entity extraction wrong

I manually trained a few entities like Skill, Role, Employer …

After training, I ran the model on the sample text below:


text="""s rocket eNgine combustion planner system using Python with the usage of restapi 
written in,Flask,with computation package written in, C++, and maintained database in,Amazon Redshift,with the help of,PostgreSQl,"""

doc = nlp(text)

# Group the extracted entity texts by their label
label_list = [str(l.label_) for l in doc.ents]
data = {}
for label in label_list:
    data[label] = [str(e) for e in doc.ents if e.label_ == label]

skills = ""

if 'SKILL' in data:
    skills = ','.join(data["SKILL"])
    print(skills)

It is working properly and giving correct output as below.

output:: Python,Flask,C,database,Amazon Redshift,PostgreSQL

But when I tried extracting skills from sentences in which the skills are not separated by commas, it failed to extract the correct skills, as the output below shows.
Output: flask,PostgreSQL,Amazon Redshift

sample text = """I worked on NASA's rocket engine planner system using Python with the usage of restapi written in Flask with computation package written in C++ and maintained database in Amazon Redshift with the help of PostgreSQL"""

Please help me extract skills when they are not separated by commas.

Thanks in advance.
Cheers,
Shiva

If all of your training examples have commas around the entities, then it makes sense that the model would learn that.

The best solution would be to do some text pre-processing to clean up the commas in your data. If you do think they’ll be useful, you can add two copies of the text to your training data: one copy where the commas are present, and one copy where the commas are absent. You’ll just have to adjust the span annotations so that the offsets remain correct after you’ve done the string processing.
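As a rough illustration of that suggestion, here is a minimal sketch in plain Python. It assumes the annotations are stored as (start, end, label) character offsets; the function name `strip_commas` and that span format are hypothetical, not part of spaCy or Prodigy. Each comma is replaced by a space, runs of whitespace are collapsed, and every span is remapped so that it still covers the same entity text.

```python
def strip_commas(text, spans):
    """Replace commas with spaces, collapse whitespace, and remap span offsets.

    `spans` is a list of (start, end, label) character offsets into `text`
    (an assumed format for this sketch).
    """
    out = []        # characters of the cleaned text
    index_map = []  # index_map[i] = position of text[i] in the cleaned text
    for ch in text:
        if ch == ",":
            ch = " "
        index_map.append(len(out))
        if ch.isspace():
            # Keep at most one space between words, and none at the start.
            if out and out[-1] != " ":
                out.append(" ")
        else:
            out.append(ch)
    index_map.append(len(out))  # sentinel so that end == len(text) also maps

    cleaned = "".join(out).rstrip()
    new_spans = [
        (min(index_map[start], len(cleaned)), min(index_map[end], len(cleaned)), label)
        for start, end, label in spans
    ]
    return cleaned, new_spans


text = "restapi written in,Flask,with database in,Amazon Redshift,"
spans = []
for skill in ["Flask", "Amazon Redshift"]:
    start = text.index(skill)
    spans.append((start, start + len(skill), "SKILL"))

cleaned, new_spans = strip_commas(text, spans)

# Keep both copies as training examples: with commas and without.
training_examples = [(text, spans), (cleaned, new_spans)]

print(cleaned)
print([cleaned[start:end] for start, end, _ in new_spans])
```

Replacing commas with spaces rather than deleting them keeps neighbouring words from running together in text such as "in,Flask,with".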

How do I clean up the commas in the data?

Could you please share some Python code for cleaning up the commas?

Thanks in advance.
Cheers,
Shiva

I’m sorry, but general Python programming support is outside the scope of the help we can offer. You could try looking for a consultant here: spaCy/prodigy consultants?, or perhaps more generally on a site like freelancer.com

Hi Matthew,

Thanks for the response.

In fact, I was able to find the skill 'Flask' in the pre-trained JSONL file.

{"label":"SKILL","pattern":[{"lower":"database"}]}
{"label":"SKILL","pattern":[{"lower":"restapi"}]}
{"label":"SKILL","pattern":[{"lower":"Flask"}]}
{"label":"SKILL","pattern":[{"lower":"PostgreSQl"}]}
{"label":"SKILL","pattern":[{"lower":"Amazon Redshift"}]}
{"label":"SKILL","pattern":[{"lower":"amzon redshifit"}]}
{"label":"SKILL","pattern":[{"lower":"amazonredshift"}]}

But it does not identify the keyword Flask in the following phrase.

> text = """I worked on NASA's rocket eNgine combustion planner system using Python with the usage of restapi
> written in Flask with computation package written in C++ and maintained database in Amazon Redshift with
> the help of PostgreSQl"""
> skill_set = nlp(text)
> for skill in skill_set.ents:
>     print(skill)

It successfully identifies 4 skills:

NASA
Python
C++
database

But it is unable to identify skills like Flask, Restapi, PostgreSQl and Amazon Redshift, even though these skills are in the trained JSONL file and in the given input. And yes, as you mentioned, we pre-processed the text before training it: we removed extra spaces, commas, etc.

Can you please tell me where I'm going wrong?

Thanks in Advance.
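
One thing worth checking in the patterns above, assuming the JSONL file is used as spaCy-style token patterns: the `lower` attribute is compared against the lowercased text of a single token. A value containing uppercase letters ("Flask", "PostgreSQl") therefore cannot equal any lowercased token, and a value containing a space ("Amazon Redshift") cannot equal a single token. A minimal sketch of generating the patterns instead of writing them by hand follows. The helper name `make_pattern` is hypothetical, and splitting on whitespace only approximates the real tokenizer.

```python
import json


def make_pattern(label, phrase):
    """Build one pattern entry with one {"lower": ...} dict per word.

    Splitting on whitespace is an approximation of the tokenizer, so
    phrases containing punctuation may need manual adjustment.
    """
    token_patterns = [{"lower": word.lower()} for word in phrase.split()]
    return {"label": label, "pattern": token_patterns}


skills = ["database", "restapi", "Flask", "PostgreSQl", "Amazon Redshift"]

# One JSON object per line, as in a JSONL patterns file.
jsonl_lines = [json.dumps(make_pattern("SKILL", skill)) for skill in skills]
print("\n".join(jsonl_lines))
```

With this, "Amazon Redshift" becomes a two-token pattern (`amazon`, then `redshift`) and "Flask" is stored as `flask`, so matching no longer depends on the casing in the input text.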
