Hi! I am trying spaCy and Prodigy for the first time, and I would like to build an NER model to extract skills from English text. I have some doubts about the correct workflow to follow for this task.
These are the steps I've taken so far:
- Created a new dataset with the command:
python -m prodigy dataset linkedin_skills_dataset "NER skills dataset"
- Started manual NER annotation from a text file (one sentence per line):
python -m prodigy ner.manual linkedin_skills_dataset en_core_web_sm D:\jobs.txt --label "SKILL"
- Exported the annotations:
python -m prodigy db-out linkedin_skills_dataset /tmp
At this point I noticed something strange in the exported annotations. For example, for the following sentence:
We are looking for an expert Hadoop developer.
where I annotated the word Hadoop with the label SKILL, I got the following in the output file:
{"spans":[{"token_end":6,"end":35,"token_start":6,"label":"SKILL","start":29}],"answer":"accept","_view_id":"ner_manual","_input_hash":240160175,"tokens":[{"id":0,"start":0,"end":2,"text":"We"},{"id":1,"start":3,"end":6,"text":"are"},{"id":2,"start":7,"end":14,"text":"looking"},{"id":3,"start":15,"end":18,"text":"for"},{"id":4,"start":19,"end":21,"text":"an"},{"id":5,"start":22,"end":28,"text":"expert"},{"id":6,"start":29,"end":35,"text":"Hadoop"},{"id":7,"start":36,"end":45,"text":"developer"},{"id":8,"start":45,"end":46,"text":"."}],"text":"We are looking for an expert Hadoop developer.","_task_hash":1289245207,"_session_id":"linkedin_skills_dataset-default"}
Is it normal that all the tokens are included in the output? This looks different from the annotated file in this tutorial: Improve a Named Entity Model. In particular, this is its first line:
{"text":"This was taken during the Easter celebrations at Real de Catorce, MX.","spans":[{"start":66,"end":68,"text":"MX","rank":0,"label":"PRODUCT","score":0.9525150387,"source":"core_web_sm","input_hash":19964311,"answer":"reject"}],"meta":{"section":"photography","score":0.9525150387},"_input_hash":19964311,"_task_hash":1331479932,"answer":"reject"}
From what I can see, the tokens are not included there, only the spans. Am I missing some parameter in the export command?
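To make the difference concrete, here is a minimal Python sketch of what I mean, using only the standard library. It loads the record exported for my example sentence, checks that the span offsets really point at "Hadoop", and drops the token-level fields so the record has roughly the same shape as the tutorial line. The helper name strip_tokens is my own invention, not part of Prodigy, and I am only assuming that the token information is optional extra data; I don't know whether removing it is actually needed.

```python
import json

# The record exported by db-out for the example sentence (copied from my output file).
record = json.loads(
    '{"spans":[{"token_end":6,"end":35,"token_start":6,"label":"SKILL","start":29}],'
    '"answer":"accept","_view_id":"ner_manual","_input_hash":240160175,'
    '"tokens":[{"id":0,"start":0,"end":2,"text":"We"},'
    '{"id":1,"start":3,"end":6,"text":"are"},'
    '{"id":2,"start":7,"end":14,"text":"looking"},'
    '{"id":3,"start":15,"end":18,"text":"for"},'
    '{"id":4,"start":19,"end":21,"text":"an"},'
    '{"id":5,"start":22,"end":28,"text":"expert"},'
    '{"id":6,"start":29,"end":35,"text":"Hadoop"},'
    '{"id":7,"start":36,"end":45,"text":"developer"},'
    '{"id":8,"start":45,"end":46,"text":"."}],'
    '"text":"We are looking for an expert Hadoop developer.",'
    '"_task_hash":1289245207,"_session_id":"linkedin_skills_dataset-default"}'
)


def strip_tokens(example):
    """Hypothetical helper (not a Prodigy function): copy a record without token data."""
    # Drop the top-level "tokens" list.
    slim = {key: value for key, value in example.items() if key != "tokens"}
    # Drop the token indices from each span, keeping the character offsets and label.
    slim["spans"] = [
        {key: value for key, value in span.items()
         if key not in ("token_start", "token_end")}
        for span in example.get("spans", [])
    ]
    return slim


slim = strip_tokens(record)
span = slim["spans"][0]

# The character offsets alone are enough to recover the annotated word.
annotated = record["text"][span["start"]:span["end"]]
print(annotated)
print(json.dumps(slim))
```

So the spans themselves look consistent with the tutorial format; the only extra parts in my export are the tokens list and the token_start/token_end indices.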
I would also like to ask whether, for this task, I should use a blank model or whether it is OK to start from a pre-existing spaCy model like en_core_web_sm. If it is better to start with a blank model, what is the right command to do so?
I tried omitting the spacy_model parameter in the ner.manual command but got an error:
python -m prodigy ner.manual linkedin_skills_dataset D:\jobs.txt --label "SKILL"
Error: -> OSError: [E050] Can't find model 'D:\jobs.txt'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Thank you for any info!