Job details extraction from HTML page

First of all, thanks for the excellent APIs, tools and support service you provide. I am a complete beginner with Prodigy. I learned how to use it just by reading your tutorials, videos and the support forum. I still have several doubts, so before posting here I went through the existing Prodigy support topics in depth.

My task is to extract the job title, salary, location, reference (not all jobs have this), company name, contact name (not all jobs have this) and job description (paragraphs). According to the existing suggestions from Ines Montani and Matthew Honnibal, the job description will be a classification problem, whereas the others are NER. I found a couple of topics which are very close to my task.

  1. Gather 1000 job details pages as HTML.
  2. Remove sections like similar jobs, related jobs, recommended jobs, recently visited jobs, footer, menu, forms etc., and extract the remaining text as newline-separated sentences.
  3. Start labelling using Prodigy with the en_core_web_sm model.
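
Step 2 could be sketched roughly like this, using only the Python standard library. This is a minimal sketch, not what the poster actually runs: the `SKIP_TAGS` set and the `html_to_lines` helper are hypothetical, and it assumes the unwanted sections sit inside tags such as `nav`, `footer` or `form`, which many real sites will not respect.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these tags. Real sites will need more rules
# (for example class-based matching for "similar jobs" blocks).
SKIP_TAGS = {"script", "style", "nav", "footer", "form"}


class TextExtractor(HTMLParser):
    """Collects visible text, one entry per text node, skipping SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.skip_depth == 0:
            self.lines.append(text)


def html_to_lines(html):
    """Return the remaining page text as newline-separated lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

One limitation of this sketch: inline tags such as `<b>` split a sentence into several lines, because every text node becomes its own line.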

I am at an early stage, so this is a very basic annotation question. Sometimes the job title and company name appear multiple times. Can I pick any occurrence at random, or does it depend on the context?

first.txt
Recruitment Coordinator, 11 month FTC - Job ID: 896411 | Amazon.jobs | London
Recruitment Coordinator, 11 month FTC
Job ID: 896411 | Amazon Dev Centre (London) Ltd
Apply now
DESCRIPTION
Amazon’s Prime Video is a premium on-demand video entertainment service that offers customers the greatest choice in what to watch from popular Prime Original TV shows (made by Amazon Studios) such as The Grand Tour, Jack Ryan and the recent Golden Globe winning The Marvelous Mrs. Maisel to Prime Original Movies like the Oscar-winning Manchester by the Sea and The Salesman.
BASIC QUALIFICATIONS
· Experience multi-tasking in a fast paced, dynamic work environment.· Experience managing calendars using Outlook or a similar tool.· Experience with MS Word and Excel.· Bachelor’s degree or equivalent experience.
PREFERRED QUALIFICATIONS
· Goal-oriented and self-motivated.· Demonstrated commitment to customer service.· Highly organized with a keen attention to detail.· Strong verbal and written communication skills.· Ability to thrive in a fast-paced, quickly changing environment.·
Job details
London (Greater London Area), EnglandUnited Kingdom, Europe
Human Resources
© 1996-2019, Amazon.com, Inc. or its affiliates

second.txt
Telesales job in Romford, Greater London | Travis Perkins plc group careers
Login
Telesales
Business:
Benchmarx Kitchens & Joinery
Sector:
Branch, Store & Showroom
Location:
Romford, Greater London
Salary:
£Competitive +Excellent Benefits
Hours of work:
Part Time - 22 Hours a week
Position type:
Permanent
Job type:
Part Time
Date posted:
10-Jul-2019
Job reference:
22698
Apply for this job
Shortlist
Job Description
Part time position- 22 hours a week
Joining our business as a Telesales/Customer Service Advisor; you will be responsible for new business generation. You will be in the branch calling new leads, calling lapsed clients and following up sales leads, creating brand awareness and generating new business interest and feeding this back to your branch to follow up.
In turn, we would train you up to be a Kitchen Designer to enable to you use CAD and progress you into a Kitchen Designer role if desired.
Benchmarx is a major supplier to the UK building trade. Part of the Travis Perkins Group who own the likes of Wickes, City Plumbing supplies, Tile Giant and many others, we pride ourselves on being a great place to work. We’re a top employer that looks after our people and empowers them to look after our business and our loyal customer base. Benchmarx was established in 2006 and already has 180 branches in the UK and are growing and expanding rapidly.
Alternative job titles that may be used for this role are: Business Development Executive / Business Developer / Lead Generator / Telesales / Sales Assistant / Customer Service Assistant
#LI-DNI
Apply for this job
Shortlist
Send this job by email
Email me jobs like this
Print job

Does it matter where I annotate? Which one is correct for second.txt:
[image: first annotation option]

OR

[image: second annotation option]

  1. The job description consists of paragraphs, so it needs multi-sentence annotation. I am doing it like below. Is that the correct approach? Do I need to annotate the line "Job Description" as well?

Hi! It does sound like you're on the right track and the NER problem vs. text classification problem distinction is definitely important.

There's not really an easy answer to that – ultimately, this comes down to what you want your model to learn and how you define your annotation scheme. That said, one problem you have here is that there are very few "real" sentences and you're mostly dealing with text fragments. If you only label the first or only the last instance of "Telesales", your model will likely get very confused, because it'd try to learn that "Telesales here is a job title and over here it's not", and somehow try to make sense of that. Similarly, if it's a location, label it as "location" consistently and then extract what type of location it is in a later step.
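
To make the consistency point concrete, here is a minimal sketch that marks every occurrence of a phrase with the same label, so no instance is left unlabelled by accident. The helper name `label_all_occurrences` and the label `JOB_TITLE` are made up for illustration. The output uses character offsets (`start`, `end`) plus a `label` per span, which is the shape Prodigy uses for spans on a task. Whether every occurrence really should get the label still depends on your annotation scheme.

```python
import re


def label_all_occurrences(text, phrase, label):
    """Build a task dict with one span for every occurrence of `phrase`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"  # whole-word, case-sensitive match
    spans = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in re.finditer(pattern, text)
    ]
    return {"text": text, "spans": spans}


# Example input: a few lines in the style of second.txt.
example = "Telesales job in Romford\nTelesales\nLocation:\nRomford, Greater London"
task = label_all_occurrences(example, "Telesales", "JOB_TITLE")
```

You could then review such pre-labelled tasks by hand and reject the spans that do not fit, instead of deciding per page which occurrence to annotate.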

Hmm, that looks unnecessarily tedious? The result you're looking for is the start and end of the paragraph, right? And your goal is to then train a text classifier? If so, you could just split your incoming text into paragraphs and then for each paragraph, label whether it's part of a job description or not.
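
A rough sketch of that splitting step, assuming the input is already newline-separated text with one paragraph per line (as in the example files above). The helper names and the `meta` fields are hypothetical. Each paragraph becomes one task with a `"text"` key, written as one JSON object per line.

```python
import json


def paragraphs_to_tasks(text, source):
    """Turn each non-empty line into its own task for a yes/no classification pass."""
    tasks = []
    for line_no, line in enumerate(text.splitlines()):
        paragraph = line.strip()
        if paragraph:
            # "meta" keeps track of where the paragraph came from.
            tasks.append({"text": paragraph, "meta": {"source": source, "line": line_no}})
    return tasks


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task, ensure_ascii=False) for task in tasks)
```

Each paragraph can then be accepted or rejected as "part of the job description", which is much quicker than highlighting long spans by hand.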

Again, that's up to you and your annotation scheme. But in your example, it's part of the job description and there's a pretty strong signal in there (even outside of the machine learning context). Everything between the "Job description" or "Description" headline up to the next headline is the job description – so it's possible that a rule-based approach that exploits this will perform better than any model you'll train.
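
As an illustration of such a rule, here is a minimal sketch that collects everything between a description headline and the next known headline. Both headline sets are assumptions taken from the two example pages in this thread, not a complete list, and `extract_description` is a hypothetical helper.

```python
# Assumption: headlines that open the job description.
START_HEADLINES = {"job description", "description"}
# Assumption: headlines or buttons that end it, taken from the two examples above.
STOP_HEADLINES = {
    "basic qualifications",
    "preferred qualifications",
    "job details",
    "apply for this job",
}


def extract_description(text):
    """Return the lines between a start headline and the next stop headline."""
    collected = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.rstrip(":").lower()  # normalize "Job Description:" etc.
        if not inside:
            if key in START_HEADLINES:
                inside = True
            continue
        if key in STOP_HEADLINES:
            break
        if stripped:
            collected.append(stripped)
    return "\n".join(collected)
```

If no start headline is found, the function returns an empty string, which is a useful signal that the page needs a different rule or a model.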

Out of curiosity, why did you determine that this should be a Machine Learning task, vs. a scripted scraping task?

I've done a lot of similar scraping work, and unless you're working with a LOT of sites with wildly different information, it seems like this is the type of task that's better suited to just using BeautifulSoup to write a scraper, rather than trying to teach a new model how to parse HTML to give you job information.

Obviously, if you've found it's more effective/efficient to use ML, I'd love to learn more about that, so I can incorporate it into my workflow! :)

Yes, I am doing this for 1000s of different websites, and each has a different schema. Currently we are using rule-based scraping. We already have production applications using BeautifulSoup, but we end up with a lot of rules for each website, which is difficult to maintain. Sometimes sites change their schema because of a redesign or for some other reason, so we can't scrape that website any more and it needs manual fixing, which is endless. Also, some websites don't use the proper tags, so they are difficult to scrape. I will definitely let you know how I progress using ML.

Ines Montani, after my last post I learned more about Prodigy and have been doing a lot of labelling. Thanks for your suggestions. Hopefully by the end of this month I can use batch-train.