NER for scraped data

I am working on data scraped from different URLs, all of them music-related. I am trying to recognize custom entities in the scraped data. One cleaned example with its entities is below.

('Celebrate! A Festive Season Opener - Mobile Symphony Orchestra Login Buy Tickets Donate Contact Season & Single Tickets on Sale Now! Arrive early for parking; downtown [more] Mobile Symphony Orchestra Mobile Symphony Orchestra Mailing Address: P.O. Box 3127 Mobile, AL 36652 If you would like to visit us in person: 257 Dauphin St. Mobile, AL 36602 Box Office: 251-432-2010 Menu HOME ABOUT US About Us Staff Board of Directors Employment MEMBERSHIPS CONCERTS & TICKETS Concerts 2021-2022 Season Online Concert Viewing Program Book Seating Chart Students Big Red Ticket Group Sales Take Note! Pre-Concert Talk Gift Ticket VISIT FAQ Hotel Partners Directions & Parking Merchant Partners Venue Info Larkins Music Center EDUCATION Programs Education Schedules About MSO Education MSO Education Supporters Educational Videos ORCHESTRA Scott Speck Scott on Russian Music Musicians At Home Performances Auditions Backstage YOUTH ORCHESTRA About MSYO Dr. Iv n del Prado MSYO Season Schedule MSYO Audition Information SUPPORT Sponsors Planned Giving Volunteer Donate This event has passed. Celebrate! A Festive Season Opener October 2 - October 3 | $20 $89 Single tickets on sale August 23, 2021 Saturday, October 2, 2021 | 7:30 p.m. Sunday, October 3, 2021 | 2:30 p.m At my very first concert as Music Director, we played Tchaikovsky s Symphony no. 4. It signaled the beginning of an era, and this feels like a new beginning as well. It s no secret that the Mobile Symphony loves to play Tchaikovsky, and our musicians will tackle the symphony with their trademark passion and force. Scott Speck This year, a new season gives us more reasons than ever to celebrate! Program: Adolphus Hailstork | Fanfare on Amazing Grace Samuel Ward | America the beautiful Tchaikovsky | Symphony No. 4 This concert is sponsored by: Laura Lee Pattillo Norquist Charitable Foundation The Metcalfe Charitable Trust All concerts are subject to change. Masks covering nose and mouth are required for this concert. 
Buy Tickets + Google Calendar+ iCal Export Details Start: October 2 End: October 3 Cost: $20 $89 Event Category: Classics Website: https://secure.mobilesymphony.org/TheatreManager/1/online Venue Saenger Theatre 6 South Joachim Street Mobile, AL, AL 36602 United States Phone: 251-432-2010 Related Events Bella Musica November 13 - November 14 The Fireworks of Jupiter January 22, 2022 - January 23, 2022 Beethoven and Blue Jeans March 12, 2022 - March 13, 2022 All Events Serenade Music City Hitmakers Join our E-List Name Email EmailThis field is for validation purposes and should be left unchanged. Concert schedule & Tickets Directions & Parking FAQ Donate education programs msyo schedule news archive Donate today. An investment in the MSO is not just a tax write off, it is an investment in your community. By donating to the arts, you are enriching the world around you. Donate Today (251) 432-2010 2019 - Mobile Symphony Orchestra | Powered by e-worc marketing and advertising sitemap | Privacy Policy Mobile Symphony',
     [(201, 226, 'Organization.Name'),
      (1763, 1777, 'CreativeWorks.OriginalTitle'),
      (2353, 2369, 'Performances.Date'),
      (1749, 1760, 'Persons.FullName'),
      (2414, 2428, 'Performances.Date'),
      (1251, 1259, 'Performances.Starttime'),
      (2431, 2439, 'Performances.Date'),
      (1727, 1748, 'CreativeWorks.OriginalTitle'),
      (1576, 1587, 'Persons.FullName'),
      (1667, 1685, 'Persons.FullName'),
      (2899, 2924, 'Organization.Name'),
      (1215, 1224, 'Performances.Starttime'),
      (1081, 1115, 'Productions.Name'),
      (830, 841, 'Persons.FullName'),
      (37, 62, 'Organization.Name'),
      (1713, 1724, 'Persons.FullName'),
      (0, 34, 'Productions.Name'),
      (1688, 1712, 'CreativeWorks.OriginalTitle'),
      (1197, 1212, 'Performances.Date'),
      (175, 200, 'Organization.Name'),
      (2180, 2195, 'Auditorium.Description'),
      (1233, 1248, 'Performances.Date')])

I have several thousand of these examples. In the tagged list, "CreativeWorks.OriginalTitle" is repeated three times and the data is the same for that tag. Should I repeat that tag or use it only once? Also, does this data have a chance of working? I haven't worked with such large data before.

Hi! It looks like you already have potential training data and it should be pretty easy to convert these scraped examples to spaCy's format and train a model (see "Training Pipelines & Models" in the spaCy usage documentation). Once you've trained an initial model, you can run it over unseen data and see how it performs, and use a workflow like Prodigy's ner.correct to create more training examples and improve it.

One thing I'd recommend, though: I would start off using more general entity labels like Person or Date instead of Persons.FullName or Performances.Date etc. You'd otherwise be making your life a lot harder, and the local context may not always provide enough clues for the model to make these very fine-grained distinctions. You'll be able to achieve much better results if you're training on fewer, more general labels, and you can always add another step for the more fine-grained labels afterwards if you care about them.
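As a rough sketch of that relabeling step, the snippet below maps the scraped labels onto a smaller set before training. The coarse label names (PERSON, DATE and so on) and the `coarsen` helper are made up for illustration; pick whatever scheme fits your data.

```python
# Hypothetical mapping from the scraped fine-grained labels to coarser ones.
# The coarse names on the right are assumptions, not a required scheme.
COARSE_LABELS = {
    "Persons.FullName": "PERSON",
    "Performances.Date": "DATE",
    "Performances.Starttime": "TIME",
    "Organization.Name": "ORG",
    "CreativeWorks.OriginalTitle": "WORK",
    "Productions.Name": "EVENT",
    "Auditorium.Description": "VENUE",
}


def coarsen(spans, mapping=COARSE_LABELS):
    """Relabel (start, end, label) spans; drop spans whose label is unmapped."""
    return [
        (start, end, mapping[label])
        for start, end, label in spans
        if label in mapping
    ]


# Three spans taken from the example in the question.
spans = [
    (830, 841, "Persons.FullName"),
    (1197, 1212, "Performances.Date"),
    (1215, 1224, "Performances.Starttime"),
]
coarse = coarsen(spans)
```

Leaving a label out of the mapping is one way to drop a category that turns out to be a poor fit, since unmapped spans are simply skipped.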

You might even want to define your own label scheme and group some of the extracted categories together, or remove some of them altogether if they're not a good fit. The thing about structured data like this is that the categories assigned here are based on some classification scheme that wasn't necessarily designed for a machine learning model and what's most efficient to predict.

Also double-check that your scraped data doesn't contain any overlapping annotations, because named entities can't overlap.
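One way to run that check is sketched below. It assumes the spans are (start, end, label) tuples with an exclusive end offset, as they appear to be in the example; `find_overlaps` is a hypothetical helper, and the third span in the sample list is invented to trigger an overlap.

```python
def find_overlaps(spans):
    """Return pairs of (start, end, label) spans that share at least one character.

    Assumes end offsets are exclusive, so (0, 34) and (34, 40) do not overlap.
    """
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    overlaps = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Sorted by start, so no later span can overlap `current`.
                break
            overlaps.append((current, other))
    return overlaps


spans = [
    (0, 34, "Productions.Name"),
    (37, 62, "Organization.Name"),
    (30, 40, "Organization.Name"),  # invented span that overlaps both others
]
conflicts = find_overlaps(spans)
```

Any example where `conflicts` is non-empty needs a decision before training, for instance keeping the longer span or the label you care about more.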

If these are all separate instances, then yes, you definitely want those in the data! When you update the model with a text + entity annotations, you're basically telling it: "this is the correct information for all the tokens in the data". If you only tag one instance, what the model will learn is: "these words are not an entity", which is not what you want here.
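If your scraper only recorded some of the mentions, a simple exact-match pass can find the remaining ones. This is only a sketch: `tag_all_occurrences` is a hypothetical helper, the sample text is made up, and exact string matching will also hit mentions that aren't entities in context (for example a name repeated in a navigation menu), so the results are worth reviewing.

```python
import re


def tag_all_occurrences(text, phrase, label):
    """Return a (start, end, label) span for every exact match of `phrase`."""
    return [
        (match.start(), match.end(), label)
        for match in re.finditer(re.escape(phrase), text)
    ]


# Made-up text in which the same title appears twice.
text = "Program: Symphony No. 4. Encore: Symphony No. 4"
phrase = "Symphony No. 4"
spans = tag_all_occurrences(text, phrase, "CreativeWorks.OriginalTitle")
```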
