Handling partially annotated spreadsheet data?

Hello, I have a somewhat unusual situation.

I’m trying to extract named entities from niche product data coming from a number of different sources. Each source is either CSV, Excel, JSON, or XML, so for each value in a row I have a column header that provides some context. However, these columns are inconsistent from source to source. Suppose I have three entity types I want to extract, e.g. Size, Product Name, and Brand Name: in some sources they appear in separate columns, but in others they are a single value like “Some_Brand Some_Product, Large”.

Now, I could just join all my data together and ignore the column headers, but I don’t know if that is wise. An entity I’m trying to recognize, like “Price” for instance, can easily be confused with “Size” or some other numeric value when it is stripped of its context.

To compound all this, the “column headers” (in quotes because I have JSON and XML data too, so it is not exactly spreadsheet-style data) are of course inconsistent as well.

I’d like to purchase Prodigy for at least a month to see if I can tackle this problem, but I’d like to know in advance if there is a strategy I can use to ‘hint’ the POS.

Ideas? Thank you for any help you can provide. Understanding how domain-specific NER is done has been… quite a journey…

Having data from diverse sources in a variety of formats is no problem. Simply feed in data from one source at a time, annotate, and then stop the server when you want to queue up the next set of data. This way you don’t have to think too hard about making things uniform. You can just have different code to process the different data.
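As a rough sketch of what "different code to process the different data" could look like: one small loader per format, each producing the same row shape, followed by a shared step that turns rows into annotation tasks. The function names, the `record_tag="item"` default, and the `{"text", "meta"}` task layout are assumptions made for illustration (the layout is meant to resemble the JSONL tasks Prodigy reads, but check the docs for your version). Excel is left out because it needs a third-party reader.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def rows_from_csv(text):
    """Yield one {header: value} dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def rows_from_json(text):
    """Yield one dict per record, assuming a top-level JSON array of objects."""
    yield from json.loads(text)


def rows_from_xml(text, record_tag="item"):
    """Yield one dict per record element, using child tag names as 'headers'.

    record_tag is a hypothetical element name; real vendor feeds will differ.
    """
    root = ET.fromstring(text)
    for record in root.iter(record_tag):
        yield {child.tag: (child.text or "").strip() for child in record}


def to_tasks(rows, source):
    """Turn rows into task dicts, one per non-empty field.

    The original header and source name are kept in "meta" so they stay
    visible as context without being part of the text itself.
    """
    for row in rows:
        for header, value in row.items():
            if value is None or str(value).strip() == "":
                continue
            yield {
                "text": str(value).strip(),
                "meta": {"source": source, "header": header},
            }
```

Each source then gets queued on its own, e.g. `to_tasks(rows_from_csv(csv_text), "vendor_a.csv")`, and the annotation step never needs to know which format the data came from.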

Ahh, sorry, I guess I didn’t fully explain my use case. I currently have around sixty sources, and in the future I’ll have many, many more. I don’t want to have to re-annotate and re-train for each one.

Our goal is to use an NLP approach so that we don’t have to hard-code a mapping between columns and how to interpret them, since the columns are so inconsistent and the data varies so much from provider to provider.

Essentially, we’re processing inventory data from vendors who are signing up all the time. To onboard them quickly, we need to process their inventory in whatever format they have and extract several key attributes, so we can match each row against a canonical product db.

In addition, vendors often have products which are NOT in our canonical db, so there is tremendous value for us in being able to extract entities and create new canonical data from our vendors, so that our universe of products keeps expanding going forward.

I was trying to more abstractly describe the situation before, but I suppose all this stuff is very case-by-case.

What I’ve been envisioning is joining together each row’s values, but then using the ‘headers’ to partially tag/label the joined values somehow. Maybe I’m just confusing myself, or there is a simpler solution, like excising fields altogether if I already know what they are (price, for example). But a generic solution is really what I’m looking for, so we don’t have to annotate for every new vendor.
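To make the idea concrete, here is a minimal sketch of joining a row’s values and using recognised headers to pre-label part of the result. The `HEADER_TO_LABEL` table, the `row_to_example` name, and the `" | "` separator are all made up for illustration. The output uses character-offset spans (`start`, `end`, `label`), which I believe is the shape annotation tools such as Prodigy accept for pre-annotated text, but that is an assumption to verify.

```python
# Hypothetical mapping from known header spellings to entity labels.
# Headers that are not listed simply contribute unlabelled text.
HEADER_TO_LABEL = {
    "price": "PRICE",
    "cost": "PRICE",
    "size": "SIZE",
    "brand": "BRAND",
    "brand_name": "BRAND",
}


def row_to_example(row, separator=" | "):
    """Join a row's values into one text and pre-label values with known headers."""
    text_parts = []
    spans = []
    offset = 0  # character position where the next value will start
    for header, value in row.items():
        value = str(value).strip()
        if not value:
            continue
        if text_parts:
            # Account for the separator placed before every value but the first.
            offset += len(separator)
        label = HEADER_TO_LABEL.get(header.strip().lower())
        if label is not None:
            spans.append({"start": offset, "end": offset + len(value), "label": label})
        text_parts.append(value)
        offset += len(value)
    return {"text": separator.join(text_parts), "spans": spans}


example = row_to_example(
    {"Description": "Some_Brand Some_Product, Large", "Price": "19.99"}
)
print(example["text"])
print(example["spans"])
```

In this example the “Price” column is labelled up front, while the free-text “Description” value is left unlabelled for the model or the annotator to handle. A new vendor with unfamiliar headers still produces usable text, just with fewer hints.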