Matching Company Names

omurbali · May 31, 2024, 2:21pm

Hi there,

I am working on a project where I need to match company names that were collected as free text. There are many cases to consider, such as abbreviations, misspellings, company extensions, etc.
I want to use prodi.gy to decide whether two text inputs refer to the same company or not.

Is there a way to use prodi.gy for this type of task?

I would appreciate any guidance.

Best

magdaaniol · June 3, 2024, 6:43am

Welcome to the forum @omurbali!

It would be easier to help if you could show us a concrete example of the input and the information you expect to extract. I'm not sure whether you want to train a model that predicts whether two pieces of text are similar, or whether you want to determine if an extracted company name is an abbreviation or misspelling of another company name mentioned in the same text or in some external database of company names.
A sample of the input and the information you want annotated would help a lot - thanks!

omurbali · June 4, 2024, 6:24am

Hi there,
The second description fits our case better.
Here is some sample data, where label indicates whether the two names refer to the same company (1) or not (0):

[
  {
    "company_freetext_by_user": "Yapı Merkezi İnşaat Ve Sanayi A.Ş.",
    "company_registered_name": "Yapı Merkezi İnşaat ve Sanayi A.Ş.",
    "label": 1
  },
  {
    "company_freetext_by_user": "Sendeo Dağıtım Hizmetleri A.S.",
    "company_registered_name": "Sendeo Dağıtım Hizmetleri Anonim Şirketi",
    "label": 1
  },
  {
    "company_freetext_by_user": "Medicalpark Hastanesi",
    "company_registered_name": "Medical Park Hastaneler Grubu",
    "label": 1
  },
  {
    "company_freetext_by_user": "Atasu nOptik",
    "company_registered_name": "ATASUN OPTİK",
    "label": 1
  },
  {
    "company_freetext_by_user": "Yıldız Holding As",
    "company_registered_name": "ANADOLU HOLDİNG A.Ş.",
    "label": 0
  }
]

magdaaniol · June 10, 2024, 8:46am

Hi @omurbali ,
Apologies for the delay in response!
Do you already have a way to associate company_freetext_by_user with company_registered_name, or do you need to use NLP to generate a list of candidate registered names that likely match the free-text input?
If your data is already paired up the way you presented it, it should be fairly easy to write a simple custom recipe that combines the html and choice interfaces and lets the annotator choose between two labels:

import prodigy
import srsly
from prodigy.util import set_hashes


def make_tasks(stream):
    """Turn each raw record into a task with an html block and two options."""
    for example in stream:
        free_text = example.get("company_freetext_by_user")
        registered_name = example.get("company_registered_name")
        # Transform the raw input into html to be rendered in the UI
        html = f"<p>{free_text}</p><p>{registered_name}</p>"
        # Add options with the labels to choose from
        options = [{"id": 0, "text": "Same"}, {"id": 1, "text": "Different"}]
        task = {"html": html, "options": options}
        # Assign input/task hashes so Prodigy can identify and deduplicate tasks
        task = set_hashes(task)
        yield task


@prodigy.recipe(
    "compare.company.names",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
)
def minimal_recipe(
    dataset: str,
    source: str,
):
    # Read the records lazily, one JSON object per line
    stream = srsly.read_jsonl(source)
    stream = make_tasks(stream)

    # Render the html once on top; "html": None stops the choice block
    # from rendering the same html a second time
    blocks = [{"view_id": "html"}, {"view_id": "choice", "html": None}]
    return {
        "view_id": "blocks",
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,
        "config": {
            "blocks": blocks
        },
    }

This would result in an interface that shows the two names, with the Same and Different options below them.

One idea to improve this basic interface would be to use a string similarity metric, e.g. Levenshtein distance, to suggest whether the two names are likely the same or different. If they are the same according to the metric, you could render them both with a green background, otherwise with a red background or something similar. I recommend checking out our docs on custom html & css to see what's possible.
This logic would be added to the make_tasks function, and you could easily use a third-party Python library to help you with the Levenshtein distance calculation, e.g. this one.
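
As a rough sketch of that idea (not an official Prodigy recipe): the snippet below uses a small pure-Python Levenshtein implementation instead of a third-party library, so it runs on its own. The similarity helper, the 0.8 threshold, the colours and the score stored under "meta" are all assumptions made for illustration, and you would want to tune them on your own data. Note that casefold() is only a naive normalisation and does not handle the Turkish dotted/dotless I correctly.

```python
import html

# Hypothetical cut-off: pairs scoring at or above this are shown in green.
THRESHOLD = 0.8


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def make_tasks(stream, threshold: float = THRESHOLD):
    for example in stream:
        free_text = example.get("company_freetext_by_user") or ""
        registered_name = example.get("company_registered_name") or ""
        score = similarity(free_text, registered_name)
        # Green background for likely matches, red otherwise
        color = "#d4edda" if score >= threshold else "#f8d7da"
        markup = (
            f'<div style="background: {color}; padding: 10px">'
            f"<p>{html.escape(free_text)}</p>"
            f"<p>{html.escape(registered_name)}</p>"
            "</div>"
        )
        options = [{"id": 0, "text": "Same"}, {"id": 1, "text": "Different"}]
        yield {
            "html": markup,
            "options": options,
            # Shown in the corner of the card so annotators can see the score
            "meta": {"similarity": round(score, 2)},
        }
```

In the recipe above you would still pass each task through set_hashes before yielding it. The colour is only a hint for the annotator: a plain edit distance will likely score pairs such as an abbreviated versus a spelled-out legal suffix as dissimilar, so the final decision stays with the person annotating.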

omurbali · June 10, 2024, 1:41pm

Thank you for your detailed guidance. I'll try to implement it - and let you know if I run into further problems.

Best
