I am working on a project where I need to match company names that were collected in free-text format. There are many different cases that I need to consider, such as abbreviations, misspellings, company extensions, etc.
I want to use prodi.gy to decide whether two different text inputs are the same or not.
Is there a way to use prodi.gy for that type of task?
It would perhaps be easier if you could show us a concrete example of the input and the expected information to be extracted. I'm not sure whether you want to train a model that predicts whether two pieces of text are similar, or one that predicts whether the extracted company name is an abbreviation/misspelling of another company name mentioned in the same text or in some external database of company names.
It would help if you could show us the input and the expected information you want annotated. Thanks!
Hi @omurbali ,
Apologies for the delay in response!
Do you already have a way to associate the company_freetext_by_user with the company_registered_name, or do you need to use NLP to generate a list of potential registered names that likely match the free-text input?
If you already have the data associated in the way you presented, it should be fairly easy to write a simple custom recipe that mixes html and choice interfaces and lets the annotator choose between two labels:
import prodigy
import srsly
from prodigy.util import set_hashes


def make_tasks(stream):
    for example in stream:
        free_text = example.get("company_freetext_by_user")
        registered_name = example.get("company_registered_name")
        # Transform the raw input into html to be rendered in the UI
        html = f"<p>{free_text}</p><p>{registered_name}</p>"
        # Add options with the labels to choose from
        options = [{"id": 0, "text": "Same"}, {"id": 1, "text": "Different"}]
        task = {"html": html, "options": options}
        # Assign input and task hashes so Prodigy can identify each example
        task = set_hashes(task)
        yield task


@prodigy.recipe(
    "compare.company.names",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
)
def minimal_recipe(
    dataset: str,
    source: str,
):
    stream = srsly.read_jsonl(source)
    stream = make_tasks(stream)
    # "html": None keeps the choice block from rendering the task html a second time
    blocks = [{"view_id": "html"}, {"view_id": "choice", "html": None}]
    return {
        "view_id": "blocks",
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,
        "config": {
            "blocks": blocks
        }
    }
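For reference, the recipe above expects a JSONL source in which every line carries both fields. Here is a minimal sketch of how such a file could be produced; the company names and the company_pairs.jsonl file name are made-up placeholders, not taken from your data:

```python
import json

# Hypothetical example pairs; only the two key names come from the recipe above.
pairs = [
    {"company_freetext_by_user": "Acme Corp.", "company_registered_name": "Acme Corporation"},
    {"company_freetext_by_user": "Globex", "company_registered_name": "Initech Ltd"},
]


def to_jsonl(rows):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


if __name__ == "__main__":
    # Placeholder file name; pass the same path to the recipe as `source`.
    with open("company_pairs.jsonl", "w", encoding="utf8") as f:
        f.write(to_jsonl(pairs))
```

Assuming the recipe is saved as recipe.py (again just a placeholder name), it could then be started with something like: prodigy compare.company.names company_names company_pairs.jsonl -F recipe.py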
One idea to improve this basic interface would be to use a string similarity metric, e.g. Levenshtein distance, to suggest whether the two names are likely the same or different. If they are the same according to the metric, you could render them both with a green background, otherwise with a red background, or something similar. I recommend checking out our docs on custom html & css to see what's possible.
This logic would be added to the make_tasks function, and you could easily use a third-party Python library to help you with the Levenshtein distance calculation, e.g. this one.
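As a rough sketch of that idea, here is a dependency-free version that computes the Levenshtein distance in plain Python instead of using a third-party library. The helper names (levenshtein, similarity, make_html), the 0.8 threshold, and the colors are all assumptions for illustration, so you would want to tune them on your own data:

```python
from html import escape


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1], ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def make_html(free_text: str, registered_name: str, threshold: float = 0.8) -> str:
    """Render both names on a green (likely same) or red (likely different) background."""
    score = similarity(free_text, registered_name)
    color = "#d4edda" if score >= threshold else "#f8d7da"  # assumed colors
    return (
        f'<div style="background: {color}; padding: 10px">'
        f"<p>{escape(free_text)}</p><p>{escape(registered_name)}</p>"
        f"<p><small>similarity: {score:.2f}</small></p></div>"
    )
```

In make_tasks, the html line would then become html = make_html(free_text, registered_name). The color is only a hint for the annotator: abbreviations can score low even when the two names refer to the same company, so the final decision still comes from the Same/Different choice.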