Is it possible to merge 2 labels into 1? & How to add a corpus to a specific label

Hi, I accidentally created two labels for the same type of value when continuing my previous process. Let's say they are "SERVICE" and "SERVICES". Is it possible to merge the data of both labels into one label? Or do I just need a script that replaces every "SERVICES" with "SERVICE" after exporting? I think that would affect the data as well, but I'm not sure :frowning:

Let's say I also have a "LOCATION" label, and I want to add a location corpus that I got from the client to it. How can I accomplish this?

Thank you! Sorry for my bad English.

Yes, I would recommend exporting the data with db-out, opening it in your editor and replacing "SERVICE" with "SERVICES". Search for the label including its quotes, so that existing "SERVICES" labels are not changed too. It's just the label name, so it shouldn't cause any problems with the data. When you're done, import the edited data to a new dataset.

(Prodigy doesn't allow just changing an existing set, because your data would then be out-of-sync with what the annotator saw. It would also make it easier to accidentally lose data, which is bad. So if you want to edit data manually, you'll need to export and import to a new set.)
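If you'd rather not do the replacement by hand, here is a minimal sketch of the same idea as a script. It assumes the export is JSONL (one task per line) and that the labels live in each task's "spans" and, for some recipes, in a top-level "label" key. The function names and file names are only placeholders.

```python
import json


def rename_label(task, old="SERVICE", new="SERVICES"):
    """Rename a label in one exported task without touching its text."""
    for span in task.get("spans", []):
        if span.get("label") == old:
            span["label"] = new
    # Assumption: some tasks also carry a task-level "label" key.
    if task.get("label") == old:
        task["label"] = new
    return task


def convert_file(in_path, out_path, old="SERVICE", new="SERVICES"):
    """Read an exported JSONL file and write a copy with the label renamed."""
    with open(in_path, encoding="utf8") as f_in, \
            open(out_path, "w", encoding="utf8") as f_out:
        for line in f_in:
            if line.strip():
                task = rename_label(json.loads(line), old, new)
                f_out.write(json.dumps(task, ensure_ascii=False) + "\n")


# Example call with placeholder file names:
# convert_file("export.jsonl", "merged.jsonl")
```

Because only the label values are compared, a text that happens to contain the word "SERVICE" is left alone. The resulting file can then be imported to a new dataset with db-in.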

Is the location corpus from your client already annotated? If so, you can convert it to Prodigy's format and then import it to the dataset using the db-in command. You can find more details on the JSON format in the "Annotation task formats" section of your PRODIGY_README.html.

It should hopefully be easy to write a small script that converts your data. For NER, you need the original text, and the start/end character offsets and label for each entity. For example:

{
    "text": "Apple updates its analytics service with new metrics",
    "spans": [{ "start": 0, "end": 5, "label": "ORG" }]
}

Maybe try it with a new dataset and a small sample first, to make sure it all works correctly :slightly_smiling_face:
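For the small sample, a quick sanity check along these lines can catch wrong offsets before you import anything. This is only a sketch: check_spans is a made-up helper, and it only assumes the "text" and "spans" keys shown in the example above.

```python
def check_spans(task):
    """Return a list of problems found in one task's spans (empty if OK)."""
    text = task["text"]
    problems = []
    for span in task.get("spans", []):
        start, end = span["start"], span["end"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"bad offsets {start}-{end}")
        elif text[start:end] != text[start:end].strip():
            problems.append(f"span {start}-{end} has leading/trailing whitespace")
    return problems


example = {
    "text": "Apple updates its analytics service with new metrics",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

# Print each label next to the text it covers, to eyeball the result.
for span in example["spans"]:
    print(span["label"], repr(example["text"][span["start"]:span["end"]]))
```

If the printed text is not the entity you expected, the offsets in your conversion script are probably off by one somewhere.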

Unfortunately not, it's raw: a list of location names in my country. So I guess the solution is just to write a script that converts my location list into Prodigy's JSON format, right?

Ah okay! So you have raw text plus a location list? Then you could have a small script that loads the raw data and adds the spans by matching the locations (using regular expressions or something like spaCy's rule-based matcher).

You can then either import it to the dataset directly, or load the data with ner.manual so you can correct the spans first. (This depends on the quality of the location list and the data.)
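As a rough sketch of the regular expression approach, assuming a list of location names (for example one per line in a file) and raw texts as plain strings: the helper names, the sample locations and the sample sentence here are all invented for illustration, and matching is case-sensitive and exact, so spelling variants would need extra handling.

```python
import json
import re


def build_pattern(locations):
    """Compile one regex that matches any of the given location names."""
    names = {loc.strip() for loc in locations if loc.strip()}
    if not names:
        raise ValueError("location list is empty")
    # Longest names first, so "New York City" wins over "New York".
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


def add_location_spans(text, pattern, label="LOCATION"):
    """Build one task dict with character offsets for every match."""
    spans = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in pattern.finditer(text)
    ]
    return {"text": text, "spans": spans}


def write_jsonl(tasks, path):
    """Write the tasks as JSONL, one task per line."""
    with open(path, "w", encoding="utf8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


# Invented sample data, only to show the shape of the result.
pattern = build_pattern(["Paris", "New York", "New York City"])
task = add_location_spans("She flew from Paris to New York City.", pattern)
```

The list of names on its own has no text to annotate, so the spans always come from matching the names against your raw texts. How well this works depends on how ambiguous the names are, which is why correcting the output in ner.manual first can be worth it.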

How can I insert it into the dataset directly? There are so many entries, and the file just contains one location per line.