post-editing (PE) of machine-translated (MT)

myeghaneh · November 11, 2021, 9:11pm

Hi,
Is there any way to use Prodigy for post-editing (PE) of machine-translated (MT) texts?

Imagine I have a corpus in JSONL format that has been machine-translated from English to Persian (the translation is also in JSONL format). Is there any way to edit it after translation in Prodigy? If not, do you have any framework in mind that works with spaCy in Python?

Many thanks in advance

BramVanroy · November 12, 2021, 11:37am

I would also like this, but I don't think Prodigy (currently) intends to be a CAT tool.

myeghaneh · November 12, 2021, 11:42am

Thank you for your response. Do you know of any other tool that you can recommend?

@BramVanroy

ines · November 14, 2021, 10:35am

One way to do this could be to use the blocks UI with two blocks: a text block for the original input text and a text_input block for the editable output text: https://prodi.gy/docs/api-interfaces#text_input

In the underlying JSON data, you could store the original translation in a separate key as well, so you'll keep the reference to the original output, plus the edited translation (if the annotator makes edits).
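A minimal sketch of what such a record could look like. The key names `mt_original` and `translation` are made up for illustration; `translation` is assumed to be the `field_id` given to the text_input block.

```python
import json


def make_task(source_text, mt_text):
    """Build one post-editing task as a plain dict.

    "text" is what the text block displays, "mt_original" keeps the
    untouched MT output, and "translation" (a hypothetical key) starts
    out as a copy of the MT output for the annotator to edit.
    """
    return {
        "text": source_text,
        "mt_original": mt_text,
        "translation": mt_text,
    }


task = make_task("Hello world", "سلام دنیا")
# One line of the JSONL input file; ensure_ascii=False keeps Persian readable.
print(json.dumps(task, ensure_ascii=False))
```

After annotation, comparing `mt_original` with `translation` shows whether the annotator changed anything.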

(If you wanted to add a separate review step later, you could even use the diff UI to visualize the original translation and any edits made to it: https://prodi.gy/docs/api-interfaces#diff)
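For the review step, a rough sketch of turning post-edited records into tasks for the diff UI. It assumes the diff interface reads the two versions from `removed` and `added` keys (please check the linked docs), and the input keys `mt_original` and `translation` are hypothetical names for the raw MT output and the edited text.

```python
def to_diff_tasks(records, original_key="mt_original", edited_key="translation"):
    """Keep only records that were actually edited and map them to diff tasks."""
    tasks = []
    for rec in records:
        original, edited = rec[original_key], rec[edited_key]
        if original == edited:
            continue  # nothing to review if the annotator left the MT as-is
        tasks.append({"removed": original, "added": edited})
    return tasks


records = [
    {"mt_original": "a quick translation", "translation": "a fast translation"},
    {"mt_original": "unchanged", "translation": "unchanged"},
]
review_tasks = to_diff_tasks(records)
```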

myeghaneh · November 15, 2021, 11:42am

Thank you very much for your quick response. It looks like exactly what I want. I read the links, but I did not understand how to start.

Can you elaborate a bit on how I can use the blocks UI?

it looks here...

I now have a JSONL file of my MT translation. Which command should I use?
Thank you in advance.

I have changed the code of Prodigy's ner.manual recipe as follows:

    # Inside the ner.manual recipe function: NER view plus a free-text input field
    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "text_input", "field_rows": 3, "field_label": "edit the text"},
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "blocks": blocks
        },
    }

Now it is working! :)

I also found a way to switch the text direction from ltr to rtl (right to left), by changing this setting in the Prodigy config:

"writing_dir": "rtl"

Using this command:

!python -m prodigy ner.manual PersianAMTV ../data/blank-farsi-model "../data/PersianAMT.jsonl" --label "PREMISE","CLAIM" --highlight-chars

I can have this

which is very good! How does it sound?

Now these questions are coming up:

0- I am using ner.manual, but I do not want to use any NER labels. Is what I am doing correct?
1- How can I autofill the editing field with the same text, so that the annotator (translator) only has to change it a bit?
2- How can I add the original text (which is in another JSONL file) to this?

Many, many thanks

ljvmiranda921 · November 16, 2021, 10:07am

Hi, @myeghaneh !

It is possible to use your own labels when using ner.manual. Also, another tip: if you're labelling PREMISE and CLAIM, try using spans.manual instead. It may be better suited to your use case, because you're not dependent on token boundaries and some of your spans might overlap.

It should still be possible with a custom interface. The field_id also lets you customize the key in the JSON where the data is stored. You can then pre-populate that key in the data with some text.

You can always load your database with another file. Better yet, you can write a data transformation script to combine all these JSONL files together and load it in one go.
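Such a transformation script could look roughly like this. It is only a sketch: it assumes both files have one record per line in the same order, that each record keeps its text under a `"text"` key, and the output keys `mt_original` and `translation` are invented for illustration.

```python
import json


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, rows):
    """Write dicts as JSONL, keeping non-ASCII (e.g. Persian) text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def merge_parallel(english_rows, persian_rows):
    """Pair each English record with the MT record on the same line."""
    if len(english_rows) != len(persian_rows):
        raise ValueError("files are not aligned: different number of records")
    merged = []
    for en, fa in zip(english_rows, persian_rows):
        merged.append({
            "text": en["text"],         # original English, shown read-only
            "mt_original": fa["text"],  # untouched MT output
            "translation": fa["text"],  # editable copy (hypothetical field_id)
        })
    return merged


# In-memory demo; with real files you would call read_jsonl() on each path
# and pass the result of merge_parallel() to write_jsonl().
merged = merge_parallel([{"text": "Hello"}], [{"text": "سلام"}])
```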

myeghaneh · November 16, 2021, 10:25am

Thank you for your quick and great response!

The problem with my post was that I mixed up two things. Let me try to explain.

The first aim is to translate the text from English to Persian. I used an API, but as you know the result is not very good, so I decided to use Prodigy to post-edit the translation output. At this stage, therefore, I do not need my NER labels.

So now I have a JSONL file containing the MT translation.

1- As you can see from the code above, I managed to add the text input box, but I do not know how to populate it with the "default" text. This is my latest setup for ner.manual:

    # Inside the ner.manual recipe function: plain text view plus an editable field
    blocks = [
        {"view_id": "text"},
        {"view_id": "text_input", "field_id": "Edit", "field_rows": 4,
         "field_label": "Edit the text", "field_autofocus": True},
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "blocks": blocks
        },
    }

I used this command:

!python -m prodigy ner.manual PersianAMTV ../data/blank-farsi-model "../data/PersianAMT.jsonl" --label "PREMISE","CLAIM" --highlight-chars

As mentioned, I do not need the "claim" and "premise" labels here. I only want to enter the edited translation (and maybe also show the English version, for which I guess I need to add it to my database as you mentioned), save it, and at the end have it ready for the next stage, which is NER annotation. It is a bit confusing, I know, but what I want to do at this stage is only post-editing of my machine translation output using Prodigy, so that I get a JSONL file of the post-edited version. In the next stage I then want to annotate it for premise and claim.

By the way, the "ltr" vs. "rtl" direction for text with mixed languages (English and Persian) is another problem.

Thank you in advance, @ljvmiranda921. I would be happy to hear your ideas.

ljvmiranda921 · November 18, 2021, 1:00am

Hi @myeghaneh !

For that, I can point you to the image captioning tutorial, especially the image_caption_correct recipe, which does exactly what you're aiming to do. Check the field_id parameter: you can use it to customize the key in the JSON that the data is stored in, and then pre-populate that key in the data with some text.
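Applied to the translation case, the idea could be sketched like this. This is not the tutorial's code: the function names and the `translation` key are made up, and the Prodigy-specific parts (the `@prodigy.recipe` decorator and loading the JSONL stream) are left out so the sketch runs on its own.

```python
def add_default_edit(stream, field_id="translation"):
    """Pre-populate the key the text_input block writes to with the MT text."""
    for task in stream:
        task = dict(task)  # copy so the incoming record is not modified
        task.setdefault(field_id, task["text"])
        yield task


def post_edit_components(dataset, stream, field_id="translation"):
    """Return recipe components for a post-editing UI without any NER labels."""
    blocks = [
        {"view_id": "text"},  # read-only view of the task's "text"
        {
            "view_id": "text_input",
            "field_id": field_id,  # key in the JSON where the edit is stored
            "field_rows": 4,
            "field_label": "Edit the text",
            "field_autofocus": True,
        },
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": add_default_edit(stream, field_id),
        "config": {"blocks": blocks},
    }


components = post_edit_components("PersianAMTV", [{"text": "سلام دنیا"}])
tasks = list(components["stream"])
```

Because the task already carries a value under the `field_id` key, the input field should open with the MT text filled in, and whatever the annotator saves ends up under that same key.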

I saw that you opened another issue for it (thanks!). I'll answer there
