Post-editing (PE) of machine-translated (MT) texts

Hi,
Is there any way to use Prodigy for post-editing (PE) of machine-translated (MT) texts?

Imagine I have a corpus in JSONL format that has been machine-translated from English to Persian (the translation is also in JSONL format). Is there any way to edit it in Prodigy after translation? If not, do you have a framework in mind that works alongside spaCy in Python?

Many thanks in advance

I would also like this, but I don't think Prodigy (currently) intends to be a CAT tool.


Thank you for your response. Do you know of another tool that you can recommend?

@BramVanroy

One way to do this could be to use the blocks UI with two blocks: a text block for the original input text and a text_input block for the editable output text: https://prodi.gy/docs/api-interfaces#text_input

In the underlying JSON data, you could store the original translation in a separate key as well, so you'll keep the reference to the original output, plus the edited translation (if the annotator makes edits).
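
A minimal sketch of what that could look like, assuming the English source goes in "text" and the MT output is copied into two extra keys. The key names "mt_original" and "translation" and the helper make_task are made up for illustration; the block settings are the ones named in this thread and in the text_input docs.

```python
# Blocks layout: read-only source text on top, editable translation below.
BLOCKS = [
    {"view_id": "text"},  # renders the task's "text" key
    {
        "view_id": "text_input",
        "field_id": "translation",  # hypothetical key holding the editable text
        "field_rows": 4,
        "field_label": "Edit the translation",
    },
]


def make_task(source_text, mt_text):
    """Build one annotation task from a source sentence and its MT output."""
    return {
        "text": source_text,     # shown by the text block
        "mt_original": mt_text,  # hypothetical key: untouched MT reference
        "translation": mt_text,  # intended starting value of the input field
    }


task = make_task("Hello world", "سلام دنیا")
print(task)
```

Keeping the MT output under its own key means the saved annotation still holds the reference translation even after the annotator changes the editable copy.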

(If you wanted to add a separate review step later, you could even use the diff UI to visualize the original translation and any edits made to it: https://prodi.gy/docs/api-interfaces#diff)
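
For such a review step, the saved annotations could be converted roughly like this. This is only a sketch: it assumes the diff interface reads the old text from "removed" and the new text from "added" (please check the linked docs), and the keys "mt_original" and "translation" as well as the helper to_diff_task are hypothetical names.

```python
def to_diff_task(example, original_key="mt_original", edited_key="translation"):
    """Turn an annotated example into a task for a later review step.

    Assumes the diff interface reads "removed" (old text) and "added"
    (new text); the key names for the stored texts are hypothetical.
    """
    original = example[original_key]
    edited = example.get(edited_key, original)
    return {
        "removed": original,
        "added": edited,
        # Extra flag so unchanged translations can be skipped in review.
        "was_edited": edited.strip() != original.strip(),
    }


example = {"mt_original": "سلام دنیا", "translation": "سلام جهان"}
diff_task = to_diff_task(example)
print(diff_task["was_edited"])
```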


Thank you very much for your quick response; it looks like exactly what I want. I read the links, but I did not understand how to get started.

Can you elaborate a bit on how I can use the blocks UI?

it looks here...

I now have a JSONL file of my MT output. Which command should I use?
Thank you in advance.

I have changed the code of Prodigy's ner.manual recipe as follows:

    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "text_input", "field_rows": 3, "field_label": "edit the text"},
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "blocks": blocks
        },
    }

Now it is working! :)

I also found out how to switch from ltr to rtl (right to left), by setting this in the Prodigy config (prodigy.json):

"writing_dir": "rtl"

With this command:

!python -m prodigy ner.manual PersianAMTV ../data/blank-farsi-model "../data/PersianAMT.jsonl" --label "PREMISE","CLAIM" --highlight-chars

the result looks very good! How does it sound?
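
For reference, a small sketch of applying the "writing_dir" setting mentioned above from Python instead of editing the file by hand. The helper set_writing_dir is made up for illustration, and the path of the config file is left to the caller, since it depends on the setup.

```python
import json
from pathlib import Path


def set_writing_dir(config_path, direction="rtl"):
    """Set "writing_dir" in a JSON config file, keeping other settings."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["writing_dir"] = direction
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling set_writing_dir("prodigy.json") in the working directory would then add or update that one key and leave the rest of the file as it was.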

Now these questions come up:

0- I am using ner.manual but I do not want to use any NER labels. Is what I am doing correct?
1- How can I autofill the editing field with the same text, so that the annotator (translator) only has to change it a bit?
2- How can I add the original text (which is in another JSONL file) to this?

Many, many thanks

Hi, @myeghaneh !

It is possible to use your own labels when using ner.manual. Also, another tip: if you're labelling PREMISE and CLAIM, try using spans.manual instead. It may be better suited to your use case because you're not dependent on token boundaries and some of your spans might overlap.

It should still be possible with a custom interface. The field_id also lets you customize the key in the JSON where the data is stored. You can then pre-populate that key in the data with some text.
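
As a rough sketch of the pre-populating part: a small generator that copies each task's text into the key the text_input block edits. The helper prefill and the default key name "translation" are assumptions for illustration; the key has to match whatever field_id the block uses.

```python
def prefill(stream, field_id="translation", source_key="text"):
    """Copy each task's MT text into the key that text_input edits."""
    for task in stream:
        task = dict(task)  # shallow copy so the input is not mutated
        # Keep an existing value, e.g. from an earlier annotation pass.
        task.setdefault(field_id, task[source_key])
        yield task


tasks = [{"text": "سلام دنیا"}]
for task in prefill(tasks):
    print(task)
```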

You can always load another file into your database. Better yet, you can write a data transformation script to combine all these JSONL files and load them in one go.
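
Such a transformation script could look roughly like this. It is a sketch under two assumptions that may not hold for your files: both JSONL files store the sentence under "text", and they are aligned line by line. The function names and the output key "source" are made up for illustration.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def merge_jsonl(source_path, mt_path, out_path):
    """Pair source and MT records line by line and write combined tasks."""
    sources = list(read_jsonl(source_path))
    translations = list(read_jsonl(mt_path))
    if len(sources) != len(translations):
        raise ValueError("files must have the same number of records")
    with open(out_path, "w", encoding="utf-8") as out:
        for src, mt in zip(sources, translations):
            # "text" holds the MT output; "source" keeps the English original.
            task = {"source": src["text"], "text": mt["text"]}
            out.write(json.dumps(task, ensure_ascii=False) + "\n")
    return len(sources)
```

The combined file can then be passed to the recipe as a single source, so every task carries both the original and the translation.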


Thank you for your quick and great response!

The problem with my post was that I mixed two things. Let me try to explain.

The first aim is to translate the text from English to Persian. I used an API, but as you know the result is not very good, so I decided to use Prodigy to post-edit the translation. At this stage I therefore do not need my NER labels.

So now I have a JSONL file containing the MT output.

1- As you can see, using the code above I managed to add the text input box, but I do not know how to populate it with a "default" text. This is my latest setup for ner.manual:

    blocks = [
        {"view_id": "text"},
        {"view_id": "text_input", "field_id": "Edit", "field_rows": 4, "field_label": "Edit the text", "field_autofocus": True},
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "blocks": blocks
        },
    }

I used this command

!python -m prodigy ner.manual PersianAMTV ../data/blank-farsi-model "../data/PersianAMT.jsonl" --label "PREMISE","CLAIM" --highlight-chars    

As mentioned, I do not need the CLAIM and PREMISE labels here. I only want to enter the edited translation (maybe alongside the English version, which I guess means adding it to my data as you mentioned), save it, and have it ready for the next stage, which is NER annotation. It is a bit confusing, I know, but what I want to do at this stage is only post-editing of my machine translation output in Prodigy, so that I end up with a JSONL file of the post-edited version. In the next stage I want to annotate that file for premise and claim.
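
One possible way to get that JSONL file, as a sketch: export the dataset (for example with prodigy db-out) and keep only the edited text of accepted examples. The helper export_post_edited and the output key "mt_original" are made up for illustration; "Edit" is the field_id from my setup above, and falling back to the original text when nothing was typed is my own assumption.

```python
import json


def export_post_edited(annotations, field_id="Edit"):
    """Keep accepted examples and emit {"text": ...} records for the next step."""
    for eg in annotations:
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored examples
        # Fall back to the MT output if the field was left empty or missing.
        edited = (eg.get(field_id) or eg["text"]).strip()
        yield {"text": edited, "mt_original": eg["text"]}


annotations = [
    {"text": "سلام دنیا", "Edit": "سلام جهان", "answer": "accept"},
    {"text": "خداحافظ", "answer": "reject"},
]
for record in export_post_edited(annotations):
    print(json.dumps(record, ensure_ascii=False))
```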

By the way, "ltr" vs. "rtl" for text with mixed languages (English and Persian) is another problem :)

@ljvmiranda921, I would be happy to hear your thoughts.

Many thanks in advance

Hi @myeghaneh !

For that, I can point you to the image captioning tutorial, especially the image_caption_correct recipe. It does exactly what you aim to do. Check the field_id parameter: you can use it to customize the key in the JSON under which the data is stored, and then pre-populate that key in the data with some text.
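
Applied to translations, the same pattern could be sketched as below. This is not the image_caption_correct recipe itself, only an illustration of the idea: the function name post_edit_components and the default key "translation" are hypothetical, and in a real recipe file the function would be registered with the @prodigy.recipe decorator. It is left plain here so the sketch runs without Prodigy installed.

```python
import json


def post_edit_components(dataset, source_path, field_id="translation"):
    """Return recipe components for post-editing MT output."""

    def stream():
        with open(source_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    task = json.loads(line)
                    # Pre-populate the editable key with the MT output.
                    task.setdefault(field_id, task["text"])
                    yield task

    blocks = [
        {"view_id": "text"},
        {
            "view_id": "text_input",
            "field_id": field_id,
            "field_rows": 4,
            "field_label": "Edit the text",
            "field_autofocus": True,
        },
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream(),
        "config": {"blocks": blocks},
    }
```

Because no NER labels or tokenization are involved, this kind of recipe does not need ner.manual at all.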

I saw that you opened another issue for it (thanks!). I'll answer there 👍
