post-editing (PE) of machine-translated (MT)

Is there any way to use Prodigy for
post-editing (PE) of machine-translated (MT) texts?

Imagine I have a corpus that has been machine-translated from English to Persian; the translation is in JSONL format. Is there any way to edit it in Prodigy after translation? If not, do you have a framework in mind that works alongside spaCy in Python?

Many thanks in advance

I would also like this, but I don't think prodigy (currently) intends to be a CAT-tool.

Thank you for your response. Is there another tool that you can recommend?


One way to do this could be to use the blocks UI with two blocks: a text block for the original input text and a text_input block for the editable output text.

In the underlying JSON data, you could store the original translation in a separate key as well, so you'll keep the reference to the original output, plus the edited translation (if the annotator makes edits).
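As a rough illustration of that idea, here is a minimal sketch of the two pieces a custom recipe would need: a task stream and the components dictionary. It assumes each input record has a "text" key holding the MT output; the key names "mt_original" and "edited" and both function names are hypothetical, and in a real recipe the second function would be registered with Prodigy's recipe decorator.

```python
import json


def make_post_editing_tasks(lines, field_id="edited"):
    """Turn JSONL lines into tasks for a text + text_input blocks UI.

    Assumes every record stores the MT output under "text".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        mt_text = record["text"]
        yield {
            "text": mt_text,         # displayed by the "text" block
            "mt_original": mt_text,  # hypothetical key: untouched MT output
            field_id: mt_text,       # pre-fills the editable input field
        }


def post_editing_components(dataset, stream, field_id="edited"):
    """Build the dictionary a custom recipe would return."""
    blocks = [
        {"view_id": "text"},
        {
            "view_id": "text_input",
            "field_id": field_id,  # key where the edited text is stored
            "field_rows": 4,
            "field_label": "Edit the translation",
        },
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

Because the original MT output is copied into its own key, the saved annotation keeps both the untouched and the edited version side by side.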

(If you wanted to add a separate review step later, you could even use the diff UI to visualize the original translation and any edits made to it.)
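A possible sketch of preparing data for such a review step: it assumes the saved annotations keep the MT output under "mt_original" and the edited text under "edited" (both hypothetical key names), and that the diff interface reads the "removed" and "added" keys of each task, which is worth checking against the Prodigy documentation.

```python
def make_diff_tasks(annotations, original_key="mt_original", edited_key="edited"):
    """Yield one diff task per annotation whose translation was changed.

    The key names are assumptions about how the data was stored.
    """
    for eg in annotations:
        original = eg.get(original_key, "")
        edited = eg.get(edited_key, original)
        if edited == original:
            continue  # nothing was edited, so there is nothing to review
        yield {"removed": original, "added": edited}
```

Unedited examples are skipped so reviewers only see translations that were actually changed.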

Thank you very much for your quick response; it looks like exactly what I want. I read the links, but I did not understand how to start.

Can you elaborate a bit on how I can use the blocks UI?

it looks here...

I now have a JSONL file of my MT translation. Which command should I use?
Thank you in advance.

I have changed the code of Prodigy's ner.manual recipe as follows:

    # Snippet from inside the ner.manual recipe function; dataset, stream,
    # exclude, remove_tokens, highlight_chars, nlp and labels are defined
    # earlier in that recipe.
    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "text_input", "field_rows": 3, "field_label": "edit the text"},
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "blocks": blocks,
        },
    }

Now it is working! :)

I also found a way to switch the text direction from ltr to rtl (right to left), by changing this setting in the Prodigy config (prodigy.json):

"writing_dir": "rtl"

Using this command:

!python -m prodigy ner.manual PersianAMTV ../data/blank-farsi-model "../data/PersianAMT.jsonl" --label "PREMISE","CLAIM" --highlight-chars

I get this,

which is very good! How does it look?

Now these questions come up:

0- I am using ner.manual but I do not want to use any NER labels. Is what I am doing correct?
1- How can I autofill the edit field with the same text, so that the annotator (translator) only needs to change it slightly?

2- How can I add the original text (which is in another JSONL file) to this?

Many thanks.

Hi, @myeghaneh !

It is possible to use your own labels when using ner.manual. Also, another tip: if you're labelling PREMISE and CLAIM, try using spans.manual instead. It may be better suited to your use case because you're not dependent on token boundaries and some of your spans might overlap.

It should still be possible with a custom interface. The field_id also lets you customize the key in the JSON where the data is stored. You can then pre-populate that key in the data with some text.

You can always load your database with another file. Better yet, you can write a data transformation script to combine all these JSONL files together and load it in one go.
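A data transformation script of that kind might look like the sketch below. It assumes the English source file and the Persian MT file are line-aligned and that both store their text under a "text" key; the key "mt_original", the default field_id "Edit" and the function names are illustrative choices, not anything Prodigy prescribes.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping empty lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def combine(source_records, mt_records, field_id="Edit"):
    """Pair each source sentence with its machine translation.

    Assumes both lists are in the same order and have the same length.
    """
    if len(source_records) != len(mt_records):
        raise ValueError("source and MT files are not aligned")
    combined = []
    for src, mt in zip(source_records, mt_records):
        combined.append(
            {
                "text": src["text"],        # English source shown to the annotator
                "mt_original": mt["text"],  # untouched MT output, kept for reference
                field_id: mt["text"],       # pre-populates the text_input field
            }
        )
    return combined


def write_jsonl(path, records):
    """Write records as JSONL, keeping Persian characters readable."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The combined file can then be loaded in one go, for example by calling write_jsonl on the result of combine(read_jsonl(...), read_jsonl(...)) with your own file paths.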

thank you for your quick and great response !

The problem with my post was that I mixed two things. Let me try to explain.

The first aim is to translate the text from English to Persian. I used an API, but as you know the result is not very good, so I decided to use Prodigy to post-edit the translation. At this stage I therefore do not need my NER labels.

So I now have a JSONL file containing the MT translation.

1- As you can see from the code above, I managed to add the text input box, but I do not know how to populate it with the "default" text. This is my latest setup for ner.manual:

    # Snippet from inside the ner.manual recipe function; the other names
    # (dataset, stream, exclude, ...) come from that recipe.
    blocks = [
        {"view_id": "text"},
        {
            "view_id": "text_input",
            "field_id": "Edit",
            "field_rows": 4,
            "field_label": "Edit the text",
            "field_autofocus": True,
        },
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "blocks": blocks,
        },
    }

I used this command

!python -m prodigy ner.manual PersianAMTV ../data/blank-farsi-model "../data/PersianAMT.jsonl" --label "PREMISE","CLAIM" --highlight-chars    

As mentioned, I do not need the "claim" and "premise" labels here. I only want to enter the edited translation (and maybe also show the English version, for which I guess I need to add it to my database, as you mentioned), save the result, and have it ready for the next stage, which is NER annotation. It is a bit confusing, I know, but at this stage all I want is to post-edit the output of my machine translation in Prodigy and end up with a JSONL file of the post-edited version, which I will then annotate for premise and claim in the next stage.

By the way, "ltr" versus "rtl" for text with mixed languages (English and Persian) is another problem :slight_smile:

Thank you in advance, @ljvmiranda921.

I would be happy to hear your ideas.

many thanks in advance

Hi @myeghaneh !

For that, I can point you to the image captioning tutorial, especially the image_caption_correct recipe. It does exactly what you aim to do. Check the field_id parameter: you can use it to customize the key in the JSON where the data is stored, and then pre-populate that key in the data with some text.
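Once the post-editing pass is done, the saved annotations (for example exported with prodigy db-out) could be reduced to a plain JSONL input for the later premise/claim annotation. The following is only a sketch: it assumes the edited text is stored under the field_id "Edit" used earlier in this thread, that accepted examples carry "answer": "accept", and that "text" holds the MT output as a fallback; the function name is hypothetical.

```python
def post_edited_texts(annotations, field_id="Edit"):
    """Yield {"text": ...} records containing the post-edited translation.

    Only accepted examples are kept. If the edit field is missing or empty,
    the example's "text" is used instead (assumed here to be the MT output).
    """
    for eg in annotations:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        edited = (eg.get(field_id) or eg.get("text") or "").strip()
        if edited:
            yield {"text": edited}
```

Each resulting record has only a "text" key, which is the shape a recipe such as ner.manual or spans.manual expects as input.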

I saw that you opened another issue for it (thanks!). I'll answer there :+1:
