Relation Extraction annotation tests

Hi @ines,

In this other thread, you told me the following:

When you’re done with that, you could export the data and run one experiment where you link up highlighted spans close to each other and collect binary feedback on whether they are connected. You could either create data in the dep format (with a head and child), or use the choice interface with the options “For”, “Against”, “Support” and “Attack”.

So I was testing the two approaches, but I'm having a hard time making them work.

I'm first testing mapping relations between spans of text using dep.teach, but got nowhere. Here's what I did so far:

  1. First I tried loading a sample jsonl into dep.teach (with head and child, and a custom 'supports' relation) using the following command:

    prodigy dep.teach essays_relations en_core_web_sm example.jsonl --unsegmented

The jsonl content is:

{"text": "Should students be taught to compete or to cooperate?\n\nIt is always said that competition can effectively promote the development of economy. In order to survive in the competition, companies continue to improve their products and service, and as a result, the whole society prospers. However, when we discuss the issue of competition or cooperation, what we are concerned about is not the whole society, but the development of an individual's whole life. From this point of view, I firmly believe that we should attach more importance to cooperation during primary education.\nFirst of all, through cooperation, children can learn about interpersonal skills which are significant in the future life of all students. What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others. During the process of cooperation, children can learn about how to listen to opinions of others, how to communicate with others, how to think comprehensively, and even how to compromise with other team members when conflicts occurred. All of these skills help them to get on well with other people and will benefit them for the whole life.\nOn the other hand, the significance of competition is that how to become more excellence to gain the victory. Hence it is always said that competition makes the society more effective. However, when we consider about the question that how to win the game, we always find that we need the cooperation. The greater our goal is, the more competition we need. Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could win the game without the training of his or her coach, and the help of other professional staffs such as the people who take care of his diet, and those who are in charge of the medical care. The winner is the athlete but the success belongs to the whole team. 
Therefore without the cooperation, there would be no victory of competition.\nConsequently, no matter from the view of individual development or the relationship between competition and cooperation we can receive the same conclusion that a more cooperative attitudes towards life is more profitable in one's success.", "data": {"head": "the significance of competition is that how to become more excellence to gain the victory", "dep": "supports", "child": "competition makes the society more effective"}}

The tool loads and displays the full text, but it still runs the regular dependency annotation task: it highlights seemingly random single-word tokens and uses the regular dependency parsing labels. The example I got from the documentation used token ids, but since I'm predicting dependencies between spans rather than tokens, I changed the sample to contain the full span text instead of a token id.

The same jsonl in ner_manual format I created is here (28.8 KB). I have all these spans that could be marked with some sort of relation between them.

  2. As for using the choice interface, I'm not sure what annotation format I should present to Prodigy either. Since I'm trying to annotate pairs of spans, I'm thinking that I should present annotated pairs from my dataset, but I don't see how a trained model would handle this later on. Also, annotating pairs of spans would only make sense if they were highlighted in the original raw text, so the annotator could see them in context. How could I get this working in Prodigy: the full raw text, with two highlighted spans and my labels presented as options?

Thanks!

Yes, by default, the dep.teach recipe will use the dependency labels available in the model, and will ask you about its existing predictions. So if you use a regular spaCy model, those will be the syntactic dependencies. (If you wanted an active learning-powered workflow for your use case, you'd have to pre-train a dependency parser using your custom labels).

Another problem is that the dep interface currently expects to link up individual tokens instead of spans. This makes the interface less useful for your case, because you'd have to focus on the syntactic heads of the phrases, and it'd make it less obvious what the original spans are.

Yes, that's possible and probably the best option, now that I think about the problem again. The "choice" interface can render regular text, text with spans, images or HTML as both the input and the options. So the task would consist of the text and two selected spans, and the options would list the labels you're annotating. For more details, see the "Annotation task formats" section in your PRODIGY_README.html. Here's an abstract example:

{
    "text": "Some long text with two spans here...",
    "spans": [
        {"start": 0, "end": 20, "label": "Claim"},
        {"start": 50, "end": 70, "label": "Premise"}
    ],
    "options": [
        {"id": "Support", "text": "Support"},
        {"id": "Attack", "text": "Attack"}
    ]
}

To create that data, you could just iterate over your manually annotated examples and create one new example for each possible combination of Claim and Premise. So, you check the "spans", copy the example and create a new one that has two spans with claim 1 and premise 1, the next one with claim 2 and premise 1, and so on.

This sounds simple, but is potentially very effective: Even if the majority of the connected claims/premises are not related, it'll usually be very obvious and rejecting the example in the UI takes you less than a second. If you get a match, you can focus on the label option.
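Here's a minimal sketch of that conversion in plain Python. It assumes your annotated examples have "text" and "spans" with character offsets and the labels "Claim" and "Premise". The function name `make_pair_tasks`, the helper `span` and the sample text are made up for illustration, so adjust them to your data:

```python
import copy
import json
from itertools import product

# Options as in the abstract example; extend with your own labels.
OPTIONS = [
    {"id": "Support", "text": "Support"},
    {"id": "Attack", "text": "Attack"},
]


def make_pair_tasks(example, options=OPTIONS):
    """Yield one choice task per (Claim, Premise) combination in `example`."""
    spans = example.get("spans", [])
    claims = [s for s in spans if s.get("label") == "Claim"]
    premises = [s for s in spans if s.get("label") == "Premise"]
    for claim, premise in product(claims, premises):
        # Keep the two spans in text order so they render predictably.
        pair = sorted((claim, premise), key=lambda s: s["start"])
        yield {
            "text": example["text"],
            "spans": [
                {"start": s["start"], "end": s["end"], "label": s["label"]}
                for s in pair
            ],
            "options": copy.deepcopy(options),
        }


# Made-up input in the shape of a manually annotated example.
text = "Cooperation teaches social skills. Teams solve conflicts. Competition isolates."


def span(phrase, label):
    """Hypothetical helper: build a span dict from a phrase found in `text`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


example = {
    "text": text,
    "spans": [
        span("Cooperation teaches social skills", "Claim"),
        span("Teams solve conflicts", "Premise"),
        span("Competition isolates", "Premise"),
    ],
}

tasks = list(make_pair_tasks(example))
# One JSON object per line, ready to be saved as a .jsonl file.
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

If the number of combinations gets too large, one option would be to only pair spans that are within some character distance of each other, but that cutoff is something you'd have to tune on your own essays.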

Btw, once you've converted your data and added the options, you should be able to use the mark recipe to annotate the example with the choice interface:

prodigy mark your_dataset your_choice_data.jsonl --view-id choice

The data you get from this will have the same format as the example above, with an added "answer" and the selected option, e.g. "accept": ["Support"]. So if you filter out the rejected and ignored answers, you'll have a set of examples with two spans each and a label describing their relationship.
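The filtering step could look roughly like this. It's only a sketch: it assumes you've exported the dataset to JSONL (one JSON object per line), that each task has exactly two spans, and that one option was selected per accepted task. The function name `accepted_relations` and the sample records are hypothetical:

```python
import json


def accepted_relations(lines):
    """Yield text, span pair and relation label for each accepted task."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        # Skip rejected and ignored answers.
        if eg.get("answer") != "accept":
            continue
        selected = eg.get("accept") or []
        spans = eg.get("spans") or []
        # Assumption: exactly one selected option and two spans per task.
        if len(selected) != 1 or len(spans) != 2:
            continue
        yield {"text": eg["text"], "spans": spans, "label": selected[0]}


# Made-up records standing in for the lines of an exported dataset.
base = {
    "text": "Some long text with two spans here...",
    "spans": [
        {"start": 0, "end": 9, "label": "Claim"},
        {"start": 10, "end": 14, "label": "Premise"},
    ],
}
exported = [
    json.dumps({**base, "accept": ["Support"], "answer": "accept"}),
    json.dumps({**base, "accept": [], "answer": "reject"}),
    json.dumps({**base, "accept": [], "answer": "ignore"}),
]

relations = list(accepted_relations(exported))
print(relations[0]["label"])
```

With a real export, you'd pass the open file object instead of the `exported` list, since iterating over a file also yields one line at a time.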

OK, thanks @ines! I’ll test this.

In order to verify the setup, I’m trying to run this for a single example.

First I used the whole example, with the full raw text and every span correctly highlighted and classified in the tool. However, I would have to select a single option for the whole example, which doesn’t make sense. So I edited the jsonl to contain only the two spans that form a valid pair, accepted it with the proper option selected, and saved the dataset.

After dumping the dataset, this gave me the same jsonl I used for input, but with a new "accept":["Support"],"answer":"accept" property added to it. So, I’m thinking that the various combinations of spans you told me before would have to contain the same tokens and text values replicated for each span combination, right?

Thanks!

Nice, glad it worked as expected!

Yes, exactly. You can even leave the tokens out completely, because you already have the character offsets in the text and don't want to do manual labeling. So you just copy the text however many times you need it, and add the options and the span pairs. This should all be pretty easy to automate in Python 🙂

ok, got it!

Hi @ines!

I have the dataset ready, as we discussed here. But now that I think of it, since you pointed out earlier that the two spans with the options would be a better fit than the dep format, which Prodigy recipe do I use to train a classifier with this data? I need a classifier that takes two spans as input and labels the relation between them.

Thanks!