Segmenting examples with long spans as NERs

Continuing the discussion from Annotation for Argument Mining:

Hi @ines!

As you suggested in the other post, I'm handling the Argument Mining task as two separate problems: first classifying long spans as named entities (Claim, Major Claim and Premise), and then linking them with relations (supports / attacks).

Well, for the first part, I was able to produce a dataset with the annotated entities and their spans, marked as ‘accept’. Here’s a sample: essays_entities_sublist.jsonl (42.5 KB)
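For reference, each line of the JSONL is a standard Prodigy task, roughly like this (the text and character offsets below are invented for illustration):

```python
# One line of the JSONL, shown as a Python dict. The fields follow
# Prodigy's span-annotation format; the text and offsets are made up.
example = {
    "text": "Museums should be free because they preserve our shared culture.",
    "spans": [
        {"start": 0, "end": 22, "label": "MajorClaim"},
        {"start": 31, "end": 63, "label": "Premise"},
    ],
    "answer": "accept",
}
```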

I ran ner.batch-train with this dataset and got the following results:

I'm still looking into whether the results can be improved, but considering that the entities are quite long spans, I don't think they're bad at all. Any suggestions on how I could do better with the NER model? Or should I perhaps handle this as a different problem entirely, such as text classification?

But my actual problem is that if I don't use the 'unsegmented' parameter, the recipe loses the spans for each example after segmenting them. My dataset of 400 examples expands to a few thousand sentences, but only about 11 of those keep their original spans; all the other thousands have empty spans. Do you know what the problem could be?

Thanks!

Hi @pvcastro,

As I mentioned in the last thread, I'm sceptical about using the entity recognizer for these long spans. I think you should try applying sentence labels, and perhaps also marking words which are important for the category you're interested in. Then you can use the dependency parse to find the claim boundaries. You can find documentation about the dependency parser here: Linguistic Features · spaCy Usage Documentation
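For example, something along these lines (an untested sketch; the model and example sentence are just placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Museums should be free because they preserve our shared culture.")

# Suppose the annotator marked "preserve" as an important word for the
# Premise category. The token's syntactic subtree gives a candidate
# boundary for the whole premise.
trigger = next(t for t in doc if t.text == "preserve")
subtree = list(trigger.subtree)
span = doc[subtree[0].i : subtree[-1].i + 1]
print(span.text)  # e.g. "because they preserve our shared culture",
                  # depending on the parse
```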

There should be just as many spans whether you set unsegmented or not, unless the spans cross segmentation boundaries. This sounds like it might be a bug; we'll look into it. At first glance the segmentation function looks correct, and it's passing our tests, but I'll play around with your sample and see if I can find the problem.
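In the meantime, you can check whether boundary-crossing spans account for what you're seeing with something like this (a rough diagnostic sketch; it assumes your file path and uses spaCy's sentence segmentation, which may differ slightly from the recipe's):

```python
import json
import spacy

nlp = spacy.load("en_core_web_sm")

kept = crossing = 0
with open("essays_entities.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        for span in eg.get("spans", []):
            # A span can only survive segmentation if some sentence
            # fully contains it.
            if any(sent.start_char <= span["start"] and span["end"] <= sent.end_char
                   for sent in doc.sents):
                kept += 1
            else:
                crossing += 1

print(f"kept: {kept}, crossing a sentence boundary: {crossing}")
```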

Thanks for the tips @honnibal!

In case you need the full data to investigate the issue, here it is: essays_entities.jsonl (1.7 MB)

Do you think it makes sense to annotate the sentences using ner.manual, and then use the annotated dataset with a text classification recipe instead of ner.batch-train?

My concern is that the annotator would need access to the full text in order to label each sentence, since that provides the context necessary for the annotation decisions.
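In case it helps clarify what I mean, here's roughly how I picture the conversion, keeping the full essay in each task so it could be displayed for context (a rough sketch; the 'None' fallback label and the full_text meta field are my own invention):

```python
import json
import spacy

nlp = spacy.load("en_core_web_sm")

def spans_to_textcat(path):
    """Turn span-annotated examples into per-sentence textcat tasks."""
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            doc = nlp(eg["text"])
            for sent in doc.sents:
                # Label the sentence with any entity span it fully contains.
                labels = [s["label"] for s in eg.get("spans", [])
                          if sent.start_char <= s["start"]
                          and s["end"] <= sent.end_char]
                yield {
                    "text": sent.text,
                    "label": labels[0] if labels else "None",
                    # Carry the full essay along so it can be shown
                    # to the annotator for context.
                    "meta": {"full_text": eg["text"]},
                }

tasks = list(spans_to_textcat("essays_entities.jsonl"))
```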