Segmenting examples with long spans as NERs

Continuing the discussion from Annotation for Argument Mining:

Hi @ines!

As you suggested in the other post, I'm handling the Argument Mining task as two separate problems: first classifying long spans as named entities (Claim, Major Claim and Premise), and then linking them with relations (supports / attacks).

Well, for the first part, I was able to produce a dataset with the annotated entities and their spans, marked as ‘accept’. Here’s a sample: essays_entities_sublist.jsonl (42.5 KB)
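For reference, each line of the JSONL is a standard Prodigy task, roughly like this (the text and character offsets below are invented for illustration):

```python
# One line of the JSONL, shown as a Python dict. The fields follow
# Prodigy's span-annotation format; the text and offsets are made up.
example = {
    "text": "Museums should be free because they preserve our shared culture.",
    "spans": [
        {"start": 0, "end": 22, "label": "MajorClaim"},
        {"start": 31, "end": 63, "label": "Premise"},
    ],
    "answer": "accept",
}
```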

I ran ner.batch-train with this dataset and got the following results:

I'm still looking into whether the results can be improved, but considering that the entities are quite long spans, I don't think they're bad at all. Any suggestions on how I could do better with the NER model? Or should I perhaps handle this as a different problem entirely, such as text classification?

But my actual problem is that if I don't use the 'unsegmented' parameter, the recipe loses the spans for each example after segmenting them. My dataset of 400 examples expands to a few thousand sentences, but only about 11 of those keep their original spans; all the other thousands have empty spans. Do you know what the problem could be?

Thanks!

Hi @pvcastro,

As I mentioned in the last thread, I'm sceptical about using the entity recognizer for these long spans. I think you should try applying sentence labels, and perhaps also marking words which are important for the category you're interested in. Then you can use the dependency parse to find the claim boundaries. You can find documentation about the dependency parser here: Linguistic Features · spaCy Usage Documentation
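For example, something along these lines (an untested sketch; the model and example sentence are just placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Museums should be free because they preserve our shared culture.")

# Suppose the annotator marked "preserve" as an important word for the
# Premise category. The token's syntactic subtree gives a candidate
# boundary for the whole premise.
trigger = next(t for t in doc if t.text == "preserve")
subtree = list(trigger.subtree)
span = doc[subtree[0].i : subtree[-1].i + 1]
print(span.text)  # e.g. "because they preserve our shared culture",
                  # depending on the parse
```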

There should be just as many spans whether you set unsegmented or not, unless the spans cross segmentation boundaries. This sounds like it might be a bug; we'll look into it. At first glance the segmentation function looks correct, and it's passing our tests, but I'll play around with your sample and see if I can find the problem.
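In the meantime, you can check whether boundary-crossing spans account for what you're seeing with something like this (a rough diagnostic sketch; it assumes your file path and uses spaCy's sentence segmentation, which may differ slightly from the recipe's):

```python
import json
import spacy

nlp = spacy.load("en_core_web_sm")

kept = crossing = 0
with open("essays_entities.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        for span in eg.get("spans", []):
            # A span can only survive segmentation if some sentence
            # fully contains it.
            if any(sent.start_char <= span["start"] and span["end"] <= sent.end_char
                   for sent in doc.sents):
                kept += 1
            else:
                crossing += 1

print(f"kept: {kept}, crossing a sentence boundary: {crossing}")
```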

Thanks for the tips @honnibal!

In case you need the full data to investigate the issue, here it is: essays_entities.jsonl (1.7 MB)

Do you think it makes sense to annotate the sentences using ner.manual, and then use the annotated dataset with a text classification recipe instead of ner.batch-train?

My concern is that the annotator would need access to the full text in order to label each sentence, since that provides the context necessary for the annotation decisions.
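In case it helps clarify what I mean, here's roughly how I picture the conversion, keeping the full essay in each task so it could be displayed for context (a rough sketch; the 'None' fallback label and the full_text meta field are my own invention):

```python
import json
import spacy

nlp = spacy.load("en_core_web_sm")

def spans_to_textcat(path):
    """Turn span-annotated examples into per-sentence textcat tasks."""
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            doc = nlp(eg["text"])
            for sent in doc.sents:
                # Label the sentence with any entity span it fully contains.
                labels = [s["label"] for s in eg.get("spans", [])
                          if sent.start_char <= s["start"]
                          and s["end"] <= sent.end_char]
                yield {
                    "text": sent.text,
                    "label": labels[0] if labels else "None",
                    # Carry the full essay along so it can be shown
                    # to the annotator for context.
                    "meta": {"full_text": eg["text"]},
                }

tasks = list(spans_to_textcat("essays_entities.jsonl"))
```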