Text length for spancat model

Hi there, I'm working on a spancat project that involves classifying sentences / paragraphs within project documents. I want to build a categorizer that can recognize where in a document the text refers to challenges, goals, and results. What do you think would be the optimal length for the text that goes into the annotation system? I think that working at the sentence level can be a bit limiting, since sometimes it is important to consider the context at the paragraph level. Is there any specific limit, criterion, or threshold that would be wise to respect?
Thanks!

The main thing that jumps to mind is that longer texts typically come at a higher compute cost.

Could you elaborate a bit more on what you mean by "challenges", "goals" and "results" though? If I understand your use-case better I might be able to give better advice. Do you have an example?

Hello, thanks a lot for the reply and sorry for the delay in my response. Briefly, these three categories are related to different purposes within the project management documents:

  • Challenges: the description of needs / issues in a certain context, over which the project / program should work
  • Goals: the desired change that the project / program is trying to achieve
  • Results: the effect of the project / program over its implementation

In my project I am running a classification over these three categories on a large set of documents. Some of them are very long (e.g. reports by institutions, sometimes hundreds of pages, which therefore need to be stripped down a bit to be processed); others are project abstracts from the EU portal. I am giving two examples, with my classifications at the end of each text:

abstract: The proposed act aims to solve the housing problems of beneficiaries arising from the COVID-19 pandemic. This action enables all students of Greek higher education institutions to have equal and continuous access to the educational institutions they attend.
in this case:
Challenge: solve the housing problems of beneficiaries arising from the COVID-19 pandemic
Goal: This action enables all students of Greek higher education institutions to have equal and continuous access to the educational institutions they attend
Result: There is no result described since the text is still related to a proposal

Excerpt from a long text: Although inflation has flattened, high living costs are impacting households' ability to save and invest, and a high minimum wage is Children supporting the Aceh Youth Radio Program which affecting employment generation through new discussed youth issues related to the peace process. Despite a concerted effort by provincial and local governments and numerous donor supported programs aimed at creating a conducive business climate, investment in Aceh remains minimal. The draft Aceh Green policy framework employs a progressive approach to investment and development.
Politics, Security and Social Cohesion
Great progress has been achieved in reintegrating former combatants, political prisoners and returnees into social and political life in Aceh. However, the persistence of conflict-era identities and structures continues to thwart the full assimilation of some of these individuals into society. Levels of violent conflict have dropped dramatically since the signing of the Helsinki accords. Moreover, in the first half of 2009, the number of incidents of violence fell from previous years. Crime rates are well below those of neighbouring North Sumatra province and a perceived rise in criminality in 2008 that was undermining public trust in the peace process may have lessened with the successful elections and reduction in violent incidents. While this is encouraging, tensions over ongoing aid, mistrust between groups and dwindling reconstruction funds means that recent positive trends are not assured in the long term and ongoing attention is required.
Challenges: high living costs are impacting households' ability to save and invest, investment in Aceh remains minimal. persistence of conflict-era identities and structures continues to thwart the full assimilation of some of these individuals into society. tensions over ongoing aid, mistrust between groups and dwindling reconstruction funds means that recent positive trends are not assured in the long term and ongoing attention is required.
Goals:
Results: reintegrating former combatants, political prisoners and returnees, the number of incidents of violence fell from previous years.

These are just two examples.
Thanks a lot

At first glance, it seems like you're almost detecting full sentences. Would it perhaps be simpler to rephrase the problem as a sentence classification problem? Something like "does this sentence describe a challenge?" and "does this sentence describe a goal?"

It might be that this is a simpler problem to work with for now, but it might be that I'm oversimplifying things.

Note that spaCy can make it easy to turn the original text into sentences for you. This might be a helpful preprocessing snippet:

import spacy

# Requires the model to be installed first:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

text = "Although inflation has flattened, high living costs are impacting households' ability to save and invest, and a high minimum wage is Children supporting the Aceh Youth Radio Program which affecting employment generation through new discussed youth issues related to the peace process. Despite a concerted effort by provincial and local governments and numerous donor supported programs aimed at creating a conducive business climate, investment in Aceh remains minimal. "

doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
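
If you already have span-style annotations like the ones in your examples, one rough way to turn them into sentence labels is to give a sentence a category whenever an annotated span's text appears inside it. This is only a sketch: the helper `label_sentences` and the substring-matching rule are assumptions for illustration, not something spaCy or Prodigy provide, and it would miss spans that cross sentence boundaries.

```python
def label_sentences(sentences, spans_by_label):
    """Assign every label whose annotated span text occurs inside a sentence.

    Hypothetical helper: plain substring matching, so spans that cross
    sentence boundaries are not matched.
    """
    labelled = []
    for sentence in sentences:
        labels = sorted(
            label
            for label, spans in spans_by_label.items()
            if any(span in sentence for span in spans)
        )
        labelled.append((sentence, labels))
    return labelled


# Sentences as they might come out of `doc.sents` for the abstract example.
sentences = [
    "The proposed act aims to solve the housing problems of beneficiaries "
    "arising from the COVID-19 pandemic.",
    "This action enables all students of Greek higher education institutions "
    "to have equal and continuous access to the educational institutions "
    "they attend.",
]

# Span annotations copied from the abstract example in the thread.
spans_by_label = {
    "Challenge": [
        "solve the housing problems of beneficiaries arising from the "
        "COVID-19 pandemic"
    ],
    "Goal": [
        "This action enables all students of Greek higher education "
        "institutions to have equal and continuous access to the "
        "educational institutions they attend"
    ],
    "Result": [],
}

for sentence, labels in label_sentences(sentences, spans_by_label):
    print(labels, "|", sentence)
```

A sentence with an empty label list is simply a negative example for all three categories, which is useful training data too.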

First, thanks a lot for the support. I think this approach could turn out to be very useful, mostly because it's very simple and it would speed up the creation of the training set a lot (and I guess it would also be much more efficient in terms of memory usage). My only concern is how effective sentence categorization can be for categories where the context of a whole paragraph matters for categorizing a sentence... but I guess I'll do some experiments and see how it performs. Thanks!

Happy to hear it.

I'm actually taking this approach for a personal project myself now and I'm pretty happy with the results. One final comment on this: the ML might get a bunch simpler ... but the annotation too! You can suddenly leverage the textcat interfaces and check one sentence at a time. From a "cognitive burden" perspective it might also be a lot faster to label.
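
On the worry about losing paragraph context: one option is to keep the neighbouring sentences next to each sentence while labelling. The sketch below builds one task per sentence and stores the surrounding sentences under `meta`. It assumes the annotation tool reads the `text` key and displays `meta` alongside it; the helper `make_tasks` and its `window` parameter are hypothetical names, not part of any library.

```python
import json


def make_tasks(sentences, window=1):
    """Build one task dict per sentence, keeping neighbours as context.

    Assumption: the annotation tool reads "text" and shows "meta" next to
    it. `window` is how many sentences of context to keep on each side.
    """
    tasks = []
    for i, sentence in enumerate(sentences):
        before = " ".join(sentences[max(0, i - window):i])
        after = " ".join(sentences[i + 1:i + 1 + window])
        tasks.append(
            {"text": sentence, "meta": {"before": before, "after": after}}
        )
    return tasks


# Three consecutive sentences from the long excerpt earlier in the thread.
sentences = [
    "Great progress has been achieved in reintegrating former combatants, "
    "political prisoners and returnees into social and political life in Aceh.",
    "However, the persistence of conflict-era identities and structures "
    "continues to thwart the full assimilation of some of these individuals "
    "into society.",
    "Levels of violent conflict have dropped dramatically since the signing "
    "of the Helsinki accords.",
]

# Print one JSON object per line (JSONL), ready to save to a file.
for task in make_tasks(sentences):
    print(json.dumps(task))
```

The model would still only see the sentence itself, so this helps the annotator more than the classifier; whether that is enough is something the experiments should show.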

Yep, that's also a big plus. I'm trying the approach right now together with the live suggestions from the OpenAI recipe and, so far, the results are very promising.

Happy to hear it.

Any feedback on the OpenAI recipes?

I'd say it's working quite well and it's very promising as a way to hugely speed up the annotation process. My only suggestion concerns the definition of the labels: in my case, I'm using labels that require some explanation (as in my previous post, where I spelled out the criteria behind each label). I think it would be very useful to be able to frame the query so that it already specifies the criteria behind each label. Maybe this functionality already exists, but I couldn't figure it out on my own.
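
To illustrate what I mean, here is a rough sketch of how the label definitions could be folded into the query text. The function `build_prompt` and the prompt wording are my own assumptions for illustration, not how the recipe actually builds its prompt.

```python
# Label definitions taken from my earlier post.
LABEL_DEFINITIONS = {
    "Challenge": "the description of needs / issues in a certain context, "
                 "over which the project / program should work",
    "Goal": "the desired change that the project / program is trying to achieve",
    "Result": "the effect of the project / program over its implementation",
}


def build_prompt(sentence, definitions):
    """Return a prompt that lists each label with its definition.

    Hypothetical helper: the wording is only an example.
    """
    lines = ["Classify the sentence into zero or more of these categories:"]
    for label, definition in definitions.items():
        lines.append(f"- {label}: {definition}")
    lines.append("")
    lines.append(f"Sentence: {sentence}")
    lines.append("Answer with a comma-separated list of labels, or 'none'.")
    return "\n".join(lines)


print(build_prompt("Investment in Aceh remains minimal.", LABEL_DEFINITIONS))
```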