What is the best way to extract a paragraph or long sentences from a text document?

Hello, I am doing an information extraction task to extract 5 different entities. Out of the 5, 4 are real entities and the 5th one is long-text identification. What is the best way to do this using Prodigy and spaCy? I am trying the usual Prodigy and spaCy NER approach for the first 4 entities, where I am progressing slowly. Now, the 5th one is not actually an entity – it's a paragraph or long-sentence extraction. I can give a simple example. The article info comes from different sites, so the format is not consistent enough to use rule-based extraction.

The word "abstract" before the abstract starts is not always present. Otherwise, I would have taken every sentence after the word "abstract". Also, sometimes the journal information is at the bottom of the text and the conclusion paragraph comes after the abstract information. What is the best way to identify the abstract here? Can I continue with this as an NER task?

Hi! I think highlighting very long spans by hand is definitely inefficient and unnecessarily complicated. And extracting those long spans is also not something you can solve as an NER task.

Maybe you could try framing this as a text classification task and annotate at the sentence or paragraph level? This lets you click through each section and all you have to do is hit accept or reject, depending on whether the text you see is an abstract.
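For reference, a sentence- or paragraph-level annotation session with a single label might look something like the command below. The dataset and file names are placeholders, and you should check `prodigy textcat.manual --help` for the exact signature in your version:

```
# Annotate each chunk with a binary accept/reject decision for ABSTRACT
prodigy textcat.manual abstract_data articles.jsonl --label ABSTRACT
```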

Thanks, I will try it that way.

I'd say that I've had previous success framing this as a text classification task, so I can only second Ines' advice.

Ines Montani I am ready to start this classification task, after successfully completing the NER task of title, journal and date extraction using spaCy. Now I have 2500 articles in which I need to identify the abstracts. As I showed in the examples, each article contains a title, authors, journal information, dates, abstract, objectives etc. At the moment, they are all in 2500 newline-separated text files. As I mentioned, I couldn't apply rules like paragraph length or headings etc., because sometimes the abstract is less than 200 characters long too, so it could be confused with a paragraph combining journal, dates and authors. So I would like to try textcat.manual.

What is the best way to import all 2500 articles into Prodigy? Do I need to combine all of them into one large JSONL file with one JSON record per line? If so, how do I identify each article in Prodigy? Can I put the same meta name on every line belonging to one article?

That's probably the easiest option, yes. You can split them up into logical chunks (paragraphs etc.) and create one record per chunk. If your file gets too big, you could also create multiple files and then annotate them in order – start the server with file 1, then with file 2 etc.
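The conversion step above could be sketched roughly like this. This assumes one plain-text file per article with paragraphs separated by blank lines; the function name, directory layout and meta keys are all placeholders you'd adapt to your data:

```python
import json
from pathlib import Path

def articles_to_jsonl(input_dir, output_file):
    """Write one JSONL record per paragraph, tagging each record with its article."""
    with open(output_file, "w", encoding="utf-8") as out:
        for article_id, path in enumerate(sorted(Path(input_dir).glob("*.txt"))):
            text = path.read_text(encoding="utf-8")
            # Split on blank lines; adjust if your paragraphs use single newlines.
            for para in (p.strip() for p in text.split("\n\n")):
                if para:
                    record = {
                        "text": para,
                        "meta": {"article": path.stem, "article_id": article_id},
                    }
                    out.write(json.dumps(record) + "\n")
```

You could then point textcat.manual at the resulting JSONL file, and the per-article meta will travel along with each annotation.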

If you add custom properties to your JSON, Prodigy will simply pass them through and save them with the annotations. So you can include custom meta like the ID etc. For example:

{"text": "...", "internal_id": 1234}

Anything you put in the "meta" dict will be displayed in the bottom right corner of the annotation card – so you could use that to store meta information you want to see during annotation. For example:

{"text": "...", "meta": {"internal_id": 1234}}

Thanks. Do I need to create 2 labels, ABSTRACT and OTHER, or just one label called ABSTRACT?

You don't need to create 2 labels – you can just have ABSTRACT and then treat everything with a low score as OTHER. Even if you decide to train with two labels later on, you can convert the data automatically (everything that wasn't accepted for ABSTRACT gets the label OTHER). There's no need to worry about this during annotation and make things more complicated. Annotating it as a binary yes/no decision will be much faster.
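The automatic conversion described above could look something like this minimal sketch. It assumes the exported annotations carry an "answer" key set to "accept" or "reject", which matches Prodigy's export format as I understand it; the function name is a placeholder:

```python
def binary_to_two_labels(examples):
    """Convert binary ABSTRACT accept/reject annotations into two-label data."""
    converted = []
    for eg in examples:
        # Everything that wasn't accepted for ABSTRACT gets the label OTHER.
        label = "ABSTRACT" if eg.get("answer") == "accept" else "OTHER"
        converted.append({"text": eg["text"], "label": label})
    return converted
```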

Thank you so much. I will do that and let you know how it goes.
