What is the best way to extract paragraphs or long sentences from a text document?

Hello, I am working on an information extraction task to extract 5 different entities. Four of them are real entities and the fifth is long-text identification. What is the best way to do this using Prodigy and spaCy? I am using the usual Prodigy and spaCy NER workflow for the first 4 entities, where I am progressing slowly. The fifth one is not actually an entity; it is the extraction of a paragraph or several long sentences. I can give a simple example. The articles come from different sites, so the format is not consistent enough for rule-based extraction.

The word "abstract" is not always present before the abstract starts; otherwise I would have taken every sentence after the word "abstract". Also, sometimes the journal information is at the bottom of the text, with a conclusion paragraph after the abstract. What is the best way to identify the abstract here? Can I continue treating this as an NER task?

Hi! I think highlighting very long spans by hand is definitely inefficient and unnecessarily complicated. And extracting those long spans is also not something you can solve as an NER task.

Maybe you could try framing this as a text classification task and annotate at the sentence or paragraph level? This lets you click through each section and all you have to do is hit accept or reject, depending on whether the text you see is an abstract.

Thanks, I will try that.

I've had previous success framing this as a text classification task, so I can only further recommend Ines' advice.

Ines Montani, I am ready to start this classification task after successfully completing the NER task (title, journal and date extraction) using spaCy. I now have 2500 articles in which I need to identify abstracts. As I showed in the examples, each article contains a title, authors, journal information, dates, abstract, objectives, etc. At the moment, they are stored as 2500 newline-separated text files. As I mentioned, I couldn't apply rules like paragraph length or headings, because sometimes the abstract is shorter than 200 characters, so a paragraph of that length might just as well be a combination of journal, dates and authors. So I would like to try textcat.manual.

What is the best way to import all 2500 articles into Prodigy? Do I need to combine them into one large JSONL file with one JSON object per line? If so, how do I identify each article in Prodigy? Can I use the same meta name for every line of one article?

That's probably the easiest option, yes. You can split them up into logical chunks (paragraphs etc.) and create one record per chunk. If your file gets too big, you could also create multiple files and then annotate them in order – start the server with file 1, then with file 2 etc.

If you add custom properties to your JSON, Prodigy will simply pass them through and save them with the annotations. So you can include custom meta like the ID etc. For example:

{"text": "...", "internal_id": 1234}

Anything you put in the "meta" dict will be displayed in the bottom right corner of the annotation card – so you could use that to store meta information you want to see during annotation. For example:

{"text": "...", "meta": {"internal_id": 1234}}
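
A minimal sketch of the "one record per chunk" idea, assuming each article is plain text with one logical chunk per line. The function names and the "article_id" and "chunk" keys are hypothetical and chosen only for illustration; "text" and "meta" are the keys discussed in this thread.

```python
import json


def article_to_records(article_id, article_text):
    """Build one record per non-empty line of an article."""
    # Drop blank lines and surrounding whitespace before numbering the chunks.
    chunks = [line.strip() for line in article_text.splitlines() if line.strip()]
    return [
        # Hypothetical meta keys: they tie every chunk back to its article.
        {"text": chunk, "meta": {"article_id": article_id, "chunk": index}}
        for index, chunk in enumerate(chunks)
    ]


def records_to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"
```

Calling article_to_records once per article and writing the combined output to a single .jsonl file gives every line the same article ID in its meta, so chunks from one article can be grouped again after annotation.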

Thanks. Do I need to create two labels, ABSTRACT and OTHER, or just one label called ABSTRACT?

You don't need to create 2 labels – you can just have ABSTRACT and then treat everything with a low score as OTHER. Even if you decide to train with two labels later on, you can convert the data automatically (everything that wasn't accepted for ABSTRACT gets the label OTHER). There's no need to worry about this during annotation and make things more complicated. Annotating it as a binary yes/no decision will be much faster.
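
As a rough sketch of that later conversion, assuming each exported annotation is a dict with a "text" key and an "answer" key set to "accept", "reject" or "ignore". The function name to_two_labels is hypothetical, and the (text, {"cats": ...}) output shape is an assumption about the training format you would feed to spaCy, so adjust it to whatever your training code expects.

```python
def to_two_labels(examples, label="ABSTRACT", other="OTHER"):
    """Turn binary accept/reject answers into two mutually exclusive categories."""
    converted = []
    for eg in examples:
        if eg.get("answer") not in ("accept", "reject"):
            continue  # skip ignored or unanswered examples
        is_abstract = eg["answer"] == "accept"
        # Everything that was not accepted for ABSTRACT becomes OTHER.
        cats = {label: float(is_abstract), other: float(not is_abstract)}
        converted.append((eg["text"], {"cats": cats}))
    return converted
```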

Thank you so much. I will do that and let you know how it goes.

Hello @ines, I manually classified 5300 sentences and trained a model.
18:54:41: INIT: Setting all logging levels to 20
18:54:43: RECIPE: Calling recipe 'train'
18:54:43: RECIPE: Starting recipe train
18:54:43: DB: Initializing database SQLite
18:54:43: DB: Connecting to database SQLite
✔ Loaded model 'en_vectors_web_lg'
18:54:59: DB: Loading dataset 'abstract_16_02_2020' (5310 examples)
Created and merged data for 5310 total examples
Using 4248 train / 1062 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 0.347
=========================== ✨ Training the model

Loss F-Score

1 134.02 0.920
2 0.87 0.933
3 0.14 0.934
4 0.05 0.933
5 0.04 0.932
6 0.03 0.934
7 0.03 0.933
8 0.03 0.933
9 0.03 0.931
10 0.02 0.928
============================= ✨ Results summary

abstract 0.934

Best ROC AUC 0.934
Baseline 0.347.

Below is the output of some samples from validation set:
{'abstract': 0.9658493399620056}
{'abstract': 0.8887423276901245}
{'abstract': 0.06501883268356323}
{'abstract': 0.764168918132782}
{'abstract': 0.017291178926825523}
{'abstract': 0.038891710340976715}
{'abstract': 0.982439398765564}
{'abstract': 0.05525602400302887}
{'abstract': 0.03555752709507942}

How do I set the threshold? Do I need to manually check each example from the validation set and confirm?

Also, to create more annotations, I used textcat.teach with the existing model:
prodigy textcat.teach abstract_18_02_2020 abstract_Model_vectors abstract_new_dataset.jsonl --label abstract

How do I train the model again after textcat.teach? The train command takes a dataset rather than an existing model. Sorry if these questions do not make sense.

What exactly do you mean by threshold? You can read more about the evaluation metric here, btw: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
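
If "threshold" means the cutoff score above which a sentence counts as an abstract, one possible approach is to compare a few candidate cutoffs on held-out examples with known answers. This is only a sketch: the function names are hypothetical, and the scores and gold labels in the usage note are made-up toy values, not results from this thread.

```python
def precision_recall_f1(scores, gold, threshold):
    """Score a cutoff: predict 'abstract' when score >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scores, gold, candidates):
    """Return the candidate cutoff with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, gold, t)[2])
```

For example, with toy scores [0.9, 0.8, 0.4, 0.1] and gold answers [True, True, False, False], best_threshold(scores, gold, [0.3, 0.5, 0.95]) picks 0.5. Whether to optimize F1, precision or recall depends on which kind of mistake is more costly for your application.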

The train command takes the name of the component you want to train, the dataset(s) to train from, and a loadable spaCy model – that can either be a blank base model, or the path to the pretrained model you want to update.

Thanks. I noticed that while running textcat.teach I only get about a quarter of my dataset as suggestions, in random order, with the score shown at the bottom. Is that normal, or is something wrong?

My dataset contains 6000 sentences but textcat.teach only showed 1790 examples in that session.

Also, the overall F-score went down from 93% to 92% after I retrained using the existing model and the new dataset.

Created and merged data for 1790 total examples
Using 1432 train / 358 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 0.921

Loss F-Score

1 84.86 0.921
2 2.78 0.929
3 0.50 0.929
4 0.14 0.928
5 0.07 0.928
6 0.02 0.925
7 0.01 0.925
8 0.01 0.926
9 0.00 0.927
10 0.00 0.929


abstract 0.929

Best ROC AUC 0.929
Baseline 0.921

Sorry for asking again. Is there anything I am doing wrong, or does textcat.teach only show the sentences whose scores it is unsure about?

Ah, sorry, I think I missed your previous comment. And yes, the purpose of active learning-powered recipes like textcat.teach is to select the most relevant examples – this will always include skipping the less relevant examples. If you want to annotate each example in your dataset, you can use the textcat.manual workflow.
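
To illustrate the general idea of uncertainty-based selection, here is a simplified sketch. It is not Prodigy's actual implementation, which works on a stream and updates the model as you annotate; the function name select_uncertain and the band parameter are hypothetical.

```python
def select_uncertain(scored_examples, band=0.25):
    """Keep (score, text) pairs whose score is within `band` of 0.5,
    ordered from most to least uncertain."""
    kept = [
        (score, text)
        for score, text in scored_examples
        if abs(score - 0.5) <= band
    ]
    # Scores closest to 0.5 are the ones the model is least sure about.
    return sorted(kept, key=lambda item: abs(item[0] - 0.5))
```

With this kind of filter, examples the model already scores very close to 0 or 1 are skipped, which is why a session can show far fewer examples than the source file contains.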

prodigy textcat.teach abstract_18_02_2020 abstract_Model_vectors abstract_new_dataset.jsonl --label abstract

python -m prodigy train textcat abstract_18_02_2020 abstract_Model_vectors --output abstract_final_Model --n-iter 10 --eval-split 0.2 --dropout 0.2

abstract_final_Model will be the combination of both "abstract_18_02_2020" and "abstract_Model_vectors", right? In that case, I added about 1700 more training examples, so I would expect the accuracy to stay the same or go up, not down.
PS: In the combined dataset, 2889 examples are accept and 4218 are reject, so reject examples dominate.

Hi friend. It's very nice to see you found such a good solution to your question. May I ask you a question? How do I import my 'file.csv' (which contains 90,000 abstracts extracted from scientific papers) into Prodigy for entity annotation? I hope to hear from you, friend. Thanks in advance.

I have XML data for each article. I obtained relevant information from XML then created a JSON object for each scientific paper and loaded into Prodigy.

Hi friend, thanks for your response. I have a couple of questions:
Q1: If you create a JSON file for one scientific paper, doesn't that mean you can annotate just one paper at a time? After finishing it, you load another paper into Prodigy and continue annotating, right?
Q2: All of my 90,000 abstracts are in file.csv, one abstract per line, and I have no idea how to load file.csv into Prodigy. Could you offer some recommendations or show me some examples?
Q3: Each abstract contains almost 300 words. Can text this long be loaded into Prodigy?
I am new to Prodigy, so I am sorry to bug you again. Many thanks, and I look forward to hearing from you soon.

Q1. No. We created a single JSONL file with all the papers.
Q2. You can load a CSV file into Prodigy, but I created my own JSONL file. Here is a simple template of the code. Please adjust it however you want.
import csv
import jsonlines  # third-party package: pip install jsonlines

def create_prodigy_file():
    # Append one JSON record per CSV row to abstracts.jsonl
    with open('abstracts.csv', newline='') as csv_file, \
            jsonlines.open('abstracts.jsonl', 'a') as jsonl_writer:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for i, row in enumerate(csv_reader):
            # The abstract text is assumed to be in the first column
            jsonl_writer.write({'text': row[0], 'meta': {'source': f'abstract{i}'}})
Q3. No need to worry about the 300-word texts.

I successfully reframed that task as a classification problem and got 95% accuracy using spacy train.