Creating Corpus from multiple large text files

Grantulla · April 5, 2022, 1:44pm

Hello!

Apologies if this is covered in the documentation, but I am new to this and have read through it multiple times and can't figure out the best approach to this problem. I have 1-2k multi-paragraph txt files, not annotated. Is it possible to merge them all into one data source and tokenize by sentence, so that I can use Prodigy to label each sentence as useful or not (each instance being a single sentence from the data source)?

Thank you,
-Grant

ines · April 9, 2022, 10:34am

Hi! If your goal is to annotate at the sentence or paragraph level, one option could be to preprocess your texts and split them up before feeding them to Prodigy. So for each unit you want to annotate, you'd create one record in your JSON(L) file with "text": "...". The nice thing about the preprocessing step is that you could do this as a single job and process your files in parallel, so it'll be a lot faster.
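For illustration, here's a rough sketch of what that preprocessing step could look like using only the standard library. The function name `build_jsonl`, the `split_fn` argument and the directory layout (a folder of `*.txt` files) are assumptions for this example, not part of Prodigy. The optional `"meta"` entry just records which file each unit came from.

```python
import json
from pathlib import Path


def build_jsonl(input_dir, output_path, split_fn):
    """Write one {"text": ...} record per unit that split_fn returns.

    input_dir, output_path and split_fn are hypothetical names for this
    sketch. split_fn takes the raw text of one file and returns an
    iterable of strings (sentences, paragraphs, ...).
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # sorted() keeps the output order stable between runs
        for path in sorted(Path(input_dir).glob("*.txt")):
            raw = path.read_text(encoding="utf-8")
            for unit in split_fn(raw):
                unit = unit.strip()
                if not unit:
                    continue  # skip empty units
                record = {"text": unit, "meta": {"source": path.name}}
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count
```

You could call it as `build_jsonl("texts/", "corpus.jsonl", my_splitter)` and then load the resulting file as the source in Prodigy. Since every file is handled independently, the loop over files is also the natural place to parallelize if the single-process version turns out to be too slow.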

If you want to split by paragraphs and your text is nicely segmented with two newlines between paragraphs, you could use a simple regex and/or just Python for it. If you want to split by sentences, you could use an existing spaCy model and iterate over doc.sents to extract the individual sentences.
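Here's a minimal sketch of both options. The helper names `split_paragraphs` and `split_sentences` are made up for this example. To keep it runnable without downloading anything, the sentence splitter uses a blank English pipeline with spaCy's rule-based `sentencizer` component rather than a trained model, which is an assumption on my side. If you have a trained pipeline installed (for instance `en_core_web_sm`), you could swap in `spacy.load(...)` instead and would likely get better sentence boundaries.

```python
import re


def split_paragraphs(text):
    """Split on blank lines (two newlines, optionally with whitespace between)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(texts):
    """Yield one sentence string at a time from an iterable of texts.

    Assumption: uses a blank English pipeline plus the rule-based
    sentencizer instead of a trained model.
    """
    import spacy  # imported here so split_paragraphs works without spaCy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    # nlp.pipe processes the texts as a stream, which is faster than
    # calling nlp(text) on each one separately
    for doc in nlp.pipe(texts):
        for sent in doc.sents:
            sentence = sent.text.strip()
            if sentence:
                yield sentence
```

Each string these helpers produce would then become the `"text"` value of one record in your JSONL file.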
