Creating Corpus from multiple large text files


Apologies if this is covered in the documentation, but I am new to this and have read through it multiple times without figuring out the best approach to this problem. I have 1–2k multi-paragraph .txt files, none annotated. Is it possible to merge them all into one data source and tokenize by sentence, so that using Prodigy I can label each sentence as useful or not (each instance being a single sentence from the data source)?

Thank you,

Hi! If your goal is to annotate at the sentence or paragraph level, one option could be to preprocess your texts and split them up before feeding them to Prodigy. So for each unit you want to annotate, you'd create one record in your JSON(L) file with "text": "...". The nice thing about the preprocessing step is that you could do this as a single job and process your files in parallel, so it'll be a lot faster.
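For example, the preprocessing step could look something like this — a minimal sketch (the function name and the `meta` field are just illustrative choices, not anything Prodigy requires) that merges a directory of .txt files into one JSONL file with one `"text"` record per file:

```python
import json
from pathlib import Path

def texts_to_jsonl(input_dir, output_path):
    # Write one JSONL record per .txt file: {"text": "...", "meta": {...}}
    # The "meta" entry is optional but handy for tracing examples back
    # to their source file later.
    with open(output_path, "w", encoding="utf-8") as out:
        for txt_file in sorted(Path(input_dir).glob("*.txt")):
            record = {
                "text": txt_file.read_text(encoding="utf-8"),
                "meta": {"source": txt_file.name},
            }
            out.write(json.dumps(record) + "\n")
```

If you split each file into sentences or paragraphs first, you'd just write one record per segment instead of one per file.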

If you want to split by paragraphs and your text is nicely segmented with two newlines between paragraphs, you could use a simple regex or just plain Python for it. If you want to split by sentences, you could use an existing spaCy model and the doc.sents attribute to extract the individual sentences.
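Both approaches could be sketched like this — the paragraph splitter is pure Python, and the sentence splitter assumes you pass in a loaded spaCy pipeline (e.g. from `spacy.load("en_core_web_sm")`; the function names here are just illustrative):

```python
import re

def split_paragraphs(text):
    # Split on blank lines (a newline, optional whitespace, another newline)
    # and drop any empty chunks left over.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text, nlp):
    # nlp is a loaded spaCy pipeline with a sentence boundary component,
    # e.g. nlp = spacy.load("en_core_web_sm")
    return [sent.text.strip() for sent in nlp(text).sents]
```

Each returned string would then become one `{"text": "..."}` record in your JSONL file.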