Hi! If you're loading a plain text file into Prodigy, it will be read in line by line – otherwise, there wouldn't really be a way for the loader to know how the text should be segmented. That's also why we usually recommend working with a more structured format like JSON or JSONL, which lets you control how the text should be segmented, and include different amounts of newlines wherever you need them. (Some recipes will segment sentences by default, but you can disable this by setting the
In general, there's no limitation built into spaCy or Prodigy in terms of what you can read in (sentences, paragraphs, longer documents). Paragraphs are usually a good limit to work with, because they're quick to read and you can easily process them in batches.
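As a rough sketch of the JSONL route, here's one way to pre-segment a plain text file into paragraphs and write one record per line. It assumes paragraphs are separated by blank lines and that each record only needs a `"text"` key. The helper names `split_paragraphs` and `write_jsonl` are made up for this example and aren't part of Prodigy.

```python
import json
import re
from pathlib import Path


def split_paragraphs(raw_text):
    """Split raw text on blank lines (assumed paragraph separator), dropping empty chunks."""
    chunks = re.split(r"\n\s*\n", raw_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def write_jsonl(paragraphs, out_path):
    """Write one JSON object per line, each with a "text" key."""
    with Path(out_path).open("w", encoding="utf-8") as f:
        for para in paragraphs:
            # json.dumps escapes newlines inside a paragraph, so each
            # record stays on a single line of the output file.
            f.write(json.dumps({"text": para}, ensure_ascii=False) + "\n")
```

You'd then call something like `write_jsonl(split_paragraphs(Path("input.txt").read_text(encoding="utf-8")), "input.jsonl")` and load the resulting file instead of the raw text. Single newlines inside a paragraph are preserved, so you stay in control of where the segment boundaries are.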
This kinda depends on what your end goal is. If your goal is to scrape text from websites, you usually want to do the scraping as a preprocessing step and make sure you can extract the text cleanly and reliably before you start the annotation process. Otherwise, you'll have to re-annotate whenever you tweak your scraping logic, which is pretty inconvenient.
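To illustrate what "extract the text first" could look like, here's a minimal sketch that uses only Python's standard-library `html.parser` to pull the visible text out of an HTML string. The `TextExtractor` class and `html_to_text` function are hypothetical names for this example, and the logic is deliberately crude: it skips `script` and `style` content and joins everything else with spaces. A real scraper will likely need site-specific rules.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def get_text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

The idea is to run this kind of extraction once, check the output by eye, and only then save the cleaned text for annotation. That way, changes to the scraping logic happen before you've spent any time labelling.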