Loading a text file

dsr2021 · July 3, 2021, 12:06pm

I'm new to using Prodigy. I'm trying to annotate a simple text file using the en_core_web_lg model. However, in the web UI, I see only the title of the article. The file's format is: title, blank line, then paragraphs separated by blank lines. (I believe spaCy 3, the underlying framework for Prodigy, does recognize paragraphs. Am I right?) How do I get Prodigy to load the full file?

The next thing I'd like to do is ask Prodigy to load a webpage. How do I do that?

ines · July 3, 2021, 12:48pm

Hi! If you're loading a plain text file into Prodigy, it will be read in line by line – otherwise, there wouldn't really be a way for the loader to know how the text should be segmented. That's also why we usually recommend working with a more structured format like JSON or JSONL, which lets you control how the text should be segmented, and include different amounts of newlines wherever you need them. (Some recipes will segment sentences by default, but you can disable this by setting the --unsegmented flag.)
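For the file layout described in the question (title, blank line, paragraphs separated by blank lines), here is a minimal sketch of the JSONL route. It splits the text on blank lines and writes one record per paragraph. The function names and the file names used in the note below are made up for illustration. The only assumption about Prodigy is that each JSONL line carries its text under a "text" key.

```python
import json
import re
from pathlib import Path


def split_paragraphs(raw_text):
    """Split text on blank lines and return the non-empty paragraphs."""
    # One or more blank (or whitespace-only) lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", raw_text.strip())
    # Collapse line breaks inside a paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def text_to_jsonl(input_path, output_path):
    """Write one JSON object per paragraph and return the paragraph count."""
    raw_text = Path(input_path).read_text(encoding="utf-8")
    paragraphs = split_paragraphs(raw_text)
    with open(output_path, "w", encoding="utf-8") as f:
        for paragraph in paragraphs:
            # Each line of the output file is a standalone JSON object.
            f.write(json.dumps({"text": paragraph}) + "\n")
    return len(paragraphs)
```

Calling text_to_jsonl("article.txt", "article.jsonl") (hypothetical file names) would produce a file where the title and each paragraph are separate examples. For recipes that split sentences by default, the --unsegmented flag mentioned above should keep each paragraph whole.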

In general, there's no limitation built into spaCy or Prodigy in terms of what you can read in (sentences, paragraphs, longer documents). Paragraphs are usually a good limit to work with, because they're quick to read and you can easily process them in batches.

As for loading a webpage: this kinda depends on what your end goal is. If your goal is to scrape text from websites, you usually want to do the scraping as a pre-process and make sure you can extract the text cleanly and reliably before you start the annotation process. Otherwise, you'll have to re-annotate whenever you tweak your scraping logic, which is pretty inconvenient.
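To illustrate what scraping as a separate pre-processing step could look like, here is a rough sketch using only the Python standard library. It assumes the page's HTML has already been downloaded into a string and that the text you care about sits in <p> elements, which real pages often violate, so treat it as a starting point. The class and function names are hypothetical, and storing the URL under a "meta" key is an assumption about how you might want to track the source.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Start a fresh buffer for this paragraph.
            self._inside_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            # Join the pieces and normalize whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._inside_p = False

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <a>) is kept as well.
        if self._inside_p:
            self._chunks.append(data)


def html_to_records(html, source_url):
    """Return one {"text": ..., "meta": ...} dict per non-empty paragraph."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [{"text": p, "meta": {"source": source_url}} for p in parser.paragraphs]
```

The records can be written out as JSONL once and inspected before any annotation starts, so a later change to the extraction logic does not invalidate annotations you have already collected.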

dsr2021 · July 3, 2021, 1:55pm

Doesn't spaCy use the fact that articles are structured in paragraphs? For example, if paragraph 23 said "The structure described in paragraph 4 is...", then it would be difficult to coref without preserving the structure. I was under the impression spaCy broke up text documents into lists of paragraphs containing lists of sentences containing lists of tokens? Maybe I'm wrong.

dsr2021 · July 3, 2021, 6:45pm

OK, it was my mistake. I notice the UI is feeding me one non-blank line at a time.

ines · July 5, 2021, 12:51am

Cool, glad you got it working!

Under the hood in spaCy, sentences (doc.sents) are just different views of the doc, just like named entity spans etc. This information is also accessible on the individual tokens, i.e. each token exposes attributes such as is_sent_start and ent_type_. There's no built-in definition of paragraphs, and how you structure your Doc objects is up to you. We typically recommend using a reasonable unit of text, which can be paragraphs or sections (there's not really an advantage in making your Doc the entire document, and smaller chunks are often easier to work with and process).
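A small sketch of that idea, assuming spaCy v3 is installed: each paragraph becomes its own Doc, and sentence boundaries are read both from doc.sents and from the token-level is_sent_start attribute. It uses a blank English pipeline with the rule-based sentencizer instead of en_core_web_lg so that no model download is needed. The sample paragraphs are made up.

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer, so no trained
# model has to be downloaded for this sketch.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Made-up sample data: the paragraph structure lives in this list,
# not inside spaCy.
paragraphs = [
    "This is the first paragraph. It has two sentences.",
    "This is the second paragraph.",
]

# One Doc per paragraph.
docs = list(nlp.pipe(paragraphs))

for index, doc in enumerate(docs):
    # Sentence spans are a view over the Doc's tokens.
    sentences = [sent.text for sent in doc.sents]
    # The same boundaries are visible on the individual tokens.
    starts = [token.text for token in doc if token.is_sent_start]
    print(index, sentences, starts)
```

If you need to know which paragraph a Doc came from, for example to resolve a reference like "paragraph 4", keeping the list index alongside each Doc is one simple way to preserve that structure yourself.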
