I'm new to Prodigy. I'm trying to annotate a simple text file using the en_core_web_lg model. However, in the web UI, I only see the title of the article. The format of the file is: title, blank line, then paragraphs separated by blank lines. (I believe spaCy 3, the underlying framework for Prodigy, does recognize paragraphs. Am I right?) How do I get Prodigy to load the full file?
The next thing I'd like to do is ask Prodigy to load a webpage. How do I do that?
Hi! If you're loading a plain text file into Prodigy, it will be read in line by line – otherwise, there wouldn't really be a way for the loader to know how the text should be segmented. That's also why we usually recommend working with a more structured format like JSON or JSONL, which lets you control how the text should be segmented, and include different amounts of newlines wherever you need them. (Some recipes will segment sentences by default, but you can disable this by setting the --unsegmented flag.)
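To make the JSONL suggestion concrete, here is a minimal sketch of converting a file laid out like yours (title, blank line, paragraphs separated by blank lines) into one JSON object per paragraph. The function name, the "article.txt" default and the contents of "meta" are just illustrative assumptions; the only thing the loader really needs is the "text" key on each line.

```python
import json


def paragraphs_to_jsonl(text: str, source: str = "article.txt") -> str:
    """Split text on blank lines and return one JSON line per paragraph.

    `source` is a hypothetical label that is only stored in the metadata.
    """
    blocks = []
    current = []
    for raw in text.splitlines():
        if raw.strip():
            # Non-empty line: it belongs to the paragraph being collected.
            current.append(raw.strip())
        elif current:
            # Blank line: the current paragraph is complete.
            blocks.append(" ".join(current))
            current = []
    if current:
        # The file did not end with a blank line.
        blocks.append(" ".join(current))

    lines = []
    for i, block in enumerate(blocks):
        # Keep the paragraph index so the original order can be recovered.
        record = {"text": block, "meta": {"source": source, "paragraph": i}}
        lines.append(json.dumps(record))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "My title\n\nFirst paragraph,\nwrapped over two lines.\n\nSecond paragraph.\n"
    print(paragraphs_to_jsonl(sample))
```

You'd write the returned string to a .jsonl file and pass that file to your recipe as the source instead of the .txt file. If the recipe you're using segments sentences by default, combine it with --unsegmented so each paragraph stays in one piece.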
In general, there's no limitation built into spaCy or Prodigy in terms of what you can read in (sentences, paragraphs, longer documents). Paragraphs are usually a good limit to work with, because they're quick to read and you can easily process them in batches.
As for loading a webpage: this kinda depends on what your end goal is. If your goal is to scrape text from websites, you usually want to do the scraping as a pre-processing step and make sure you can extract the text cleanly and reliably before you start annotating. Otherwise, you'll have to re-annotate whenever you tweak your scraping logic, which is pretty inconvenient.
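As a rough illustration of what such a pre-processing step could look like, here is a sketch that pulls the text of every <p> element out of an HTML string using only the Python standard library. The class and function names are made up for this example, and real pages usually need more careful handling (navigation, comments, boilerplate), so treat it as a starting point rather than a recommended scraper.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element and ignore everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_p = False
        self._buffer: List[str] = []

    def _flush(self) -> None:
        # Normalize whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> also ends a previous one that was never closed.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> List[str]:
    """Return the cleaned text of all <p> elements in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing paragraph without a closing tag
    return parser.paragraphs


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>First <b>bold</b> text.</p><p>Second.</p></body></html>"
    for paragraph in extract_paragraphs(page):
        print(paragraph)
```

Once you're happy with the extracted paragraphs, you can save them in a structured format like JSONL and only then start annotating, so changes to the scraping logic don't invalidate annotations you've already collected.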
Doesn't spaCy use the fact that articles are structured in paragraphs? For example, if paragraph 23 said "The structure described in paragraph 4 is...", then it would be difficult to coref without preserving the structure. I was under the impression spaCy broke up text documents into lists of paragraphs containing lists of sentences containing lists of tokens? Maybe I'm wrong.
Under the hood in spaCy, sentences (doc.sents) are just different views of the Doc, just like named entity spans etc. This information is also accessible on the individual tokens, e.g. each token exposes is_sent_start and ent_type_. There's no built-in definition of paragraphs, and how you structure your Doc objects is up to you – we typically recommend using a reasonable unit of text, which can be paragraphs or sections (there's not really an advantage in making your Doc the entire document, and smaller chunks are often easier to work with and process).
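Here's a small sketch of what that looks like in practice, with one Doc per paragraph. It assumes spaCy 3 is installed and uses a blank English pipeline with the rule-based sentencizer so it runs without downloading a model; with en_core_web_lg the sentence boundaries would come from the trained pipeline instead. The example paragraphs are made up.

```python
import spacy

# Blank pipeline plus rule-based sentence splitting (no model download needed).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# One string per paragraph: the paragraph structure lives in your own list,
# not inside spaCy.
paragraphs = [
    "This is the first paragraph. It has two sentences.",
    "This is the second paragraph.",
]

# One Doc per paragraph.
docs = list(nlp.pipe(paragraphs))

for index, doc in enumerate(docs):
    # doc.sents is a view over the tokens of this Doc.
    for sent in doc.sents:
        print(index, sent.text)

# The same boundaries are visible on the tokens themselves.
sentence_starts = [[token.text for token in doc if token.is_sent_start] for doc in docs]
print(sentence_starts)
```

If you need to know that a sentence came from "paragraph 4", you'd keep track of that yourself, for instance via the position in the list as above or via metadata attached to each example.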