I haven't used Prodigy, but I'm considering buying it to annotate NER, textcat and coref for a variety of wiki pages as well as other (non-scientific) online publications. I want to know how much I get out of the box. For example, will I get all the corefs annotated based on Wikidata or some such? Will tables be recognized as tables and treated as a structure, including captions and subheadings? In particular, how will table rows be handled when shuffling sentences for training? Are the positions of images/frames in the document preserved while shuffling? Can one annotate images in videos? Does one have to write adapters to extract data from different online publications, or does Prodigy come with a range of adapters that work well with all publications?
Hi! You can find the detailed documentation of Prodigy here, which should give you an overview of the built-in workflows and interfaces, as well as the Python API: https://prodi.gy/docs
To answer the more specific questions:
This is definitely something you can do, and how you set it up depends on the types of coreference relations you're looking to extract and the model you're looking to train. The built-in `coref.manual` workflow includes pre-defined rules that take advantage of predicted part-of-speech tags to let you focus on proper nouns, nouns and pronouns: https://prodi.gy/docs/recipes#coref
You can also use the more general-purpose `rel.manual` workflow to annotate relationships and define your own rules for spans to merge and tokens to exclude (tokens you know will never be part of a coref relationship): https://prodi.gy/docs/recipes#rel-manual
spaCy currently doesn't have a built-in coreference resolution component, so you'd have to select a plugin or model implementation you want to train – this may also impact how you choose to define the task and label scheme.
These questions really come down to how you frame the problem from a machine learning perspective. Ultimately, if you train an NLP model like a coreference resolution component or a named entity recognizer, what it will get to see at runtime and during training is plain text – so that's typically also what you want to be annotating. For models that make predictions based on the structure of a sentence (NER, coref, tagging, parsing etc.), you also want the input to be real sentences – otherwise, there's not really anything to learn. (For some additional background, also see this comment on the problems with including markup in a regular named entity recognition task.)
You can include placeholders for any other elements like images or frames, and those will be preserved in the data – but it's unclear how useful this will be, because a model trained on plain text can't learn much from a bare placeholder. It could potentially be helpful for a text classifier if you include the information that a paragraph has an image, or include the image alt text with it, so the document label can take that into account.
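As a sketch of what that could look like: the `[IMAGE: ...]` marker and the `paragraph_to_task` helper below are just an illustrative convention I made up, not a built-in Prodigy feature – any placeholder scheme your pipeline agrees on would work the same way.

```python
# Sketch: replace non-text elements with inline placeholders before
# creating annotation tasks. The [IMAGE: ...] convention is hypothetical.
def paragraph_to_task(paragraph):
    """Turn a parsed paragraph into a Prodigy-style task dict."""
    text = paragraph["text"]
    if paragraph.get("image_alt"):
        # Keep the alt text so e.g. a text classifier can use it.
        text += " [IMAGE: {}]".format(paragraph["image_alt"])
    return {"text": text, "meta": {"source": paragraph.get("source", "")}}

task = paragraph_to_task(
    {"text": "The rover landed in 2021.", "image_alt": "rover on Mars"}
)
print(task["text"])  # The rover landed in 2021. [IMAGE: rover on Mars]
```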
If your tables contain longer text, one option could be to just convert them to plain text. There are some approaches for incorporating information like formatting as features of the model, but this will require some experimentation and a custom model implementation.
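For the "convert them to plain text" route, here's one possible sketch using only the standard library – the one-row-per-line, pipe-separated output format is just an arbitrary choice, not something Prodigy requires:

```python
# Sketch: flatten an HTML table into plain text, one row per line,
# so the cell text can be annotated like any other text.
from html.parser import HTMLParser

class TableFlattener(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._cells, self._buf = [], [], []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell, self._buf = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cells.append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "tr":
            self.rows.append(" | ".join(self._cells))
            self._cells = []

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

parser = TableFlattener()
parser.feed("<table><tr><th>Name</th><th>Year</th></tr>"
            "<tr><td>Voyager</td><td>1977</td></tr></table>")
print("\n".join(parser.rows))
# Name | Year
# Voyager | 1977
```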
Shuffling will only happen during training and only for the purpose of improving accuracy and preventing the model from memorising the training data. So if the example is self-contained, e.g. a paragraph, this should be no problem. One thing to keep in mind is the context windows of the models you're working with: for example, an entity recognizer typically has a fairly narrow context window of a few tokens on either side which it will take into account.
If your goal is to annotate videos for a task like object tracking, you'd often be working with the individual frames of the video (or a representative selection of frames). This is something you can do using a workflow like `image.manual`, and you could even have a script that extracts the relevant frames programmatically. This thread discusses some ideas and approaches for this type of workflow.
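To give an idea of the "extract frames programmatically" part: a small helper could first decide which frame indices to keep (e.g. one frame per second), and the actual decoding would then typically use a library like OpenCV. The function name and defaults below are just placeholders:

```python
def frame_indices(total_frames, fps, every_n_seconds=1.0):
    """Indices of the frames to extract, sampling one every N seconds."""
    step = max(1, int(fps * every_n_seconds))
    return list(range(0, total_frames, step))

# e.g. a 10-second clip at 30 fps, one frame per second:
print(frame_indices(300, 30))
# [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
```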
Prodigy comes with different loaders for the most common file types, and also lets you implement your own using simple Python scripts: https://prodi.gy/docs/api-loaders
When it comes to scraping and pre-processing, this is typically something you want to implement specifically for the data you're working with: different sites and publications can differ a lot, so it's hard to have one general-purpose solution. You might end up needing a slightly different script for each publication, and keep tweaking it to handle special cases.

You probably also want to do this as a separate preprocessing step instead of including it in your annotation workflow: if you're scraping large volumes of text, it's more efficient to run it as a separate process on a more powerful machine, possibly using a cluster with multiple workers in parallel. The data you end up annotating later on should ideally be the final preprocessed result that you're happy with, so you don't have to re-annotate and re-train whenever you make improvements to your preprocessing logic.