I'm embarking on a project focused on tagging relationships between entities to annotate the structure of rhetorical texts. The goal is to identify and analyze structures such as claims, rhetorical questions, facts, opinions, declines of hypothetical claims, regard for other speakers, and more. While I believe I have a schema that suits my needs, I have a few queries regarding the post-annotation phase:
Querying the Tagged Dataset: How can I query the tagged dataset once the annotation is complete? Are there built-in tools within Prodigy, or would I need to export the data and use external tools to analyze the annotations?
Examples: Are there any exemplary projects or datasets that demonstrate the usability of relationship annotations and querying processes, especially in the context of rhetorical text structures? Any references would be greatly appreciated.
Analysis of Structures and Sequences: I'm interested in analyzing the frequency of certain structures and sequences in the annotated data. Is there a recommended approach for this within Prodigy, or should I consider external tools or scripts?
Effort Estimation: Given my objective to identify specific rhetorical patterns based on the annotations, how much effort would be required to query and analyze the data? Does Prodigy have any built-in features to assist in this, or should I be prepared to implement a custom querying system?
Any guidance or insights would be invaluable. Thank you for your time and expertise.
Thanks for your question and welcome to the Prodigy community!
Sounds like an interesting project!
Yes. Prodigy has database components that let you connect directly to the database, and you can build your own Python scripts on top of them:
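from prodigy.components.db import connect
# Connect to the database configured for your Prodigy installation
db = connect()
# Names of all datasets stored in the database
all_dataset_names = db.datasets
# All annotated examples of one dataset, as a list of dicts
examples = db.get_dataset_examples("my_dataset")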
You can also use db-out to export the data, but I'd recommend starting with the Python components, as they give you more flexibility.
Alternatively, if you plan to train a model with spaCy, I'd also consider data-to-spacy. It's a very helpful recipe that converts your annotations to spaCy binary files and creates a default config file that can be used with spacy train.
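Once the examples are loaded, a basic query is just a loop over a list of dicts. The sketch below counts relation labels. It assumes each example carries a "relations" list whose items have a "label" key, plus an "answer" key for accept/reject, which is how I'd expect relation annotations to be stored; check one of your own records first. The sample records and label names are made up for illustration.

```python
from collections import Counter


def count_relation_labels(examples):
    """Count relation labels across accepted examples."""
    counts = Counter()
    for eg in examples:
        # Skip examples that were rejected or ignored during annotation.
        if eg.get("answer", "accept") != "accept":
            continue
        # Assumed layout: a list of relation dicts, each with a "label".
        for rel in eg.get("relations", []):
            counts[rel["label"]] += 1
    return counts


# Hypothetical records standing in for db.get_dataset_examples("my_dataset").
sample = [
    {
        "text": "Taxes are too high. Who could disagree?",
        "answer": "accept",
        "relations": [
            {"head": 5, "child": 0, "label": "CHALLENGES"},
            {"head": 0, "child": 3, "label": "SUPPORTS"},
        ],
    },
    {
        "text": "Prices rose last year, so wages must follow.",
        "answer": "accept",
        "relations": [{"head": 6, "child": 0, "label": "SUPPORTS"}],
    },
    {
        "text": "This one was rejected.",
        "answer": "reject",
        "relations": [{"head": 0, "child": 1, "label": "CHALLENGES"}],
    },
]

label_counts = count_relation_labels(sample)
print(label_counts.most_common())
```

With real data you would pass the list returned by db.get_dataset_examples instead of the sample list.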
If you mean example projects on relation extraction, yes! Sofie has written a wonderful blog post with an accompanying tutorial video and GitHub template project. The use case is biomedical, not rhetorical text structure, but hopefully it should give you a good understanding. It also includes a Prodigy recipe and some ideas of how to annotate in Prodigy.
What's important to know is that this project was also developed to show how to create a custom trainable component in spaCy. I mention this because it does start to get a bit deep, but it can also be powerful, as it shows how to extend the trainable component if your problem is slightly different. I'd also recommend searching spaCy's GitHub discussion forum if you have questions on the spaCy component.
Are you simply looking for basic stats about your annotations, e.g., how frequently certain relations or entity types occur? Prodigy doesn't offer many stats functions out of the box, so it'll likely be best to write your own scripts. We will soon be releasing a built-in component for Inter-Annotator Agreement (IAA), but I suspect you're more interested in stats about the annotations than in agreement metrics across different users.
The closest built-in option is probably spaCy's debug data command: save your annotations as spaCy binary files (i.e., use data-to-spacy), then run debug data on them. This is helpful for tasks like spancat, where you can see a host of data characteristics. I suspect ner should work too. Unfortunately, I don't think there's support for relations, as it's not a native component.
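For sequences rather than single counts, one option is to order the relations within each example and count label n-grams. This is only a sketch of one possible definition of "sequence": it assumes each relation dict has integer "head" and "child" token indices and a "label", and it orders relations by head position. The label names and records are hypothetical.

```python
from collections import Counter


def label_sequence(example):
    """Return relation labels ordered by head (then child) token index."""
    rels = sorted(
        example.get("relations", []),
        key=lambda r: (r["head"], r["child"]),
    )
    return [r["label"] for r in rels]


def count_label_ngrams(examples, n=2):
    """Count runs of n consecutive relation labels within each example."""
    counts = Counter()
    for eg in examples:
        seq = label_sequence(eg)
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


# Hypothetical records; relations are deliberately listed out of order.
sample = [
    {
        "relations": [
            {"head": 12, "child": 6, "label": "SUPPORTS"},
            {"head": 0, "child": 3, "label": "SUPPORTS"},
            {"head": 6, "child": 0, "label": "REBUTS"},
        ]
    },
    {
        "relations": [
            {"head": 8, "child": 2, "label": "REBUTS"},
            {"head": 2, "child": 5, "label": "SUPPORTS"},
        ]
    },
]

bigram_counts = count_label_ngrams(sample, n=2)
for labels, freq in bigram_counts.most_common():
    print(" -> ".join(labels), freq)
```

If your notion of sequence depends on sentence order or speaker turns instead, the sort key in label_sequence is the part to change.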
Thank you for your comprehensive response; it's greatly appreciated.
Regarding querying the dataset, I understand the flexibility that the Python components provide, and I'm intrigued by the capabilities of connecting directly to the database. However, a significant part of my team is not technically inclined. Is there a no-code platform or alternative approach you'd suggest that could facilitate easier access for my non-tech-savvy teammates to query the data?
Another crucial aspect for our project is language support. Can you confirm if Prodigy supports Hebrew and Aramaic for annotations?
Once again, thank you for the guidance. I'm excited to delve deeper into the resources and get started with the annotation process.
We offer the db-out recipe, which exports a dataset as a .jsonl file. You're welcome to write your own script to simplify things for your non-programmer teammates. Maybe you could write a custom Jupyter notebook that analyzes that output file?
Prodigy is designed to be a developer's annotation tool. The assumption is that you'd have at least one developer who could then write custom tools (e.g., a Streamlit app or Jupyter notebook) for the less technical teammates.
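As a rough idea of what such a helper could look like, here is a sketch that flattens exported JSONL into a CSV your teammates can open and filter in a spreadsheet. It assumes each relation has "head_span" and "child_span" dicts with character offsets "start" and "end" into the example's "text"; verify that against your own export. The sample record and the label are made up.

```python
import csv
import io
import json

FIELDS = ["label", "head_text", "child_text", "answer"]


def relation_rows(jsonl_lines):
    """Yield one flat row per relation from an iterable of JSONL lines."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        text = eg.get("text", "")
        for rel in eg.get("relations", []):
            # Assumed layout: spans carry character offsets into the text.
            head, child = rel["head_span"], rel["child_span"]
            yield {
                "label": rel["label"],
                "head_text": text[head["start"]:head["end"]],
                "child_text": text[child["start"]:child["end"]],
                "answer": eg.get("answer", ""),
            }


def write_csv(rows, out_file):
    """Write the flattened rows to any writable text file object."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical line standing in for the file produced by db-out.
sample_jsonl = [
    json.dumps({
        "text": "Taxes are too high. Who could disagree?",
        "answer": "accept",
        "relations": [{
            "label": "CHALLENGES",
            "head_span": {"start": 20, "end": 39},
            "child_span": {"start": 0, "end": 19},
        }],
    })
]

buffer = io.StringIO()
write_csv(relation_rows(sample_jsonl), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real export you would open the .jsonl file and pass the file object to relation_rows, and pass a file opened with newline="" to write_csv.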
What's most important is that you have a tokenizer for those languages if you're annotating spans or relations. Technically, Prodigy can work with any language, but you need a tokenizer. Out of the box, Prodigy works with spaCy, so you'd need to use one of its tokenizers. spaCy does have a tokenizer for Hebrew, which I know Prodigy users have used. However, I'm not aware of one for Aramaic, and I'm not sure whether the Hebrew tokenizer would work for it.
One feature Prodigy does have is right-to-left support, which is relevant for both Hebrew and Aramaic.
It sounds like you're planning to write your own custom component and may need help with tokenizers, for example for Aramaic. So I'd encourage you to look through and use spaCy's GitHub discussion forum. Since you mention your team isn't very tech savvy, I think you may find that a lot of your questions are spaCy questions.