Annotating dependencies for very long sentences

We have not yet subscribed to Prodigy and wonder whether it would be useful for our dependency annotation project. Our corpus has very long sentences, and no web tool or interface has proved helpful for that so far. We have tried Arborator and several others, but dragging arcs is impossible when the sentence is very long. Working directly on the CoNLL-U file is the only way out so far, but it makes it easy to enter the wrong head numbers. Any insight as to whether Prodigy might be the solution? Thanks for any feedback on this.

Hi! It's definitely true that once you have long texts with lots of very long dependencies, the visual gain you get from drawing arcs on top of the text can be low. Allowing every token to be connected to every other token can also make things more complex, instead of reducing complexity with a visual UI. This is more of a conceptual problem, and the best solution ultimately depends on the type of annotation and what the dependencies represent.

When you say long sentences, how long are they on average? And what type of dependencies are you annotating? Are you working with syntax where you essentially need to connect every token to a root, or are you labelling different and more sparse annotations?

If you're not annotating syntax, there are various things you can do to reduce the complexity – for example, disabling tokens you don't need automatically (e.g. for coref) or annotating abstract representations (e.g. for sentence alignment).

If you are annotating syntactic dependencies (and especially if your goal is to create a proper treebank), there's obviously no way around labelling every token. I still think Prodigy can be useful here: you'll be able to toggle between line wrapping and inline view, hide/show the arcs to get a better overview and you can assign dependencies by clicking (instead of dragging). You can see a minimal example of a short sentence here, but the experience will be the same for long sentences. (Where it could become a bit trickier to ensure high performance is if your sentences are longer than ~300 tokens on average – but that would be very long sentences, so I assume yours are a bit shorter than that?)

Btw, I saw you've also emailed, and we're happy to set you up with an academic license so you can try it out :slightly_smiling_face:

Many thanks indeed for your reply! I am annotating syntactic dependencies in clinical text. Long sentences have 150 tokens on average, with plenty of list and conj relations, which demand drawing long arcs.
How about the output file once we annotate syntax dependencies? Can we export data as txt (CoNLL)? json? csv?
I have already applied for an academic license. I hope I am able to install it at all (I am a linguist, not too computer savvy).
Anyway, I look forward to trying Prodigy. Regards.

Prodigy's default output format is JSON, and you can find an example of the dependencies/relations output here: https://prodi.gy/docs/api-interfaces#relations

As you can see from the data, it should have everything you need: each dependency with its head, child and label, as well as the reference to the head and child tokens. If you need a different format, you should be able to convert it.
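To give a rough idea of what such a conversion could look like, here is a minimal Python sketch that turns one annotated record into CoNLL-U style lines. It assumes the record has a "tokens" list (each with "text" and "id") and a "relations" list (each with "head" and "child" as token indices, plus a "label"), so please check the field names against the docs linked above. The function name relations_to_conllu is made up for this example, and treating tokens without an incoming relation as the root is an assumption of the sketch, not something Prodigy does for you.

```python
def relations_to_conllu(example):
    """Convert one annotated record (a dict) into CoNLL-U style lines.

    Assumed input shape: example["tokens"] is a list of dicts with "text"
    and "id"; example["relations"] is a list of dicts with "head" and
    "child" (token indices) and "label".
    """
    # Map each child token index to its (head index, label).
    heads = {
        rel["child"]: (rel["head"], rel["label"])
        for rel in example.get("relations", [])
    }
    lines = []
    for tok in example["tokens"]:
        idx = tok["id"]
        if idx in heads:
            head_idx, label = heads[idx]
            head = head_idx + 1  # CoNLL-U token IDs start at 1
        else:
            # Assumption: a token with no incoming arc is the root.
            head, label = 0, "root"
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        fields = [str(idx + 1), tok["text"], "_", "_", "_", "_",
                  str(head), label, "_", "_"]
        lines.append("\t".join(fields))
    return "\n".join(lines)


if __name__ == "__main__":
    record = {
        "tokens": [{"text": "Patient", "id": 0}, {"text": "sleeps", "id": 1}],
        "relations": [{"head": 1, "child": 0, "label": "nsubj"}],
    }
    print(relations_to_conllu(record))
```

You would run this over each record of the exported annotations and separate sentences with a blank line. Columns that the annotations don't cover (lemma, POS tags and so on) are left as underscores.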

Sure, I totally understand! There are a few things about setting up a Python development environment that can be a bit tricky or unintuitive if you haven't done this before. But in general, we try to work with standard technologies and concepts wherever possible, so if you have a colleague who has experience with Python etc., they should be able to help without having to learn anything super specific to Prodigy :slightly_smiling_face:

After hours of trial and error, I managed to install Prodigy on my MacBook. But now I am lost. I got a "No module named" error:

Adrianas-MacBook-Pro:~ ariadna$ python3 -m prodigy
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 147, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/Users/ariadna/Library/Python/3.9/lib/python/site-packages/prodigy/__init__.py", line 1, in <module>
    from .util import init_package
ModuleNotFoundError: No module named 'prodigy.util'

Besides that, how do I get to an interface? Thanks for any help.

Hi! Sorry you were having trouble! The latest stable Prodigy installer currently supports Python 3.6, 3.7 and 3.8, but it looks like you're running 3.9, so this is likely the problem here. You can either install Python 3.8, or use a tool like pyenv that lets you run multiple versions of Python.
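If you want to double-check which Python version a given interpreter is before installing, a small sketch like the following can help. The supported versions listed in it are just the ones mentioned above for the current installer, and the helper name is_supported is made up for this example.

```python
import sys

# Versions mentioned above for the current stable installer (this will
# change with newer releases).
SUPPORTED = {(3, 6), (3, 7), (3, 8)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is in SUPPORTED."""
    return (version_info[0], version_info[1]) in SUPPORTED


if __name__ == "__main__":
    major, minor = sys.version_info[0], sys.version_info[1]
    status = "supported" if is_supported() else "not supported"
    print(f"Python {major}.{minor} is {status} by this sketch's list")
```

Run it with the same interpreter you plan to install into (for example python3 followed by the script name), since a machine can have several Python versions installed side by side.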

We'll be adding wheels for 3.9 to the upcoming new version – we first had to wait for all our dependencies to support it before we could build a version of Prodigy for 3.9.

Once you're set up, you can start an annotation workflow, also called a "recipe", from the command line. Different recipes have different commands and options.

The getting started guide might be a good place to start and it explains the most important concepts: https://prodi.gy/docs#first-steps

Thanks again! I tried one of the recipes but got
Can't read file: disable_patterns.jsonl
Is there a step by step document/tutorial?

Sorry about this.

The --disable-patterns argument lets you define a path to a patterns file to disable tokens you know are unselectable. It's an optional setting, so you don't have to use it if it's not relevant for your use case.

The following command will start the rel.manual workflow, save the annotations to the dataset your_dataset, pre-tokenize the text with English tokenization rules, load in a .txt file with raw text (replace this with your texts) and let you assign the labels A, B and C:

prodigy rel.manual your_dataset blank:en ./path/to/data.txt --label A,B,C

You can also load in your text as JSON(L) or CSV instead. You can also load in pre-tokenized data by providing a list of "tokens" (see here for the format).
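As a rough illustration of the pre-tokenized option, here is a minimal Python sketch that builds a "tokens" list from whitespace-split text and serializes one record per line as JSONL. The token fields used here ("text", "start", "end", "id", "ws") are an assumption about the expected format, so please compare them with the format documentation before relying on this. The helper names whitespace_tokens and to_jsonl_line are made up for the example, and real clinical text will likely need a better tokenizer than splitting on whitespace.

```python
import json


def whitespace_tokens(text):
    """Split text on whitespace and record character offsets per token."""
    tokens = []
    offset = 0
    for i, word in enumerate(text.split()):
        start = text.index(word, offset)  # search from the previous token's end
        end = start + len(word)
        # "ws" marks whether the token is followed by whitespace.
        ws = end < len(text) and text[end].isspace()
        tokens.append(
            {"text": word, "start": start, "end": end, "id": i, "ws": ws}
        )
        offset = end
    return tokens


def to_jsonl_line(text):
    """Serialize one record with its text and tokens as a single JSON line."""
    return json.dumps({"text": text, "tokens": whitespace_tokens(text)})


if __name__ == "__main__":
    print(to_jsonl_line("No fever today"))
```

Writing one such line per sentence to a .jsonl file gives you a file you can pass to the recipe in place of the .txt file.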