Relation annotation is responding very slowly

Hi,
I have a major question about relation annotation responding very slowly in my web UI, and a minor question about visualizing Chinese tokenization (in relation annotation).

Q1
The relation annotation front end responds very slowly to my operations, like moving and clicking, so my annotation cannot proceed. This happens when I have a rather long text (each document has thousands of words). If I force-split the document, the problem is greatly eased. However, this is not acceptable, because I have to annotate relations that span different sentences, and there is no valid position to split the text. (Another workaround is to split the document with overlapping sentences, but this causes extra effort, because I would have to annotate the overlapping sentences twice.)

Some info that might be helpful:

python3 -m prodigy rel.manual re1 blank:zh dataset:demo1 --label=Relation.label --wrap

Q2
I'm annotating Chinese text, so the text is not correctly tokenized and visualized as such. I tried loading a spaCy pipeline (produced with spaCy's to_disk, and which tokenizes my text correctly), but this changes nothing: the visualization I see in the web app is still the same as shown above (only pre-labeled entities are tokenized correctly, everything else remains plain text).
This is how I create my tokenizer model:

from spacy.lang.zh import Chinese

# Use the jieba segmenter instead of the default character segmentation
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.to_disk("jieba_tokenizer")

This is how I load it in my relation annotation command:

python3 -m prodigy rel.manual re1 jieba_tokenizer dataset:demo1 --label=Relation.label --wrap

Is this a problem with rel.manual, or am I doing something wrong? Could you point me to the relevant documentation or an example?

Thanks!

Benfeng

Hi and sorry about this – it's currently expected that the interface becomes less performant for very long documents with many tokens, and we're working on a rewrite that doesn't have this problem. I think what makes it additionally tricky in your case is that you just end up with more tokens overall, due to the way the characters are segmented.

As a workaround, one thing you can do is use patterns to disable all tokens that you know won't ever be part of a relation – of course, only if that's possible. Obvious candidates are punctuation, but you might be able to write other disable patterns based on part-of-speech tags etc.
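For example, the patterns file is just JSONL with one spaCy Matcher pattern per line, passed in via --disable-patterns. A minimal sketch of generating such a file (the punctuation and whitespace patterns here are only examples – adjust them to your data):

import srsly

# Matcher patterns describing tokens that should never be selectable
disable_patterns = [
    {"pattern": [{"is_punct": True}]},  # punctuation
    {"pattern": [{"is_space": True}]},  # whitespace-only tokens
]
srsly.write_jsonl("disable_patterns.jsonl", disable_patterns)

You'd then add --disable-patterns disable_patterns.jsonl to your rel.manual command.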

Can you double-check that when you load your custom pipeline with the tokenizer in Python and process a text, the tokens are segmented correctly? If a Doc produced by the model shows the correct tokens, Prodigy should reflect this accordingly in all recipes that use the model for tokenization.
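For example, a quick check could look like this (assuming the jieba_tokenizer directory from your snippet above):

import spacy

# Load the pipeline directory saved with nlp.to_disk("jieba_tokenizer")
nlp = spacy.load("jieba_tokenizer")

doc = nlp("我爱北京天安门")  # replace with a sentence from your data
print([token.text for token in doc])
# With jieba, this should print word-level tokens like
# ['我', '爱', '北京', '天安门'], not one token per character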

Hi @ines, what is the timeline like for the rewrite that you mentioned?

I can't give you a specific ETA at this point yet, sorry! But it's definitely something we have on our list of enhancements.

I have the same issue: the interface becomes very slow (unusable) with long documents in rel.manual.
Any updates?

The issue occurs at least with the hover event on tokens. @ines, would it be possible to edit some JS and disable the features that cause this, or is this core code?
I am working on long-range dependency problems. I have tested LabelStudio and its relation annotation works fast, but the software has other drawbacks.

It's unfortunately not as simple as that, because it all ties into how the interface is implemented, so we need to refactor the interface using a different technology, which is currently in progress. The easiest workaround at the moment, if you want to work with long dependencies, is to limit the possible connections as much as you can by disabling tokens, e.g. those that you know won't be relevant. So if you're annotating relations between named entities, you can disable all tokens that are not part of an entity.
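For example, in a custom recipe you could post-process the stream along these lines – a rough sketch, assuming your tasks follow the usual format (tokens with an "id", spans with "token_start"/"token_end") and relying on the same "disabled" flag that the disable patterns mechanism sets:

def disable_non_entity_tokens(stream):
    # Mark every token that isn't covered by a span as disabled,
    # so it can't be selected in the relations UI
    for eg in stream:
        covered = set()
        for span in eg.get("spans", []):
            covered.update(range(span["token_start"], span["token_end"] + 1))
        for token in eg.get("tokens", []):
            if token["id"] not in covered:
                token["disabled"] = True
        yield eg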

I disabled all the tokens that are not part of an entity, and the time to create one relation is about 8 seconds (in a document with 6k tokens and 500 spans) :frowning:

Great to hear that the refactor is in progress!