Wrap breaks for long documents

Hi everyone,

I need to annotate quite long documents with entities and relations, while also showing some metadata, and we have begun work on a custom recipe to do that. We want the full document in the browser (scrolling would not be an issue), but line wrapping in the relations view seems to break past 7,000 characters (I don't have the token count yet, sorry), giving us:

For reference, here is our page:

Here is the recipe definition, nothing spectacular:

{
    "dataset": dataset,
    "view_id": "blocks",
    "stream": stream,
    "config": {
        "lang": lang,
        "relations_span_labels": [
            "Avocat",
            "Cabinet",
            "Partie Personne Morale",
            "Forme Juridique",
        ],
        "labels": ["Substitue", "Représente", "A pour status", "membre de"],
        "blocks": [
            {
                "view_id": "html",
                "html_template": "<div> Annoter la decision suivante: {{stream['meta']['short_title']}} </div>",
            },
            {"view_id": "relations"},
        ],
    },
}

I would love at least a workaround to get line breaks in our interface, as our documents contain line breaks and they are significant for our purposes. I have seen the thread on newlines in relations annotation, but it doesn't really help given that wrapping breaks on us! Is there any way to break lines on newline tokens?

As an aside, I noticed that on wider screens, Prodigy uses very little of the available space, which is a bummer for us given the quantity of text needed to grasp our documents (we work on legal decisions in France).

Edit: For readability, here is the stack trace from the error:

[Exception... "Failure"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: http://localhost:8080/bundle.js :: setHeight :: line 170"  data: no]
StageWrap@http://localhost:8080/bundle.js:220:11863
FiberProvider@http://localhost:8080/bundle.js:220:10007
div
div
_Content@http://localhost:8080/bundle.js:164:7859
bundle.js/createWithStyles/</WithStyles<@http://localhost:8080/bundle.js:88:15449
_Relations@http://localhost:8080/bundle.js:220:58610
bundle.js/createWithStyles/</WithStyles<@http://localhost:8080/bundle.js:88:15449
_Blocks@http://localhost:8080/bundle.js:164:20801
bundle.js/createWithStyles/</WithStyles<@http://localhost:8080/bundle.js:88:15449
div
_CardWithoutTheme@http://localhost:8080/bundle.js:220:89391
bundle.js/createWithStyles/</WithStyles<@http://localhost:8080/bundle.js:88:15449
Card@http://localhost:8080/bundle.js:220:92095
ErrorBoundary@http://localhost:8080/bundle.js:220:95619
div
div
_Annotator@http://localhost:8080/bundle.js:220:100285
bundle.js/createWithStyles/</WithStyles<@http://localhost:8080/bundle.js:88:15449
ConnectFunction@http://localhost:8080/bundle.js:71:12533
main
div
_class2@http://localhost:8080/bundle.js:93:42083
_Main@http://localhost:8080/bundle.js:220:118586
bundle.js/createWithStyles/</WithStyles<@http://localhost:8080/bundle.js:88:15449
ConnectFunction@http://localhost:8080/bundle.js:71:12533
ThemeProvider3@http://localhost:8080/bundle.js:75:24970
App@http://localhost:8080/bundle.js:222:5900
Provider@http://localhost:8080/bundle.js:75:1139

Hi @Martin,

Thanks for your post.

Unfortunately, it seems you're getting this error because of the length of the document, which is especially a problem in the relations interface:

One option could be to disable the tokens you're not interested in (using --disable-patterns); see the docs for an example.

However, I suspect your document is still way too long.

I would love at least a workaround to get line breaks in our interface, as our documents contain line breaks and they are significant for our purposes. I have seen the thread on newlines in relations annotation, but it doesn't really help given that wrapping breaks on us! Is there any way to break lines on newline tokens?

Why don't you write a pre-processing script for this?

For example, this Stack Overflow answer shows how to split text on newline characters before passing it to spaCy. You could combine this with this example:

Your source (input) file would then contain the text already split on newline characters, so you can still use the rel.manual recipe.
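A minimal sketch of such a pre-processing step, assuming each document is available as a plain string and that one task per non-empty line is what you want. The function names and the "line" meta key are made up for illustration; "short_title" is taken from the template in the recipe above. The output is JSONL with "text" and "meta" keys, which Prodigy can load as a source file.

```python
import json


def split_on_newlines(doc_text, meta=None):
    """Yield one task dict per non-empty line of a document.

    `meta` is copied into every task so that document-level metadata
    (e.g. a short title) stays attached to each line.
    """
    for line_no, line in enumerate(doc_text.splitlines()):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task_meta = dict(meta or {})
        task_meta["line"] = line_no  # hypothetical key: original line index
        yield {"text": line, "meta": task_meta}


def to_jsonl(tasks):
    """Serialize tasks as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(task, ensure_ascii=False) for task in tasks)


if __name__ == "__main__":
    # Made-up example document, not real data.
    doc = (
        "COUR D'APPEL\n"
        "\n"
        "Maître Dupont, avocat\n"
        "représentant la société Exemple SARL"
    )
    print(to_jsonl(split_on_newlines(doc, {"short_title": "exemple"})))
```

Keeping the original line index in the metadata makes it possible to put the lines back in document order later on.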

As for wider screens, you can change the card width by adding CSS to your configuration, for example:

"global_css": ".prodigy-container { max-width: 950px; }"

The docs also list other options for adjusting the UI (e.g., changing the font size).

Hope this helps!


So, after a bit of experimentation, we settled (for now) on a sliding window of around 10 lines of text to annotate our documents. Line breaks in our documents are significant, so we cannot just throw them away, and there is an issue with relations. Some relations are between entities that are really far apart (sometimes up to 5 lines), which means we cannot annotate all relations in one sliding window: some may fall outside the example while others are present and must be annotated...

I don't know whether examples that contain only some of the entities and relations should be marked as rejected, or whether we should annotate them anyway, even though they are missing relations.

I foresee some issues in training a relation model on all these separate annotations of 10-line windows, given that some windows will have relations with entities outside the window... I am not sure how to make this workflow better. Do you have any ideas? I'm thinking about adding a pseudo-sentencizer that separates blocks of significant value (where the relations are sure to be), but it's a lot of extra work...

Hi @Martin,

My colleague @ryanwesslen is OOO for a few days so I thought I would jump in to provide some advice.
I do share your concern that "incomplete" examples will negatively affect the performance of the model. It is also true, though, that loading full documents would not be a good annotator experience even if it were technically possible (and bad UI usually translates into inaccurate labels).

Just to clarify: by sliding window, you mean that you have a window of size 10 (lines) and you move it by increments of <=5, which effectively means that you look at each line multiple times in different contexts, right?
In that case, given that most relations are within 5 lines, you should be able to see them all at some point.
Assuming that this is the setup (as opposed to splitting the document into chunks of size 10 and iterating through those), it's probably best to annotate only complete examples and have a postprocessing step to merge all the relations. This way you should be able to catch them all, or at least a high enough number and variety for the model to generalize.
In that case, for the efficiency of the annotation process, we would recommend performing NER and REL separately: NER without the sliding window, and REL with the sliding window over the NER dataset, as discussed above.
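
The windowing and merging steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are made up, and relations are represented as simple (head_line, head_text, label, child_line, child_text) tuples with window-relative line indices, which is not Prodigy's own relation format. With a window of 10 lines and a step of 5, any two lines that are at most 5 lines apart end up together in at least one window.

```python
def sliding_windows(lines, size=10, step=5):
    """Return (start_line, window_lines) pairs that cover every line."""
    if not 1 <= step <= size:
        raise ValueError("step must be between 1 and size")
    if not lines:
        return []
    windows = []
    start = 0
    while True:
        windows.append((start, lines[start:start + size]))
        if start + size >= len(lines):
            break  # this window already reaches the end of the document
        start += step
    return windows


def merge_relations(window_annotations):
    """Merge per-window relations into document-level relations.

    `window_annotations` is an iterable of (start_line, relations) pairs.
    Each relation is a hypothetical tuple
    (head_line, head_text, label, child_line, child_text)
    whose line indices are relative to the window start.
    """
    merged = set()
    for start, relations in window_annotations:
        for head_line, head, label, child_line, child in relations:
            # Shift to document-level line indices; the set removes
            # relations that were annotated in overlapping windows.
            merged.add((start + head_line, head, label, start + child_line, child))
    return sorted(merged)


if __name__ == "__main__":
    lines = [f"line {i}" for i in range(12)]
    for start, window in sliding_windows(lines):
        print(start, len(window))
```

Because the same relation can be annotated in two overlapping windows, the merge step deduplicates on document-level positions rather than on window-relative ones.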
