UI crashes on custom spans

robinsonkwame · May 21, 2021, 2:11pm

I have an NER use case where regexes are useful for pre-annotation, similar to how Prodigy uses pattern files. So I saved the regex spans, after converting them to token and character offsets, in the spans key of my .jsonl dataset.
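For context, regex pre-annotation of this kind could look roughly like the sketch below. The helper name `regex_spans`, the pattern, and the sample text are made up for illustration and are not taken from the attached file; only character offsets are produced here.

```python
import re

def regex_spans(text, pattern, label):
    """Return character-offset spans for every regex match in `text`.

    Hypothetical helper for illustration; not part of Prodigy.
    """
    spans = []
    for match in re.finditer(pattern, text):
        # match.start() / match.end() are character offsets (end is exclusive)
        spans.append({"start": match.start(), "end": match.end(), "label": label})
    return spans

text = "Check City:\n\nCheck State:\n\nCheck ZIP:\n"
task = {"text": text, "spans": regex_spans(text, r"Check [A-Za-z]+:", "ANSWER")}
```

Slicing `text[span["start"]:span["end"]]` for each span is a quick way to confirm that the offsets cover the intended content before the data goes into the annotation UI.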

However, when annotating the dataset, some of the regex spans identify/cover the wrong content. When I click on them to remove them, the UI breaks. I've attached a small .jsonl reproducing the problem. A screenshot is provided below.

ui_bug.jsonl (1.6 KB)

ines · May 24, 2021, 12:24am

Hi! Could you share an example of the custom JSON you're streaming in? And can you double-check that the data matches the expected format? https://prodi.gy/docs/api-interfaces#ner_manual

The most relevant parts are:

- the example should define a list of valid "tokens" with text, start, end and an ID
- each span should define its start and end character offsets, as well as the start and end token index (inclusive) that it refers to
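As a rough illustration of those two points, a hand-written task could look like the following. The text and label are invented for the example, and only the fields named above are shown; a real task may carry additional properties.

```python
# Hand-written example task; text and label are made up for illustration.
task = {
    "text": "Check City: Boston",
    "tokens": [
        {"text": "Check", "start": 0, "end": 5, "id": 0},
        {"text": "City", "start": 6, "end": 10, "id": 1},
        {"text": ":", "start": 10, "end": 11, "id": 2},
        {"text": "Boston", "start": 12, "end": 18, "id": 3},
    ],
    "spans": [
        # start/end are character offsets; token_start/token_end are
        # token indices, with token_end inclusive
        {"start": 12, "end": 18, "token_start": 3, "token_end": 3, "label": "ANSWER"},
    ],
}

# Every token's offsets should slice back to its own text.
tokens_ok = all(
    task["text"][tok["start"]:tok["end"]] == tok["text"] for tok in task["tokens"]
)
```

Here the span's character offsets coincide with the start of its first token and the end of its last token, which is the alignment the interface relies on.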

If you're using a custom recipe and spaCy for tokenization, the add_tokens preprocessor should take care of adding the tokens and aligning the spans for you.

robinsonkwame · May 25, 2021, 1:44pm

Sure, the example .jsonl was included in the initial post (ui_bug.jsonl) but I'll attach here again.

Example of custom JSON: ui_bug.jsonl (1.6 KB)
e.g.,

{"meta":{"form_name":"12549608361_JustFund-Common-Proposal-Questions.txt","page_number":2,"n":200,"jsonl_version":"0.03"},"text":"2 12549608361_JustFund-Common-Proposal-Questions.txt\nHomelessness\n\n- Human Rights / Civil Rights & Liberties\n\n- Immigration\n\n- LGBTQ+\n\n- Racial Justice\n\n- Transportation / Utilities / Public Infrastructure\n\n- Other\n\nURGENT NEED\n\nSpecific urgent need categories may be active for limited times. If this\nproposal suits an urgent need, select the category it fits. Note, this\noption will have no choices available if there are no current urgent\nneed categories.\n\n- None\n\n- COVID-19\n\nDONATION INFORMATION\n\nPlease list the address where all contributions should be sent. If you\nhave a Fiscal Sponsor, please list your Fiscal Sponsor's address.\n\nDonation Website:\n\nDonation Instructions:\n\nCheck Donation Addressed To:\n\nCheck Donation Memo Line:\n\nCheck Street Address:\n\nCheck City:\n\nCheck State:\n\nCheck ZIP:\n\n*Required fields\n","spans":[{"token_start":14,"token_end":59,"label":"ANSWER","start":2,"end":12},{"token_start":59,"token_end":76,"label":"ANSWER","start":12,"end":16},{"token_start":76,"token_end":88,"label":"ANSWER","start":16,"end":20},{"token_start":88,"token_end":108,"label":"ANSWER","start":20,"end":25},{"token_start":108,"token_end":164,"label":"ANSWER","start":25,"end":34},{"token_start":164,"token_end":175,"label":"ANSWER","start":34,"end":38},{"token_start":419,"token_end":429,"label":"ANSWER","start":88,"end":92},{"token_start":429,"token_end":443,"label":"ANSWER","start":92,"end":96}]}

It looks like the tokens field is also required. But the documentation says that ner_manual will add that, "... [t]he ner_manual interface allows highlighting spans of text based on token boundaries, and will use a model to tokenize the text and add an additional "tokens" property to the annotation task."

Despite the documentation above, am I still required to add "tokens" if I pre-populate spans?

ines · May 26, 2021, 1:38am

Ah, I had missed the uploaded file in your previous post! The missing "tokens" property is most likely the problem here, yes.
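One way to catch this before loading the data is a small sanity check along these lines. This is only a sketch: `check_task` is a hypothetical helper, not a Prodigy function, and it only tests the properties discussed in this thread.

```python
def check_task(task):
    """Return a list of problems found in an ner_manual-style task dict.

    Hypothetical helper for illustration; not part of Prodigy.
    """
    tokens = task.get("tokens")
    if not tokens:
        return ['missing "tokens"']
    problems = []
    for i, span in enumerate(task.get("spans", [])):
        required = ("start", "end", "token_start", "token_end")
        missing = [key for key in required if key not in span]
        if missing:
            problems.append(f"span {i}: missing {missing}")
            continue
        ts, te = span["token_start"], span["token_end"]
        # token_end is inclusive, so it must be a valid index as well
        if not (0 <= ts <= te < len(tokens)):
            problems.append(f"span {i}: token indices out of range")
            continue
        if tokens[ts]["start"] != span["start"] or tokens[te]["end"] != span["end"]:
            problems.append(f"span {i}: character offsets do not match token boundaries")
    return problems

no_tokens = {"text": "a b", "spans": [{"start": 0, "end": 1, "token_start": 0, "token_end": 0}]}
print(check_task(no_tokens))
```

Running it over each line of the .jsonl would flag examples like the one above, where "spans" are present but "tokens" are not.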

Sorry if the docs here were misleading: I think what this sentence was trying to say was that the interface typically takes advantage of a model to produce the "tokens" programmatically. So recipes using the ner_manual interface (e.g. ner.manual) typically use the add_tokens helper to add tokens and map spans to them:

github.com/explosion/prodigy-recipes/blob/0037b32d954e0b1672f9dae1e8aa53ac0c9136e3/ner/ner_manual.py#L38-L41

          # Tokenize the incoming examples and add a "tokens" property to each
          # example. Also handles pre-defined selected spans. Tokenization allows
          # faster highlighting, because the selection can "snap" to token boundaries.
          stream = add_tokens(nlp, stream)

(I'll update the docs to make this more clear, thanks for pointing this out!)

robinsonkwame · May 26, 2021, 1:38pm

Thanks for the clarification! The snippet you posted says "Also handles pre-defined selected spans," which suggests that I do not need to add a "tokens" field when using pre-defined "spans"?

For example, the .jsonl I posted does not include "tokens", and it looks like ner.manual will add "tokens" to examples with pre-defined spans. So why specifically is my missing "tokens" property the issue here?

ines · May 27, 2021, 9:23am

Ah, so what this means is: the add_tokens helper will automatically update "spans" that are present in the data with the respective token information (start and end token index). It will also check that the pre-defined spans are aligned with the tokenization. So if your source data consists of texts and spans with character offsets only, add_tokens will add a "tokens" property and include the token references in the "spans".
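To make the idea concrete, here is a simplified stand-in written in plain Python. It is not the actual add_tokens implementation: the regex tokenizer replaces the spaCy model, the function name `add_tokens_sketch` is made up, and raising a `ValueError` on misaligned spans is an assumption about how such a check could behave.

```python
import re

def add_tokens_sketch(task):
    """Add "tokens" to a task and map character-offset spans onto them.

    Illustrative stand-in only; a real recipe would call add_tokens(nlp, stream).
    """
    text = task["text"]
    # Toy tokenizer: runs of word characters, or single punctuation marks
    tokens = [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]
    by_start = {tok["start"]: tok["id"] for tok in tokens}
    by_end = {tok["end"]: tok["id"] for tok in tokens}
    aligned = []
    for span in task.get("spans", []):
        if span["start"] not in by_start or span["end"] not in by_end:
            # Assumed behaviour: reject spans that cut through a token
            raise ValueError(f"span {span} is not aligned with token boundaries")
        aligned.append({
            **span,
            "token_start": by_start[span["start"]],
            "token_end": by_end[span["end"]],  # inclusive token index
        })
    return {**task, "tokens": tokens, "spans": aligned}

source = {"text": "Check ZIP: 02139", "spans": [{"start": 11, "end": 16, "label": "ANSWER"}]}
result = add_tokens_sketch(source)
```

The input only has text and character offsets; the output carries both "tokens" and the token references inside "spans".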

So what the recipe sends out at the end should be the exact format that the ner_manual interface expects – it's just that you don't necessarily have to think about the tokenization and span alignment yourself and can let the add_tokens helper handle it for you.

robinsonkwame · May 27, 2021, 3:41pm

Thank you, that was helpful. I have enough here to make this work.
