UI crashes on custom spans

I have a NER use case where regexes are useful for pre-annotation, similar to how Prodigy uses pattern files. So I converted the regex matches to token and character offsets and saved them under the "spans" key of my .jsonl dataset.

However, when annotating the dataset, some of the regex spans cover the wrong content. When I click to remove them, the UI breaks. I've attached a small .jsonl file reproducing the problem. A screenshot is provided below.

ui_bug.jsonl (1.6 KB)

Hi! Could you share an example of the custom JSON you're streaming in? And can you double-check that the data matches the expected format? https://prodi.gy/docs/api-interfaces#ner_manual

The most relevant parts are:

  • the example should define a list of valid "tokens" with text, start, end and an ID
  • each span should define its start and end character offsets, as well as the start and end token index (inclusive) that it refers to
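To make the expected format concrete, here's a small hand-built task matching those two points (the text and values are made up for illustration, not taken from the attached file):

```python
# A minimal ner_manual-style task: "tokens" with text/start/end/id, and
# "spans" carrying both character offsets and inclusive token indices.
task = {
    "text": "Check City: Oakland",
    "tokens": [
        {"text": "Check", "start": 0, "end": 5, "id": 0},
        {"text": "City", "start": 6, "end": 10, "id": 1},
        {"text": ":", "start": 10, "end": 11, "id": 2},
        {"text": "Oakland", "start": 12, "end": 19, "id": 3},
    ],
    "spans": [
        # start/end are character offsets into "text";
        # token_start/token_end are inclusive token indices
        {"start": 12, "end": 19, "token_start": 3, "token_end": 3, "label": "ANSWER"},
    ],
}

# Sanity check: each span's character offsets line up exactly with the
# boundaries of the tokens it references
for span in task["spans"]:
    first = task["tokens"][span["token_start"]]
    last = task["tokens"][span["token_end"]]
    assert span["start"] == first["start"] and span["end"] == last["end"]
```

If a span's offsets don't line up with token boundaries like this, the UI can't map the highlight back to tokens, which is consistent with the breakage you're seeing.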

If you're using a custom recipe and spaCy for tokenization, the add_tokens preprocessor should take care of adding the tokens and aligning the spans for you.

Sure, the example .jsonl was included in the initial post (ui_bug.jsonl), but I'll attach it here again.

Example of custom JSON: ui_bug.jsonl (1.6 KB)

{"meta":{"form_name":"12549608361_JustFund-Common-Proposal-Questions.txt","page_number":2,"n":200,"jsonl_version":"0.03"},"text":"2 12549608361_JustFund-Common-Proposal-Questions.txt\nHomelessness\n\n- Human Rights / Civil Rights & Liberties\n\n- Immigration\n\n- LGBTQ+\n\n- Racial Justice\n\n- Transportation / Utilities / Public Infrastructure\n\n- Other\n\nURGENT NEED\n\nSpecific urgent need categories may be active for limited times. If this\nproposal suits an urgent need, select the category it fits. Note, this\noption will have no choices available if there are no current urgent\nneed categories.\n\n- None\n\n- COVID-19\n\nDONATION INFORMATION\n\nPlease list the address where all contributions should be sent. If you\nhave a Fiscal Sponsor, please list your Fiscal Sponsor's address.\n\nDonation Website:\n\nDonation Instructions:\n\nCheck Donation Addressed To:\n\nCheck Donation Memo Line:\n\nCheck Street Address:\n\nCheck City:\n\nCheck State:\n\nCheck ZIP:\n\n*Required fields\n","spans":[{"token_start":14,"token_end":59,"label":"ANSWER","start":2,"end":12},{"token_start":59,"token_end":76,"label":"ANSWER","start":12,"end":16},{"token_start":76,"token_end":88,"label":"ANSWER","start":16,"end":20},{"token_start":88,"token_end":108,"label":"ANSWER","start":20,"end":25},{"token_start":108,"token_end":164,"label":"ANSWER","start":25,"end":34},{"token_start":164,"token_end":175,"label":"ANSWER","start":34,"end":38},{"token_start":419,"token_end":429,"label":"ANSWER","start":88,"end":92},{"token_start":429,"token_end":443,"label":"ANSWER","start":92,"end":96}]}

It looks like the "tokens" field is also required. But the documentation says that ner_manual will add that: "... [t]he ner_manual interface allows highlighting spans of text based on token boundaries, and will use a model to tokenize the text and add an additional "tokens" property to the annotation task."

Despite the documentation above, am I still required to add "tokens" if I pre-populate "spans"?

Ah, I had missed the uploaded file in your previous post! The missing "tokens" property is most likely the problem here, yes.

Sorry if the docs here were misleading: I think what this sentence was trying to say was that the interface typically takes advantage of a model to produce the "tokens" programmatically. So recipes using the ner_manual interface (e.g. ner.manual) typically use the add_tokens helper to add tokens and map spans to them:
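For illustration, here's a simplified stand-in for what the add_tokens preprocessor does (the real helper uses spaCy's tokenization and has more options; this sketch splits on single spaces purely to show the alignment step, and the example text is hypothetical):

```python
# Simplified sketch of add_tokens: build a "tokens" list, then map each
# existing span's character offsets to inclusive token indices.
def add_tokens_sketch(task):
    tokens, offset = [], 0
    for i, word in enumerate(task["text"].split(" ")):
        tokens.append({"text": word, "start": offset,
                       "end": offset + len(word), "id": i})
        offset += len(word) + 1  # +1 for the space separator
    task["tokens"] = tokens
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    for span in task.get("spans", []):
        if span["start"] not in starts or span["end"] not in ends:
            raise ValueError("span not aligned to token boundaries")
        span["token_start"] = starts[span["start"]]
        span["token_end"] = ends[span["end"]]
    return task

task = add_tokens_sketch({
    "text": "Donation Website: example.org",
    "spans": [{"start": 18, "end": 29, "label": "ANSWER"}],
})
```

After this step, the span above carries token_start/token_end pointing at the "example.org" token, which is the format the interface needs.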

(I'll update the docs to make this more clear, thanks for pointing this out!)


Thanks for the clarification! The snippet you posted says "Also handles pre-defined selected spans," which suggests that I do not need to add a "tokens" field when using predefined "spans"?

For example, the .jsonl I posted does not include "tokens", and it sounds like ner.manual will add them for pre-defined spans. So why specifically is the missing "tokens" property the issue here?

Ah, so what this means is: the add_tokens helper will automatically update "spans" that are present in the data with the respective token information (start and end token index). It will also check that the pre-defined spans are aligned with the tokenization. So if your source data consists of texts and spans with character offsets only, add_tokens will add a "tokens" property and include the token references in the "spans".

So what the recipe sends out at the end should be the exact format that the ner_manual interface expects – it's just that you don't necessarily have to think about the tokenization and span alignment yourself and can let the add_tokens helper handle it for you.


Thank you, that was helpful. I have enough here to make this work.
