No Task Available

I'm getting the Prodigy error "No Task Available", with the specific error: "ERROR: Can't fetch tasks. Make sure the server is running correctly."

Can someone please help me figure this out? This is a big data set, but I was only able to get to line 284 before I got this error message.

This is the code being run:

# Prodigy using jsonl data that was converted from json in convert_data.ipynb
!python -m prodigy ner.manual test_data_5K blank:en ./test_data_5K.jsonl --label PER,ORG,MISC,LOC

This is the output error message:

Using 4 label(s): PER, ORG, MISC, LOC

✨ Starting the web server at http://0.0.0.0:8088 ...
Open the app in your browser and start annotating!

Task exception was never retrieved
future: <Task finished name='Task-11' coro=<RequestResponseCycle.run_asgi() done, defined at /usr/local/anaconda3/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py:388> exception=ValueError("Mismatched tokenization. Can't resolve span to token index 1. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.\n\n{'start': 1, 'end': 4, 'label': 'ORG'}")>
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 393, in run_asgi
self.logger.error(msg, exc_info=exc)
File "/usr/local/anaconda3/lib/python3.8/logging/__init__.py", line 1463, in error
self._log(ERROR, msg, args, **kwargs)
File "/usr/local/anaconda3/lib/python3.8/logging/__init__.py", line 1577, in _log
self.handle(record)
File "/usr/local/anaconda3/lib/python3.8/logging/__init__.py", line 1586, in handle
if (not self.disabled) and self.filter(record):
File "/usr/local/anaconda3/lib/python3.8/logging/__init__.py", line 807, in filter
result = f.filter(record)
File "cython_src/prodigy/util.pyx", line 121, in prodigy.util.ServerErrorFilter.filter
File "/usr/local/anaconda3/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 390, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
return await self.app(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/fastapi/applications.py", line 140, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/applications.py", line 134, in __call__
await self.error_middleware(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/errors.py", line 178, in __call__
raise exc from None
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/errors.py", line 156, in __call__
await self.app(scope, receive, _send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
await self.simple_response(scope, receive, send, request_headers=headers)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/cors.py", line 140, in simple_response
await self.app(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/base.py", line 25, in __call__
response = await self.dispatch_func(request, self.call_next)
File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/app.py", line 198, in reset_db_middleware
response = await call_next(request)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/base.py", line 45, in call_next
task.result()
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/middleware/base.py", line 38, in coro
await self.app(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/exceptions.py", line 73, in __call__
raise exc from None
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/exceptions.py", line 62, in __call__
await self.app(scope, receive, sender)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/routing.py", line 590, in __call__
await route(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/routing.py", line 208, in __call__
await self.app(scope, receive, send)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/routing.py", line 41, in app
response = await func(request)
File "/usr/local/anaconda3/lib/python3.8/site-packages/fastapi/routing.py", line 129, in app
raw_response = await run_in_threadpool(dependant.call, **values)
File "/usr/local/anaconda3/lib/python3.8/site-packages/starlette/concurrency.py", line 25, in run_in_threadpool
return await loop.run_in_executor(None, func, *args)
File "/usr/local/anaconda3/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/app.py", line 420, in get_session_questions
return _shared_get_questions(req.session_id, excludes=req.excludes)
File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/app.py", line 391, in _shared_get_questions
tasks = controller.get_questions(session_id=session_id, excludes=excludes)
File "cython_src/prodigy/core.pyx", line 223, in prodigy.core.Controller.get_questions
File "cython_src/prodigy/core.pyx", line 227, in prodigy.core.Controller.get_questions
File "cython_src/prodigy/components/feeds.pyx", line 99, in prodigy.components.feeds.SharedFeed.get_questions
File "cython_src/prodigy/components/feeds.pyx", line 106, in prodigy.components.feeds.SharedFeed.get_next_batch
File "cython_src/prodigy/components/feeds.pyx", line 245, in prodigy.components.feeds.RepeatingFeed.get_session_stream
File "/usr/local/anaconda3/lib/python3.8/site-packages/toolz/itertoolz.py", line 376, in first
return next(iter(seq))
File "cython_src/prodigy/components/preprocess.pyx", line 130, in add_tokens
File "cython_src/prodigy/components/preprocess.pyx", line 222, in prodigy.components.preprocess._add_tokens
File "cython_src/prodigy/components/preprocess.pyx", line 199, in prodigy.components.preprocess.sync_spans_to_tokens
ValueError: Mismatched tokenization. Can't resolve span to token index 1. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 1, 'end': 4, 'label': 'ORG'}

Can someone assist me with this issue?

Hi! It's usually not so helpful to bump a thread, especially not after such a short period of time – it can often make it harder for us to keep track of new threads and make sure we can answer everyone.

If you see an error message in the UI about the server not running correctly, it typically means that the Python recipe raised an error. In this case, the cause of the error was this:

Which recipe are you running and how are you loading in the data? Are you labelling examples with pre-defined spans?

Basically, what the error means is that Prodigy came across an annotated span referring to a slice from character 1 to 4 – but this slice doesn't map to valid tokens produced by the tokenizer. If you've generated pre-annotated spans yourself, this can sometimes happen if you have leading or trailing whitespace, or an off-by-one error in the offsets. In some cases, it can also mean that your data assumes that a string is split when it isn't – for instance, "A-B" tokenized as ["A", "-", "B"] vs. kept as one token ["A-B"]. So you can either adjust the tokenization rules, provide your custom "tokens" in the data (if you'll be using the same tokenization during training), or adjust your annotation scheme to ensure that the model will be able to learn from the data. Also see this thread for tips on how to find and resolve mismatched tokenization:
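As a rough sketch of what such a check could look like (assuming a blank:en pipeline like the one in your command, and tasks in the usual "text" + "spans" format – the helper name find_mismatches is just illustrative):

```python
import spacy

# A blank English pipeline, matching the blank:en in the recipe command
nlp = spacy.blank("en")

def find_mismatches(examples):
    """Yield (line_number, span) for every pre-set span whose character
    offsets don't line up with spaCy's token boundaries."""
    for i, eg in enumerate(examples, start=1):
        doc = nlp(eg["text"])
        for span in eg.get("spans", []):
            # Doc.char_span returns None when the offsets don't map
            # cleanly onto whole tokens
            if doc.char_span(span["start"], span["end"]) is None:
                yield i, span

# Hypothetical example with an off-by-one offset: "Apple" starts at 0, not 1
examples = [{"text": "Apple is great", "spans": [{"start": 1, "end": 5, "label": "ORG"}]}]
for line_no, span in find_mismatches(examples):
    print(f"line {line_no}: {span}")
```

Running this over all lines of your JSONL file would print every example the server would otherwise fail on.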

Hi, thank you for replying.

This is the data that I have been using. After line 284 is when I got this error message. Some sentences have already been labeled, and I want to check whether that was done correctly.
(It might be easier to talk over the phone to fix this issue – is there a phone number to call?)

As you can see, the format of the data is all the same, hence my confusion about why it stopped after line 284 with the error "ValueError: Mismatched tokenization. Can't resolve span to token index 1. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 1, 'end': 4, 'label': 'ORG'}". Clearly you can see that the tokens are the same throughout and use the same format.

This is my run command for Prodigy
!python -m prodigy ner.manual test_data_5K blank:en ./test_data_5K.jsonl --label PER,ORG,MISC,LOC

I looked at your message and took a look at the text for the first line. I'm not sure why, but I think the quotation marks (") might be the issue. When I deleted them from the first line of data (see below), I get another error message (see below).

Using 4 label(s): PER, ORG, MISC, LOC
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/__main__.py", line 53, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "cython_src/prodigy/core.pyx", line 353, in prodigy.core._components_to_ctrl
File "cython_src/prodigy/core.pyx", line 142, in prodigy.core.Controller.__init__
File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.__init__
File "cython_src/prodigy/components/feeds.pyx", line 155, in prodigy.components.feeds.SharedFeed.validate_stream
File "/usr/local/anaconda3/lib/python3.8/site-packages/toolz/itertoolz.py", line 376, in first
return next(iter(seq))
File "cython_src/prodigy/components/preprocess.pyx", line 130, in add_tokens
File "cython_src/prodigy/components/preprocess.pyx", line 222, in prodigy.components.preprocess._add_tokens
File "cython_src/prodigy/components/preprocess.pyx", line 199, in prodigy.components.preprocess.sync_spans_to_tokens
ValueError: Mismatched tokenization. Can't resolve span to token index 101. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 101, 'end': 112, 'label': 'PER'}

Hi!

As Ines explained, some of your input data has span annotations that don't match the tokenization used during annotation. For instance, if your tokenizer says that "Tim's house" is two words, "Tim's" and "house", then you won't be able to annotate just "Tim" as a span, because that wouldn't align to token boundaries. So for that example, you'd either have to change your span annotation, or your tokenizer, or provide the token annotations in your input file.

I can't grep through your screenshot of the data, but you must have a line somewhere with a span {'start': 1, 'end': 4, 'label': 'ORG'} – that particular line is likely the culprit.

Are you saying that a sentence with the name Bill Gates cannot be grouped together?
I have been using Prodigy by highlighting the entire name and saying this is a person. You seem to suggest that I have to say Bill is a person and Gates is a person, and that I cannot say Bill Gates is a person.

No, that's not what I was trying to say.

The first step in any NLP pipeline is called "tokenization" - it means breaking up a sentence into tokens or words. Typically, a sentence "Bill Gates is rich" will be broken up into 4 tokens: ["Bill", "Gates", "is", "rich"].

Tokenization can be slightly different depending on the exact rules that are being used, though. For instance, the sentence "I won't go" can be parsed as 3 tokens, keeping "won't" together as one token, or as 4 tokens, splitting up "won't". It depends on the tokenizer.

When you're annotating spans with Prodigy, a span always needs to follow those token boundaries. That means that a span needs to include full tokens, so it can be "Bill Gates" – spanning two tokens. However, you typically can't annotate the substring "Gate" in "Gates" if "Gates" is one token.
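To make the token-boundary idea concrete, here's a quick sketch with a blank English pipeline (exact token counts depend on which tokenizer you use):

```python
import spacy

nlp = spacy.blank("en")

doc = nlp("Bill Gates is rich")
print([t.text for t in doc])   # four tokens: Bill / Gates / is / rich

# A span has to cover whole tokens. Doc.char_span gives you the span
# if the character offsets line up with token boundaries, and None if not:
print(doc.char_span(0, 10))    # "Bill Gates" – aligns to the first two tokens
print(doc.char_span(5, 9))     # "Gate" – not a full token, so this is None
```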

That is what your error message is about: you have a span annotation that does not align with the token boundaries defined by the tokenizer you're using in your recipe. The error message helps you to identify the "offending" input. If you can cite it here, we can likely help you fix that particular example.

Okay I understand what you are saying.

The original error message is this (below). I am assuming that line one in the jsonl file is the issue, as that is what the token index is saying.

ValueError: Mismatched tokenization. Can't resolve span to token index 1. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 1, 'end': 4, 'label': 'ORG'}

This is the first line of the file

No, the error is not saying that there is a problem with the first line.

Can't resolve span to token index 1.
{'start': 1, 'end': 4, 'label': 'ORG'}

There is a particular line with an ORG entity annotated from char 1 to 4, which doesn't align to token boundaries. The span can't start at index 1, because that's not a token start.

Can you find the data line with {'start': 1, 'end': 4, 'label': 'ORG'} in it?

Hi,

I looked at the data and there is no {'start': 1, 'end': 4, 'label': 'ORG'} present.
I'm not sure why this is the error message, as there is no span that uses start 1 and end 4 for the ORG label.

Please assist. I can attach the data file if you would like, but I am not sure how to attach it.

When you hit reply, there should be an upload icon that lets you attach files to your post. If you attach your dataset (not a screenshot), I can have a look :wink:

Here is the JSONL file, thank you.
test_data_5K.jsonl (1.0 MB)

Here's the example that defines the span with {'start': 1, 'end': 4, 'label': 'ORG'}:

{"text":"/LIC traditionally has to push through layers of bureaucracy to get anything done.","spans":[{"start":1,"end":4,"label":"ORG"}]}

As @SofieVL explained earlier, the problem here is that /LIC is treated as one token, whereas the span expects a token starting at character 1 – which will never be true with this tokenization. In other words, the example annotates span boundaries that the tokenizer never actually produces. Prodigy alerts you to this because otherwise you'd be creating annotations that your model can't learn from.

There are a few similar instances in the data that you can find by searching for "start": 1,, but I think they're mostly texts starting with quotes, which the tokenizer would split by default. But it might still be worth double-checking them. (The thread I linked above includes code you can use to quickly check your annotated spans against a given tokenization and find mismatches programmatically.)

If the mismatches are caused by the data not being "clean", e.g. leftover markup etc., adding a preprocessing step can help – just make sure the same preprocessing is also applied at runtime later so the model doesn't see anything unexpected. Alternatively, if arbitrary trailing characters are common in your data, another solution would be to add some stricter tokenization rules that split off characters like / at the beginning of a word, so you end up with ["/", "LIC"] instead of ["/LIC"].
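To illustrate the stricter-rules option, here's a minimal sketch (assuming a blank:en pipeline) that adds "/" to the tokenizer's prefix patterns so it gets split off the front of a word:

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")
print([t.text for t in nlp("/LIC traditionally has to push")])  # "/LIC" stays one token

# Add "/" to the prefix patterns and rebuild the prefix regex, so a
# leading slash is split off as its own token
prefixes = list(nlp.Defaults.prefixes) + [r"/"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

print([t.text for t in nlp("/LIC traditionally has to push")])  # now ["/", "LIC", ...]
```

With that rule in place, a span over just "LIC" (characters 1 to 4) aligns to a token – but remember you'd need the same custom tokenization at training time, too.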

Thank you for your help in finding the issue. The issue was a "/" character between two words without a space in between. I corrected this and added a space. After this I found another error that pointed to a "'" next to a word.

I was converting a JSON to a JSONL file, because Prodigy needs the JSONL file type. I tried to fix the spacing issues when going from stand-alone tokens to a sentence format. I suppose I need to make better rules for deciding how to concatenate the tokens into a nice string with correct spacing. It seems the spacing rules that I made were the cause of this.

That said, in an earlier reply one of your staff indicated that we can customize how it looks and parses for tokens. Can you explain how to customize this further? That may be a route to take to have clean data that Prodigy can do NER on.

Have you looked at this thread that Ines linked earlier for dealing with non-matching tokens? Matching tokenisation on pre-existing annotated data

The other option is indeed to define the tokens in your input data - then the tokenizer won't run and will just take the tokens as you've defined them. You can see the expected format here: https://prodi.gy/docs/api-interfaces#ner_manual

{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ],
  "tokens": [
    {"text": "First", "start": 0, "end": 5, "id": 0},
    {"text": "look", "start": 6, "end": 10, "id": 1},
    {"text": "at", "start": 11, "end": 13, "id": 2},
    {"text": "the", "start": 14, "end": 17, "id": 3},
    {"text": "new", "start": 18, "end": 21, "id": 4},
    {"text": "MacBook", "start": 22, "end": 29, "id": 5},
    {"text": "Pro", "start": 30, "end": 33, "id": 6}
  ]
}

(You don't need to have the spans predefined – if you don't, you just annotate them all yourself.)
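If you want to generate that "tokens" property from spaCy's tokenization yourself, a minimal sketch could look like this (with_tokens is just an illustrative helper name, and the example assumes blank:en):

```python
import spacy

nlp = spacy.blank("en")

def with_tokens(eg):
    """Return a copy of the task with a 'tokens' list in the format
    the ner_manual interface expects."""
    doc = nlp(eg["text"])
    eg = dict(eg)
    eg["tokens"] = [
        {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
        for t in doc
    ]
    return eg

task = with_tokens({"text": "First look at the new MacBook Pro"})
for tok in task["tokens"]:
    print(tok)
```

You'd run each line of your JSONL through such a helper before loading it into Prodigy – and then make sure the same tokenization is used at training time.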