github annotations / textcat example

Hi all,

I started the text classification example, but ran into this error

  File "msgpack/_unpacker.pyx", line 187, in msgpack._cmsgpack.unpackb
ValueError: 1792000 exceeds max_bin_len(1048576)

Apparently I’m loading too much data at once (or at least more than my 16 GB of RAM). I could obviously buy another 16 GB, but that just postpones the problem until I have more than 32 GB of data. Any workarounds, or things I’m doing wrong?

thanks,

Andreas

The full traceback is below:

python3 -m prodigy textcat.teach gh_issues en_core_web_sm "docs" --api github --label DOCUMENTATION
Using 1 labels: DOCUMENTATION
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ahe/.local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/usr/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/usr/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/ahe/.local/lib/python3.6/site-packages/prodigy/recipes/textcat.py", line 45, in teach
    nlp = spacy.load(spacy_model, disable=['ner', 'parser'])
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/util.py", line 112, in load_model
    return load_model_from_link(name, **overrides)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/util.py", line 129, in load_model_from_link
    return cls.load(**overrides)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/data/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/util.py", line 173, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 647, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 643, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 643, in spacy.pipeline.Tagger.from_disk
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 626, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 627, in spacy.pipeline.Tagger.from_disk.load_model
  File "/home/ahe/.local/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 335, in from_bytes
    data = msgpack.loads(bytes_data, encoding='utf8')
  File "/home/ahe/.local/lib/python3.6/site-packages/msgpack_numpy.py", line 184, in unpackb
    return _unpackb(packed, **kwargs)
  File "msgpack/_unpacker.pyx", line 187, in msgpack._cmsgpack.unpackb
ValueError: 1792000 exceeds max_bin_len(1048576)
...

This sounds like a known issue with a recent version of the msgpack dependency, which introduced this limit. Could you try downgrading to msgpack-python==0.5.6? (Alternatively, upgrading to the latest spaCy/Thinc should also resolve this, because it’ll pull in the correct msgpack.)
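For reference (assuming a standard pip setup; adjust for your virtualenv or user install), the downgrade is just a pinned install, and you can double-check which msgpack version you end up with afterwards:

python3 -m pip install msgpack-python==0.5.6
python3 -c "import msgpack; print(msgpack.version)"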

Hi Ines,

thanks for the response. I did a fresh install of spaCy/Thinc yesterday; that did not help. Downgrading to msgpack-python==0.5.6 did the trick.

The GitHub example has a (minor) issue. Every 10 annotations, the frontend shows the message “No tasks available”. In some cases it reloads and gives me 10 more statements to annotate; otherwise, my workaround is to save the 10 annotations and restart the server for the next batch of 10. I checked on two different networks, but that did not solve the problem. The error I get is:

raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://api.github.com/search/issues?q=docs&order=desc&sort=created&per_page=100&page=10
(full traceback below)

I guess it’s the “page=10” parameter at the end of the URL?

Any suggestions on how to solve the issue?

best,

Andreas

full traceback:

11:21:24 - Task queue depth is 1
11:21:26 - Task queue depth is 2
11:21:40 - Exception when serving /get_questions
Traceback (most recent call last):
  File "/home/ahe/.local/lib/python3.6/site-packages/waitress/channel.py", line 336, in service
    task.service()
  File "/home/ahe/.local/lib/python3.6/site-packages/waitress/task.py", line 175, in service
    self.execute()
  File "/home/ahe/.local/lib/python3.6/site-packages/waitress/task.py", line 452, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/home/ahe/.local/lib/python3.6/site-packages/hug/api.py", line 423, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "/home/ahe/.local/lib/python3.6/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "/home/ahe/.local/lib/python3.6/site-packages/hug/interface.py", line 793, in __call__
    raise exception
  File "/home/ahe/.local/lib/python3.6/site-packages/hug/interface.py", line 766, in __call__
    self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
  File "/home/ahe/.local/lib/python3.6/site-packages/hug/interface.py", line 703, in call_function
    return self.interface(**parameters)
  File "/home/ahe/.local/lib/python3.6/site-packages/hug/interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/home/ahe/.local/lib/python3.6/site-packages/prodigy/app.py", line 105, in get_questions
    tasks = controller.get_questions()
  File "cython_src/prodigy/core.pyx", line 109, in prodigy.core.Controller.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 66, in prodigy.components.feeds.SharedFeed.get_next_batch
  File "cython_src/prodigy/components/sorters.pyx", line 147, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 51, in genexpr
  File "cython_src/prodigy/models/textcat.pyx", line 122, in __call__
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 548, in pipe
    for doc, context in izip(docs, contexts):
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 572, in pipe
    for doc in docs:
  File "pipeline.pyx", line 858, in pipe
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 746, in _pipe
    for doc in docs:
  File "pipeline.pyx", line 431, in pipe
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 551, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/home/ahe/.local/lib/python3.6/site-packages/spacy/language.py", line 544, in <genexpr>
    texts = (tc[0] for tc in text_context1)
  File "cython_src/prodigy/models/textcat.pyx", line 120, in genexpr
  File "cython_src/prodigy/components/filters.pyx", line 35, in filter_duplicates
  File "cython_src/prodigy/components/filters.pyx", line 16, in filter_empty
  File "cython_src/prodigy/components/loaders.pyx", line 22, in _rehash_stream
  File "cython_src/prodigy/components/loaders.pyx", line 601, in __iter__
  File "cython_src/prodigy/util.pyx", line 406, in prodigy.util.make_api_request
  File "/home/ahe/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://api.github.com/search/issues?q=docs&order=desc&sort=created&per_page=100&page=10

Since it’s GitHub’s free API, it’s possible that it doesn’t always respond fast enough to fill up the queue by the time you’re done annotating. I also think 10 pages is the built-in limit for that search endpoint (it only serves the first 1000 results, and the loader requests 100 per page), so you might find it easier and more reliable to write a script that pulls the data from the API and saves it to a JSONL file, and then annotate from that file. The live APIs are nice for trying things out, but once you’re getting more serious about annotating, saving the data upfront is definitely more reliable.

You can probably use Prodigy’s built-in GitHub loader to do this – see the PRODIGY_README.html for details. Something like this should work:

from prodigy.components.loaders import GitHub
from prodigy.util import write_jsonl

stream = GitHub(query='python', order='desc')  # or whatever
# The above is a generator, so converting it to a list should
# evaluate it until the API runs out of results / rate limits you
data = list(stream)

# Save it all to a file
write_jsonl('/path/to/data.jsonl', data)
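Once the data is saved, you should be able to annotate from the file instead of the live API by passing it as the source, i.e. something like this (the path is just a placeholder, same recipe as before):

python3 -m prodigy textcat.teach gh_issues en_core_web_sm /path/to/data.jsonl --label DOCUMENTATION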