Testing add_tokens

Hi there,

I was running some tests to fix a "Mismatched tokenization" error I’m getting, and I tried the following code, which was recommended in another thread:

from prodigy.components.preprocess import add_tokens
import en_core_web_sm

nlp = en_core_web_sm.load()
text = " The upstart streaming service, which is primarily geared for sports fans, has an uphill climb against deep-pocketed competitors marketing cable alternatives to cord-cutters: YouTube TV, Hulu Live and Sony's PlayStation Vue."
stream = [{'text': text, 'spans': {'start': 175, 'end': 185}}]
new_stream = add_tokens(nlp, stream)
print(list(new_stream))

I’m getting the following exception when running this code:

TypeError                                 Traceback (most recent call last)
<ipython-input-115-958a6dcd96e1> in <module>()
  6 stream = [{'text': text, 'spans': {'start': 175, 'end': 185}}]
  7 new_stream = add_tokens(nlp, stream)
----> 8 print(list(new_stream))

cython_src/prodigy/components/preprocess.pyx in add_tokens()

TypeError: string indices must be integers

Thanks!

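I think you have a small typo in your stream: "spans" here is a dictionary, when it should be a list of dictionaries. Iterating over a dictionary yields its string keys, so any attempt to read a span's "start" from one of those keys would raise exactly this "string indices must be integers" error.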
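Here is a minimal sketch of that failure in plain Python, without Prodigy. It assumes that add_tokens loops over eg['spans'] and reads the offsets from each item; that matches the traceback, but the source is compiled, so treat it as a guess. The helper read_span_starts is a hypothetical stand-in, not part of Prodigy.

```python
def read_span_starts(eg):
    # Hypothetical stand-in: loop over the spans and read each start offset
    return [span["start"] for span in eg["spans"]]


# "spans" as a single dict (the typo) vs. a list of dicts (the expected shape)
bad_eg = {"text": "some text", "spans": {"start": 0, "end": 4}}
good_eg = {"text": "some text", "spans": [{"start": 0, "end": 4}]}

bad_error = None
try:
    read_span_starts(bad_eg)
except TypeError as err:
    # Looping over a dict gives the keys "start" and "end",
    # so "start"["start"] indexes a string with a string
    bad_error = err
    print("dict spans fail:", type(err).__name__)

good_starts = read_span_starts(good_eg)
print("list spans work:", good_starts)
```

In your example, wrapping the span in square brackets gives the list form: 'spans': [{'start': 175, 'end': 185}].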

The new validation mechanism should catch errors like that within the recipes. If you want to implement something like this yourself (e.g. to make sure that your stream is formatted correctly for the annotation task), you can also call the validator directly:

from prodigy.components.validate import Validator

validator = Validator('ner_manual')  # the view_id you want to use the stream with
for eg in stream:
    validator.check(eg)

Disclaimer: This is currently internal only, so the API may change in the future.
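If you would rather not depend on internals, a hand-rolled check is another option. The sketch below is not Prodigy's validator and makes no claim about what it checks. The function check_spans is hypothetical, and the rules it enforces (spans is a list of dicts with integer "start" and "end" offsets that fall inside the text) are my assumptions about a well-formed example.

```python
def check_spans(eg):
    """Return a list of problems found in eg["spans"]; an empty list means OK."""
    spans = eg.get("spans", [])
    if not isinstance(spans, list):
        return [f"'spans' must be a list, got {type(spans).__name__}"]

    problems = []
    text_len = len(eg.get("text", ""))
    for i, span in enumerate(spans):
        if not isinstance(span, dict):
            problems.append(f"span {i} must be a dict, got {type(span).__name__}")
            continue
        start, end = span.get("start"), span.get("end")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"span {i} needs integer 'start' and 'end'")
        elif not 0 <= start < end <= text_len:
            problems.append(f"span {i} offsets ({start}, {end}) fall outside the text")
    return problems


# Usage: report problems for each example before passing the stream on
examples = [
    {"text": "abcdef", "spans": {"start": 0, "end": 3}},    # dict instead of list
    {"text": "abcdef", "spans": [{"start": 0, "end": 3}]},  # well-formed
]
for eg in examples:
    print(check_spans(eg))
```

Returning a list of messages rather than raising on the first problem lets you see every issue in an example at once.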

Thanks!

I copied the code exactly from here, but adding the brackets solved it!

Ah, sorry, that was also a typo in my semi-pseudocode then! Thanks, fixed :+1: