Pattern only custom recipe failed with 10K annotation records

I've created a custom textcat recipe that only uses patterns (2K patterns). It worked with 1,000 annotation records, but the server reported the following error when I tried 10K annotation records. The reason I'm trying 10K records or more is that there are very few positive cases in my data.
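For context, a pattern-only textcat stream is driven by a JSONL patterns file with one JSON object per line, each carrying a `label` and a token-based or string `pattern`. A minimal sketch of what such entries look like (the labels and patterns here are invented for illustration, not from my actual patterns file):

```python
import json

# Hypothetical entries of the kind a textcat patterns.jsonl contains:
# one JSON object per line, with a "label" and a "pattern". The pattern
# may be a list of token-attribute dicts or a plain string.
patterns = [
    {"label": "POSITIVE", "pattern": [{"lower": "highly"}, {"lower": "recommend"}]},
    {"label": "POSITIVE", "pattern": "five stars"},
]
lines = [json.dumps(p) for p in patterns]
print("\n".join(lines))
```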

Here is the error -

    File "/home/ec2-user/anaconda3/envs/prodigy_sense2vec/lib/python3.8/site-packages/prodigy/", line 370, in _shared_get_questions
      tasks = controller.get_questions(session_id=session_id, excludes=excludes)
    File "cython_src/prodigy/core.pyx", line 138, in prodigy.core.Controller.get_questions
    File "cython_src/prodigy/components/feeds.pyx", line 68, in prodigy.components.feeds.SharedFeed.get_questions
    File "cython_src/prodigy/components/feeds.pyx", line 73, in prodigy.components.feeds.SharedFeed.get_next_batch
    File "cython_src/prodigy/components/feeds.pyx", line 153, in prodigy.components.feeds.SessionFeed.get_session_stream
    File "cython_src/prodigy/components/feeds.pyx", line 135, in prodigy.components.feeds.SessionFeed.validate_stream
    File "/home/ec2-user/anaconda3/envs/prodigy_sense2vec/lib/python3.8/site-packages/toolz/", line 376, in first
      return next(iter(seq))
    RuntimeError: cannot re-enter the tee iterator

Here is the teach code -

    from pathlib import Path
    from typing import Iterable, List, Optional, Union

    import spacy
    from prodigy.components.loaders import JSONL
    from prodigy.components.sorters import prefer_high_scores
    from prodigy.models.matcher import PatternMatcher
    from prodigy.recipes.textcat import teach

    def textcat_pattern_teach(
        dataset: str,
        spacy_model: str,
        source: Union[str, Iterable[dict]] = "-",
        label: Optional[List[str]] = None,
        api: Optional[str] = None,
        patterns: Optional[str] = None,
        init_tok2vec: Optional[Union[str, Path]] = None,
        loader: Optional[str] = None,
        long_text: bool = False,
        exclude: Optional[List[str]] = None,
    ):
        components = teach(dataset=dataset, spacy_model=spacy_model,
                           source=source, patterns=patterns, label=label)
        if spacy_model.startswith("blank:"):
            nlp = spacy.blank(spacy_model.replace("blank:", ""))
        else:
            nlp = spacy.load(spacy_model)
        #model = TextClassifier(nlp, label, long_text=long_text, init_tok2vec=init_tok2vec)
        stream = JSONL(source)
        if patterns is None:
            #predict = model
            #update = model.update
            raise ValueError("This pattern-only recipe requires a patterns file")
        else:
            matcher = PatternMatcher(nlp)
            matcher = matcher.from_disk(patterns)
            # Combine the textcat model with the PatternMatcher to annotate both
            # match results and predictions, and update both models.
            #predict, update = combine_models(model, matcher)
            predict, update = matcher, matcher.update
        #stream = prefer_uncertain(predict(stream))
        stream = prefer_high_scores(predict(stream))
        return {
            "view_id": "classification",
            "dataset": dataset,
            "stream": stream,
            "exclude": exclude,
            "update": update,
        }
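As a side note on the sorter step: `prefer_high_scores` consumes the `(score, example)` tuples the matcher yields and emits examples, favouring high-scoring ones. A deliberately simplified stand-in (not Prodigy's actual implementation, which adapts its threshold) shows the shape of the data flowing through:

```python
# Toy stand-in for a score-based sorter (NOT Prodigy's real
# prefer_high_scores): it consumes (score, example) tuples and yields
# only the examples scoring at or above a fixed threshold.
def toy_prefer_high_scores(scored_stream, threshold=0.5):
    for score, example in scored_stream:
        if score >= threshold:
            yield example

scored = [(0.9, {"text": "a"}), (0.1, {"text": "b"}), (0.7, {"text": "c"})]
kept = list(toy_prefer_high_scores(scored))
print(kept)
```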

Hi! By annotation records, do you mean examples in the data you're loading in? 10K really isn't much data at all, so I don't think the error is directly related to that.

The error here seems to happen when Prodigy is validating the first batch of the stream and then putting the generator back together. Apparently that error in itertools.tee is related to multiprocessing and raised because the iterator it creates isn't thread-safe...? I'm not sure why this happens for some generators and not others, or whether it's a Python 3.8 thing, but we'll investigate!
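For the curious: this guard was added in Python 3.8 (bpo-34410), which made `itertools.tee` raise a RuntimeError instead of crashing on re-entrant use. A contrived sketch of my own (not Prodigy's internals) that trips it, by having the wrapped generator call back into its own tee iterator:

```python
import itertools

def gen():
    yield 1
    # Re-entering the same tee iterator while it is already fetching
    # a new item trips the guard added in Python 3.8 (bpo-34410).
    next(tee_it)
    yield 2

tee_it, _ = itertools.tee(gen())
next(tee_it)  # fine: the first item is fetched normally

msg = ""
try:
    next(tee_it)  # resumes gen(), which re-enters tee_it
except RuntimeError as err:
    msg = str(err)
print(msg)
```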

Yes, by annotation records I meant examples in the data.
I'm using Python 3.7 and Prodigy 1.9.6.

Are you sure it's 3.7? The paths in the error you shared above show 3.8, so just checking to make sure.
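(For anyone hitting something similar, here is a quick, generic way to check which interpreter a script actually runs under; the prodigy import is left commented so this sketch also runs where Prodigy isn't installed.)

```python
import sys

# Show which interpreter is running and its version; compare this path
# against the site-packages path in the traceback to spot env mix-ups.
print(sys.executable)
print("Python %d.%d" % sys.version_info[:2])

# If Prodigy is importable, its file path reveals which copy is used:
# import prodigy; print(prodigy.__file__)
```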

That could be the problem. I have multiple copies of Prodigy installed. Checking...

After I moved my custom recipe to the Prodigy install in the running conda env, the error was gone. Thanks.


Glad you solved it – and very interesting and good to know that this is how an issue like this can surface :face_with_monocle: