How is the `stream` being processed by the frontend?


I am in the process of setting up a custom interface and recipe for NER/document classification that:
i) displays metadata on the current example (fetched from a remote datastore), necessary for some classification tasks;
ii) adds pre-existing tags (e.g. based on PatternMatcher).

My code looks roughly as follows:

@prodigy.recipe("custom-recipe", ...)
def custom_recipe(*args):
    nlp = # get spacy model based on 
    stream = # iterator over rows in a parquet file, each row as dictionary with a "text" key
    stream = add_metadata_from_database(stream)  # iterator that queries remote store for metadata
    matcher = PatternMatcher(nlp)
    # PatternMatcher yields (score, example) tuples
    stream = (example for score, example in matcher(stream))
    stream = add_tokens(nlp, stream)

    return {
        "dataset": dataset,
        "stream": stream,
        "config": {"blocks": [....]},
    }

Fetching metadata is not particularly heavy. However, given rate limits, the queries against the remote store should ideally run lazily (i.e. only when an example is about to be displayed in the frontend for labeling).

Prodigy appears to preload 10 examples by default. However, whenever I add the call to the matcher, the matcher tries to match and score all examples in the stream. Is there a way to also apply PatternMatcher in a lazy manner (one at a time or in reasonable batches)?

PS: When using PatternMatcher in Prodigy, do I also have to be careful about how to add large numbers of phrase patterns to the matcher (similar to how it is done in spaCy, see "Rule-based matching" in the spaCy usage documentation)?

Ah, it looks like we're not currently exposing a batch size setting for the built-in pattern matcher, so it uses the default of spaCy's nlp.pipe, which is quite large. I'll fix this for the next release!

In the meantime, the easiest solution that also gives you the most flexibility would be to just call spaCy's PhraseMatcher (or Matcher) directly in your recipe. You can then control how the patterns are created, how many examples are processed at a time before matching and which components to disable when you create the Doc objects (e.g. if you're not using any attributes predicted by the model for matching).

Going from a matched span to a JSON-formatted entry for your "spans" is pretty straightforward:

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    label = doc.vocab.strings[match_id]
    span_dict = {"start": span.start_char, "end": span.end_char, "label": label}

If your patterns can produce overlapping or conflicting matches, make sure to filter them so you don't end up with overlapping entity spans. You can use spaCy's filter_spans utility for that.
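To make that concrete, here is a minimal sketch of a lazy matching step. It is not Prodigy's built-in implementation. It assumes spaCy v3 (where `matcher.add` takes a list of patterns), and the helper name `add_pattern_spans`, the example terms and the labels are made up for illustration. Phrase patterns are created with `nlp.tokenizer.pipe`, which only tokenizes and is the cheap way to build a large number of them.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans


def add_pattern_spans(stream, nlp, matcher, batch_size=10):
    """Lazily yield examples with a "spans" list built from matcher hits."""
    # as_tuples=True passes each example dict through alongside its Doc
    pairs = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(pairs, as_tuples=True, batch_size=batch_size):
        matches = [
            Span(doc, start, end, label=match_id)
            for match_id, start, end in matcher(doc)
        ]
        # Drop overlapping matches, preferring the longest span
        spans = filter_spans(matches)
        eg["spans"] = [
            {"start": s.start_char, "end": s.end_char, "label": s.label_}
            for s in spans
        ]
        yield eg


# Hypothetical setup: a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = {"ORG": ["Acme Corp", "Acme"], "GPE": ["Berlin"]}  # made-up terms
for label, phrases in terms.items():
    # tokenizer.pipe avoids running the full pipeline on every pattern
    matcher.add(label, list(nlp.tokenizer.pipe(phrases)))
```

In a recipe you would then write something like `stream = add_pattern_spans(stream, nlp, matcher)` before `add_tokens`. If you load a full pipeline instead of a blank one, `nlp.pipe` also accepts a `disable` argument for components the matcher doesn't need.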

For completeness, to answer your question in the thread title:

Whenever the queue is running low, the web app requests a new batch of examples (of size batch_size) from the back-end, which is then taken directly from the stream generator. So if there's nothing that's consuming larger chunks of the stream, only one batch at a time will be processed on the server and then sent to the front-end.
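As a simplified model of that behaviour (plain Python, not Prodigy's actual code), the sketch below shows that pulling one batch from a chain of generators only triggers the per-example work for that batch. The names `add_metadata`, `get_batch` and `fetched` are hypothetical stand-ins.

```python
from itertools import islice

fetched = []  # records which examples triggered a (simulated) remote query


def add_metadata(stream):
    """Generator step: runs only when the next example is requested."""
    for eg in stream:
        fetched.append(eg["text"])  # stand-in for the rate-limited query
        yield {**eg, "meta": {"source": "remote"}}


def get_batch(stream, batch_size=10):
    """Take the next batch_size examples from the stream, nothing more."""
    return list(islice(stream, batch_size))


raw = ({"text": f"doc {i}"} for i in range(1000))
stream = add_metadata(raw)

first = get_batch(stream, batch_size=10)
print(len(first), len(fetched))  # 10 10
```

Any step that materializes the stream (for example calling `list(stream)` or sorting it) or that buffers large batches internally will break this laziness, which is what happened with the matcher's default batch size.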

Thanks @ines, I'll switch over to the PhraseMatcher for now.

Quick update: just released v1.10.8, which introduces a batch_size argument on PatternMatcher.__call__. So you can now customise that according to what you need.