We are building a relations recipe that consists of two blocks. The relations block should contain only the numerical tokens/spans of the text, since those are the tokens we want to create relationships between. The ner block should contain the full, labelled text of the current example, so that we have the full context and can create better relationships. Our approach was to build a full stream (essentially the full task) and a filtered stream (only the numerical tokens), and to pass the full stream to the ner block and the filtered stream to the relations block. We have tried this several times without success. Does anyone know how to achieve this? We are open to a solution with multiple streams, or to any better idea for having these two blocks show different content.
Welcome to the forum @SanVijey!
The front-end expects a single stream of tasks to render, so most UIs, including blocks, can't receive multiple streams (in most cases it shouldn't be necessary). The exception is pages, where you can define each component UI completely independently. I provide a pages-based solution below, but first I'd like to point out some simpler options.
I understand that you're trying to limit which tokens can be selected for relation annotations while preserving the entire sentence for context. Not sure if you've seen it, but the relations.manual recipe lets you define "disable" patterns: patterns for tokens that should be unselectable in the UI while still remaining visible.
For example, with a pattern that disables anything that is not a number, you'd get a UI like this:
As you can see, only the numbers are selectable, while the rest of the tokens are present but grayed out.
The pattern used in this example is:
{"label": "noNum","pattern": [{"LIKE_NUM": false}]}
You can also use entity labels and other spaCy token attributes in these patterns. See the documentation on pattern options for more details.
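For reference, here is a minimal sketch of how that pattern could be saved as a patterns file. It assumes the recipe reads disable patterns from a JSONL file (one JSON object per line). The file name disable_patterns.jsonl and the two helper functions are hypothetical, so check the recipe documentation for the exact option that takes this path.

```python
import json
import tempfile
from pathlib import Path

# The disable pattern from above: matches every token that does not look
# like a number. JSON `false` is written as Python `False` here.
disable_patterns = [{"label": "noNum", "pattern": [{"LIKE_NUM": False}]}]


def write_patterns(path: Path, patterns: list) -> None:
    """Write one JSON object per line (JSONL). Hypothetical helper."""
    with path.open("w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")


def read_patterns(path: Path) -> list:
    """Read the file back to confirm every line is valid JSON."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical location; any writable path works.
patterns_path = Path(tempfile.mkdtemp()) / "disable_patterns.jsonl"
write_patterns(patterns_path, disable_patterns)
loaded = read_patterns(patterns_path)
```

Reading the file back is just a sanity check that the patterns survive the round trip unchanged before you point the recipe at the file.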
One problem with this approach in your case could be that you also want to preserve the NER labels of the disabled tokens, if any.
You could combine ner with relations in blocks, but since both UIs share the underlying token representations, the disable patterns would apply in both UIs.
If you want to implement a UI that combines the ner UI with the relations UI and keeps them independent, you'd need to use pages rather than blocks.
In a way, pages can support multiple streams, in that you'd feed different data to your page-creating function.
Here's an example of what that could look like.
In the recipe below I programmatically create a task for ner and a task for relations while keeping them completely independent:
import copy
from pathlib import Path
from typing import Any, Dict, List

import prodigy
import spacy
from prodigy.components.preprocess import add_tokens
from prodigy.components.stream import get_stream
from prodigy.core import Arg
from prodigy.recipes.rel import preprocess_stream, setup_matchers
from prodigy.types import StreamType
from prodigy.util import set_hashes

REL_LABELS = ["REL_LABEL"]
NER_LABELS = ["PERSON", "ORG"]


def create_ner_page(
    text: str, tokens: List[Dict], spans: List[Dict], labels: List[str]
) -> Dict:
    """Create a ner page configuration."""
    # Make sure all tokens are visible: copy each token and drop the
    # "disabled" flag so the originals stay untouched for the relations page.
    visible_tokens = []
    for token in tokens:
        token_copy = copy.deepcopy(token)
        token_copy.pop("disabled", None)
        visible_tokens.append(token_copy)
    return set_hashes(
        {
            "text": text,
            "view_id": "ner",
            "tokens": visible_tokens,
            "spans": spans,
            "config": {"labels": labels},
        }
    )


def create_relations_page(text: str, tokens: List[Dict], labels: List[str]) -> Dict:
    """Create a relations page configuration."""
    # Tokens keep their "disabled" flags here, so only enabled tokens are selectable.
    return set_hashes(
        {
            "text": text,
            "view_id": "relations",
            "tokens": tokens,
            "config": {"labels": labels, "wrap_relations": True},
        }
    )


def create_pages(example: Dict[str, Any]) -> Dict[str, Any]:
    """Create all pages for a given example."""
    pages = [
        create_ner_page(
            text=example["text"],
            tokens=example.get("tokens", []),
            spans=example.get("spans", []),
            labels=NER_LABELS,
        ),
        create_relations_page(
            text=example["text"],
            tokens=example.get("tokens", []),
            labels=REL_LABELS,
        ),
    ]
    return set_hashes({"pages": pages})


def add_pages(stream: StreamType) -> StreamType:
    """Process the input stream and generate pages."""
    for example in stream:
        paginated_example = create_pages(example)
        yield set_hashes(paginated_example)


@prodigy.recipe(
    "test-recipe",
    dataset=Arg(help="Dataset to save answers to."),
    source=Arg(help="Input source"),
    disable_patterns_path=Arg(help="Disable patterns path"),
)
def test_recipe(
    dataset: str, source: str, disable_patterns_path: Path
) -> Dict[str, Any]:
    """
    Process text files and create a multi-page annotation interface.

    Args:
        dataset: Name of the dataset to save annotations
        source: Input source
        disable_patterns_path: Path to the file containing disable patterns

    Returns:
        Dictionary containing recipe configuration
    """
    stream = get_stream(source)
    nlp = spacy.blank("en")
    disable_matcher, disable_patterns = setup_matchers(nlp, disable_patterns_path)
    # Process stream
    stream.apply(add_tokens, stream=stream, nlp=nlp)
    # Apply matcher rules to the stream
    stream.apply(
        preprocess_stream,
        stream=stream,
        nlp=nlp,
        matcher=None,
        disable_matcher=disable_matcher,
        span_label=NER_LABELS,
        add_nps=False,
        add_ents=False,
    )
    stream = add_pages(stream=stream)
    return {
        "dataset": dataset,
        "view_id": "pages",
        "stream": stream,
        "config": {
            "custom_theme": {"cardMaxWidth": "90%"},
        },
    }
I reuse the disabling function from the relations recipe, which applies spaCy matcher rules to set the disabled attribute on the tokens. The function that creates the ner page then undoes this to make sure all tokens are visible in the ner UI.
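To make the copy-and-strip step concrete, here is a small standalone sketch that runs without Prodigy. The token dicts are hand-written and simplified (real tokens produced by the preprocessing carry more fields), and the helper name without_disabled is hypothetical. It shows that the ner page gets fully enabled tokens while the relations page keeps the disabled flags.

```python
import copy
from typing import Any, Dict, List

# Hand-written, simplified tokens for the text "Revenue rose 12",
# as they might look after the disable step has run.
tokens: List[Dict[str, Any]] = [
    {"text": "Revenue", "start": 0, "end": 7, "id": 0, "disabled": True},
    {"text": "rose", "start": 8, "end": 12, "id": 1, "disabled": True},
    {"text": "12", "start": 13, "end": 15, "id": 2},
]


def without_disabled(tokens: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the tokens with the "disabled" flag removed."""
    visible = []
    for token in tokens:
        token_copy = copy.deepcopy(token)  # leave the original token untouched
        token_copy.pop("disabled", None)
        visible.append(token_copy)
    return visible


ner_tokens = without_disabled(tokens)

# Same overall shape as the task built by the recipe, reduced to the token data.
task = {
    "pages": [
        {"view_id": "ner", "tokens": ner_tokens},
        {"view_id": "relations", "tokens": tokens},
    ]
}

# Tokens that stay selectable on the relations page.
selectable = [t["text"] for t in task["pages"][1]["tokens"] if not t.get("disabled")]
```

Because each token is copied before the flag is removed, the two pages never share token objects, which is what keeps the ner and relations views independent.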
The resulting UI looks like this:
Let me know if you need any clarification or have questions about implementing either approach!