block with classifier and ner

the blocks feature in your latest release - awesome!

I sometimes have tasks where I need to classify and label entities, and I'd like to do both at the same time. The problem is in the config setting of my recipe: how do I specify which labels are for the NER block and which ones are for the classification block?

I'm thinking something like the following:

return {
        DATASET: dataset,
        VIEW_ID: BLOCKS,
        STREAM: stream,
        "config": {
            "blocks": [
                {"view_id": "ner_manual"},
                {"view_id": "classification"}
            ],
            "ner_labels": [
                V1, ARG1, 
                V2, ARG2,
                V3, ARG3, 
                ARG12, ARG13, ARG23, 
                V12, V13, V23
            ],
           "cls_labels":  ["TRUE", "FALSE"]
        }
    }

Any ideas?

Thanks, that's nice to hear :smiley:

You can set "labels" on the block (just like html_template – see the table in this section). However, this will be equivalent to returning "config": {"labels": [...]} from your recipe, which is what you'd do for ner_manual or image_manual.

The classification interface only needs a "label" property on each task. Or, if you're using the choice interface, you can add a list of "options". So there shouldn't really be a conflict here, and you could build something similar to the cat facts example here, just without the input field. Or maybe I'm misunderstanding your use case?
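As a minimal sketch of how the two pieces fit together (the helper function names here are mine, not part of the Prodigy API): the NER labels live on the ner_manual block, while the choice block reads "options" off each task.

```python
# Hypothetical helpers illustrating the split described above:
# per-block "labels" for ner_manual, per-task "options" for choice.

def make_blocks_config(ner_labels):
    # "labels" is set on the ner_manual block itself ...
    return {
        "blocks": [
            {"view_id": "ner_manual", "labels": ner_labels},
            {"view_id": "choice"},  # ... while choice reads "options" per task
        ]
    }

def make_task(text, tokens, choice_options):
    # Each incoming example carries its own list of options
    return {
        "text": text,
        "tokens": tokens,
        "options": [{"id": o, "text": o} for o in choice_options],
    }
```

So there's one "labels" setting for the manual block and no label key needed for the choice block at all.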

Missed the cat facts example and that sorts me out; thanks much @ines.

Just to share where I got to and the use case, in case there are better ways or this helps someone else. Here is what I have:

My Recipe:

import prodigy
import spacy
import jsonlines
from spacy import displacy

# Constants like DATASET, TEXT, TOKENS, OPTIONS, etc. are string
# constants defined elsewhere in my project.
@prodigy.recipe(
    'srl',
    dataset=("Dataset target", "positional", None, str),
    model=("Language model", "positional", None, str),
    source=("Path to data", "positional", None, str)
)
def srl(dataset, model, source):

    nlp = spacy.load(model)

    def get_tasks(source):
        with jsonlines.open(source, 'r') as rdr:
            for eg in rdr:
                yield {
                    TEXT: eg[TEXT],
                    TOKENS: eg[TOKENS],
                    OPTIONS: [
                        {ID: BUY, TEXT: BUY},
                        {ID: SELL, TEXT: SELL}
                    ],
                    HTML: displacy.render(nlp(eg[TEXT]), style='dep', page=True)
                }

    stream = get_tasks(source)

    return {
        DATASET: dataset,
        VIEW_ID: BLOCKS,
        STREAM: stream,
        CONFIG: {
            BLOCKS: [
                {VIEW_ID: NER_MANUAL},
                {VIEW_ID: CHOICE},
                {VIEW_ID: HTML}
            ],
            LABELS: [
                V1, ARG1, 
                V2, ARG2,
                V3, ARG3, 
                ARG12, ARG13, ARG23, 
                V12, V13, V23
            ],
        }
    }

A screenshot of the resulting interface:

A small remaining issue: if I am using keyboard shortcuts (the only way to fly!) and I hit 1, both V1 in the NER labels and BUY in the choice labels are selected. This is not really a big deal; if I do my NER tagging before my choice tagging, everything works fine...just wanted to point it out in case there is a better way to implement this.


Also, just to share the full use case: I am using Prodigy to harvest data on semantic dependencies. I will use the dataset with the spaCy parser model (your chat semantics recipe), as well as some other implementations, as I don't think the spaCy dependency parser can represent some of the nuances I am encountering. For example, the V* labels are predicate heads and the ARG* labels are predicate arguments. The reason I have ARG12 is for elliptical cases where a single argument is acting in two structures. For example, here is a typical sentence in my data:

"I paid $50 for the Ken Griffey Jr Card ... I am offered now at $65"

Here "Ken Griffey Jr Card" is an argument for both the predicate "paid" and the predicate "offered". So I would annotate "paid" as V1 and "offered" as V2, while the chunk "Ken Griffey Jr Card" gets ARG12, meaning it's an argument for both V1 and V2. I can translate that into CoNLL format later, which can represent overlapping spans, whereas the spaCy parser cannot (I could be wrong about this?).

This looks cool, thanks for sharing! Love the displaCy integration :heart_eyes:

You could use the "keymap_by_label" config setting (requires v1.9.4) to set custom shortcuts for your labels – or custom shortcuts for your options. For example, {"1": "b", "2": "s"} would change the option keys to b for "buy" and s for "sell".
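To make that concrete, a sketch of how the setting could sit in the recipe's returned config (the "keymap_by_label" mapping is the one described above; the surrounding dict shape mirrors your recipe):

```python
# Hypothetical config fragment: remap the choice option keys so they
# no longer collide with the numeric NER label shortcuts.
config = {
    "blocks": [
        {"view_id": "ner_manual"},
        {"view_id": "choice"},
        {"view_id": "html"},
    ],
    # option 1 ("BUY") now uses key "b", option 2 ("SELL") uses key "s"
    "keymap_by_label": {"1": "b", "2": "s"},
}
```

The numeric keys 1, 2, 3, ... then stay free for the NER labels only.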

How are you representing the spans in the CoNLL format and what are you training with it?

spaCy currently doesn't have a built-in component that just predicts arbitrary non-entity sequences – only a named entity recognition model. If you're training a named entity recognizer, a token can only be part of one entity – and the BIO and BILUO scheme can also only represent one label per token, so the data format expects each token to have one entity label.

Perfect! Thx.

Regarding the CoNLL representation: it's a hack so that I can force my problem into the AllenNLP SRL model. The CoNLL-2012 format allows an arbitrary number of columns between the entity column and the co-reference column, to represent more than one verb clause. So, using my sentence above, I am going to try representing it as follows:

http://conll.cemantix.org/2012/data.html

Notice in the grey boxes that the span "Ken Griffey Jr Card" is an argument in two columns, each representing a span; so overlapping spans, which I suspect should be avoided, but it's all over my datasets of dialogue.
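Roughly, the two predicate columns would look something like this (my own abbreviated sketch in the CoNLL-2012 star-bracket notation, with generic ARG role names as placeholders; the real columns sit between the entity and co-reference columns):

```
token      pred-1 (paid)   pred-2 (offered)
I          (ARG*)          *
paid       (V*)            *
$          (ARG*           *
50         *)              *
for        *               *
the        (ARG*           (ARG*
Ken        *               *
Griffey    *               *
Jr         *               *
Card       *)              *)
...        *               *
offered    *               (V*)
...        *               *
```

The "Ken Griffey Jr Card" span opens and closes in both predicate columns, which is how the ARG12 label gets translated out.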

I am not sure if the AllenNLP architecture will accommodate the implied overlapping span there. Assuming I get something reasonable for sentences like the above, I will then include that model towards the end of a spaCy pipeline. If I am well off course, please let me know; otherwise I will definitely post how it goes, specifically how that process deals with the example sentence above.