Adding new blank dependencies to train them

Hi! I'm working on upgrading the es_dep_news_trf model as much as possible, and it would be very convenient if there was a possibility to add more dependencies with a blank value in order to train them from scratch.

An example of why I would want to do this: I'd like to make dependencies such as "obl" a little more specific, creating ones such as "pl_obl" (meaning place oblique) or "tm_obl" (meaning time oblique), and be able to replace the "obl" dependency with them based on the context.

Applied across many more situations, this would make the dependency parser a little richer and hopefully more helpful.
I hope I explained myself well, because I'm pretty much an amateur in NLP, and any response will be helpful!

Thanks in advance!

Welcome to the forum @elias :wave:

Adding new categories to pre-trained models (especially big transformer language models) is usually pretty tricky. Since all of the pre-trained weights have to be updated based on the new signal, you are likely to run into undesirable side-effects such as the "catastrophic forgetting" problem. In this case the best practice is to train via "pseudo-rehearsal", i.e. to use the original model to label examples and mix them into your fine-tuning updates.
Please see our blog post on pseudo-rehearsal to learn more about it.
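To give you an idea of the mechanics, mixing rehearsal examples into the updates could look roughly like the minimal sketch below (the texts, batch size and variable names are placeholders, not a recommended setup):

import random
import spacy
from spacy.training import Example

nlp = spacy.load("es_dep_news_trf")

# Pseudo-rehearsal: let the original model annotate raw domain text, so its
# own predictions become training examples that anchor the old behaviour
raw_texts = ["El tren llega a Sevilla.", "Trabajamos por la mañana."]
rehearsal = [Example(nlp.make_doc(t), nlp(t)) for t in raw_texts]

new_examples = []  # your hand-annotated Example objects with the new labels
train_data = new_examples + rehearsal
random.shuffle(train_data)

optimizer = nlp.resume_training()
for batch in spacy.util.minibatch(train_data, size=8):
    nlp.update(batch, sgd=optimizer)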

If you are trying to solve a concrete problem, perhaps a better option would be to use es_dep_news_trf to help you pre-annotate the data, which you would then curate by re-annotating the existing obl category, e.g. using the dep.correct recipe. With this data you could train a smaller CNN spaCy pipeline with more fine-grained obl labels.
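For example, a pre-annotation session over the existing label could be started along these lines (the dataset and source names are placeholders):

python -m prodigy dep.correct obl_annotations es_dep_news_trf ./texts.jsonl --label obl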

Thanks for the response @magdaaniol!

I have been trying to add the new dependencies, and I created this code in order to do it:

import prodigy
import spacy
from prodigy.components.loaders import JSONL

@prodigy.recipe("dep.manual.simple")
def dep_manual_simple(dataset, spacy_model, source):
    nlp = spacy.load(spacy_model)

    def add_tokens(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            # Pre-tokenize so the UI can render the text token by token
            eg["tokens"] = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": i}
                for i, t in enumerate(doc)
            ]
            eg["arcs"] = []  # no arcs yet, so every arc is drawn by hand
            yield eg

    stream = add_tokens(JSONL(source))

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "dep"
    }

However, the problem I encounter is that when executing the dep.correct recipe with this code, the interface shows the sentence unparsed, with only the dependencies I added available to annotate. What I'd like to do, if possible, is mix my dependencies with the already operative es_dep_news_trf dependencies and then use pseudo-rehearsal to prevent catastrophic forgetting. Would this be possible, or should I just use the dep.correct recipe as is to train my dependencies from scratch and mix them in later?
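Looking at my recipe again, I suspect the empty arcs list is why the sentence appears unparsed. A sketch that pre-fills the arcs from the model's predictions instead (I'm assuming the dep interface takes {"head", "child", "label"} arc entries, so this is untested) would be:

    def add_tokens(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            eg["tokens"] = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": i}
                for i, t in enumerate(doc)
            ]
            # Pre-fill the arcs with the model's predicted parse so it
            # shows up in the interface instead of a blank sentence
            eg["arcs"] = [
                {"head": t.head.i, "child": t.i, "label": t.dep_}
                for t in doc
                if t.head.i != t.i  # skip the root, which points to itself
            ]
            yield eg

Even with that, my question about mixing the two label sets still stands.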

P.S. When trying to use the dep.manual recipe instead, what I got was the following error:

TypeError: arc is undefined

I will specify my question a bit more, as I have learned some things since I posted the last question.

So I have this code, which uses a custom recipe to implement my desired new dependencies:

import prodigy
import spacy
from prodigy.components.loaders import JSONL

@prodigy.recipe("dep.manual.custom")
def dep_manual_custom(dataset, spacy_model, source):
    nlp = spacy.load(spacy_model)
    
    # Retrieve the original dependency labels from the loaded model
    original_dep_labels = nlp.get_pipe("parser").labels
    
    # Combine original labels with custom ones
    custom_dep_labels = ["pl_obl", "tm_obl"]
    combined_labels = list(original_dep_labels) + custom_dep_labels

    def add_tokens(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            eg["tokens"] = [{"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": i} for i, t in enumerate(doc)]
            eg["arcs"] = []  # Initialize empty arcs for manual annotation
            yield eg

    stream = add_tokens(JSONL(source))  # Load examples from JSONL source

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "dep",  # Use the dependency view
        "config": {
            "labels": combined_labels,  # Include both original and custom labels
            "span_labels": combined_labels,
            "optimize_typeahead": True,
            "show_flag": False,
            "exclude_by_input_hash": True
        }
    }

I used the following command, which is intended to add my custom dependencies to be trained along with the already-trained es_dep_news_trf dependencies.

python -m prodigy dep.correct my_dataset es_dep_news_trf my_data_source.jsonl --label "pl_obl,tm_obl" -F custom_deps.py

However, this only made my custom dependencies available to train from scratch, without the underlying behaviour of the es_dep_news_trf model. I tried removing --label "pl_obl,tm_obl" from the command, since that might have made the recipe consider only my custom dependencies, but then only the original es_dep_news_trf deps appeared.

I'd appreciate any sort of guidance on this matter, and I'm sorry if the previous question was difficult to understand; I'm still figuring this out.

Hi @elias,

Resuming training in order to add a label is something that's supported in spaCy, although the documentation is a bit sparse. We do have an example project and a few resources. Here's a project that shows the process with NER: projects/pipelines/ner_demo_update at v3 · explosion/projects · GitHub

We don't have an identical guide for dependency parsing, but the process should be similar.
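At its core, resuming training with new parser labels looks roughly like the sketch below (the annotation dicts are illustrative, and in practice you'd drive this through spacy train with sourced components, as in the project above):

import spacy
from spacy.training import Example

nlp = spacy.load("es_dep_news_trf")
parser = nlp.get_pipe("parser")

# Register the new labels before updating the existing parser
for label in ("pl_obl", "tm_obl"):
    parser.add_label(label)

# Illustrative annotation format; real data would come from your corpus
words = ["Trabajo", "en", "Madrid"]
annotations = {
    "words": words,
    "heads": [0, 2, 0],
    "deps": ["ROOT", "case", "pl_obl"],
}
example = Example.from_dict(nlp.make_doc(" ".join(words)), annotations)

optimizer = nlp.resume_training()
nlp.update([example], sgd=optimizer)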

Overall I'd recommend the following workflow:

  1. Label your new dependencies. I would do this with the model making predictions over your text, but without updating the model as you go. This removes any need to think about training or updates from Prodigy. You just need to prepare a stream with the samples to annotate. Presumably you'll want to skip all sentences that don't have the label you're intending to split. If you just want to divide a label, I would probably do token classification rather than the dependency interface. This won't show you the parse, but will give you a faster and less noisy interface.
  2. Export the data. If you took the token classification approach, you'll have to do some data manipulation to get it into a parse. If you need to do this sort of manipulation, I'd recommend writing a script that modifies the spaCy Doc objects, so that you can save a spaCy DocBin file with your fixed-up gold data (see the sketch after this list).
  3. Use spaCy to train the model. Prodigy has functionality to quickly wrap spaCy's training commands, but for more complex use-cases like this I would definitely recommend exporting to spaCy.
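To give a flavour of steps 2 and 3, here's a minimal sketch of the Doc manipulation (the span format assumes data exported with prodigy db-out from a token-classification dataset, and the file names are made up):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("es_dep_news_trf")
doc_bin = DocBin(store_user_data=True)

annotations = []  # stand-in for your examples, e.g. from `prodigy db-out my_dataset`
for eg in annotations:
    doc = nlp(eg["text"])
    for span in eg.get("spans", []):
        tokens = doc.char_span(span["start"], span["end"], alignment_mode="expand")
        if tokens is None:
            continue
        for token in tokens:
            # Overwrite the generic label with the fine-grained one
            if token.dep_ == "obl":
                token.dep_ = span["label"]  # e.g. "pl_obl" or "tm_obl"
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")

From there, training is a standard python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy run.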