Adding new blank dependencies to train them

Hi! I'm working on improving the es_dep_news_trf model as much as possible, and it would be very convenient if it were possible to add new dependency labels with a blank value in order to train them from scratch.

An example of why I would want to do this: I'd like to make dependencies such as "obl" a little more specific by creating labels such as "pl_obl" (place oblique) or "tm_obl" (time oblique), and then replace the generic "obl" dependency with these based on the context.

Applied in many more situations like this, it would make the dependency parser a little richer and hopefully more helpful.
I hope I explained myself well, because I'm pretty much an amateur in NLP, and any response will be helpful!

Thanks in advance!

Welcome to the forum @elias :wave:

Adding new categories to pre-trained models (especially big transformer language models) is usually pretty tricky. Since all of the pre-trained weights have to be updated based on the new signal, you are likely to run into undesirable side-effects such as the "catastrophic forgetting" problem, where the model loses what it previously learned. In this case the best practice is to train via "pseudo-rehearsal", i.e. to use the original model to label examples and mix them into your fine-tuning updates.
Please see our blog post on pseudo-rehearsal to learn more about it.
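
To give you an idea, here's a minimal sketch of what pseudo-rehearsal can look like with spaCy v3. Note that raw_texts and new_examples are hypothetical placeholders for your unlabelled sentences and your new gold-standard annotations:

import random

import spacy
from spacy.training import Example

nlp = spacy.load("es_dep_news_trf")

# raw_texts: unlabelled sentences (hypothetical placeholder)
raw_texts = ["El tren llegó a Madrid por la mañana."]
# new_examples: Example objects carrying your new annotations (placeholder)
new_examples = []

# "Revision" data: the original model's own predictions, rehearsed during
# fine-tuning so the old labels don't get forgotten
revision_data = [Example(nlp.make_doc(doc.text), doc) for doc in nlp.pipe(raw_texts)]

optimizer = nlp.resume_training()
for epoch in range(10):
    # Mix the model's old predictions in with the new gold data on every pass
    examples = revision_data + new_examples
    random.shuffle(examples)
    losses = {}
    for batch in spacy.util.minibatch(examples, size=8):
        nlp.update(batch, sgd=optimizer, losses=losses)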

If you are trying to solve a concrete problem, perhaps a better option would be to use es_dep_news_trf to pre-annotate the data, which you would then curate by relabelling the existing obl category, e.g. using the dep.correct recipe (see the sketch below). With this data you could train a smaller CNN spaCy pipeline with more fine-grained obl labels.
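
To make the pre-annotation step concrete, here is a rough sketch (untested, not an official recipe) that uses the transformer model to produce Prodigy-style dep tasks; sentences.jsonl and preannotated.jsonl are hypothetical file names:

import spacy
import srsly

nlp = spacy.load("es_dep_news_trf")

tasks = []
for eg in srsly.read_jsonl("sentences.jsonl"):  # hypothetical input file
    doc = nlp(eg["text"])
    eg["tokens"] = [
        {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
        for t in doc
    ]
    # Keep the model's predicted arcs so the parse shows up in the UI;
    # the generic "obl" arcs are what you would relabel during curation
    eg["arcs"] = [
        {"head": t.head.i, "child": t.i, "label": t.dep_}
        for t in doc
        if t.i != t.head.i  # skip the root token's self-attachment
    ]
    tasks.append(eg)

srsly.write_jsonl("preannotated.jsonl", tasks)

Note that dep.correct already does this kind of pre-annotation for you under the hood, so a standalone script like this is mainly useful if you need to customize the stream first.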

Thanks for the response @magdaaniol!

I have been trying to add the new dependencies, and I created this code in order to do it:

import prodigy
import spacy
from prodigy.components.loaders import JSONL

@prodigy.recipe("dep.manual.simple")
def dep_manual_simple(dataset, spacy_model, source):
    nlp = spacy.load(spacy_model)

    def add_tokens(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            # Tokenize with the loaded model so the tokens line up in the UI
            eg["tokens"] = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": i}
                for i, t in enumerate(doc)
            ]
            eg["arcs"] = []  # start with a completely blank parse
            yield eg

    stream = add_tokens(JSONL(source))

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "dep",
    }

However, the problem I encounter is that when I run this custom recipe, the interface shows the sentence unparsed, with only the dependencies I added available to annotate. What I'd like to do, if possible, is to mix my new dependencies with the labels that es_dep_news_trf already predicts, and then use pseudo-rehearsal to prevent catastrophic forgetting. Would this be possible, or could I just use the dep.correct recipe as is to train my dependencies from scratch and mix them in later?
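
In case it helps, this is roughly how I imagine the recipe would have to change to show the model's parse together with my new labels (just a sketch based on what I've read, I haven't gotten it to work, and the "dep.manual.mixed" name is made up):

import prodigy
import spacy
from prodigy.components.loaders import JSONL

@prodigy.recipe("dep.manual.mixed")
def dep_manual_mixed(dataset, spacy_model, source):
    nlp = spacy.load(spacy_model)

    def add_tokens(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            eg["tokens"] = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
                for t in doc
            ]
            # Pre-fill the arcs with the model's parse instead of leaving them blank
            eg["arcs"] = [
                {"head": t.head.i, "child": t.i, "label": t.dep_}
                for t in doc
                if t.i != t.head.i  # skip the root's self-attachment
            ]
            yield eg

    return {
        "dataset": dataset,
        "stream": add_tokens(JSONL(source)),
        "view_id": "dep",
        # Offer both the model's existing labels and my new fine-grained ones
        "config": {
            "labels": sorted(set(nlp.get_pipe("parser").labels) | {"pl_obl", "tm_obl"}),
        },
    }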

P.S. When I tried to use the dep.manual recipe instead, I got the following error:

TypeError: arc is undefined