Hi! I'm working on extending the es_dep_news_trf model as much as possible, and it would be very convenient if it were possible to add new dependency labels with blank values in order to train them from scratch.
An example of why I'd want to do this: making dependencies such as "obl" a little more specific by creating labels such as "pl_obl" (meaning place oblique) or "tm_obl" (meaning time oblique), and then replacing "obl" with one of these depending on the context.
Doing this in many more situations would make the dependency parser a little richer, and hopefully more helpful.
I hope I explained myself well, because I'm pretty much an amateur in NLP, and any response will be helpful!
Adding new categories to pre-trained models (especially big transformer language models) is usually pretty tricky. Since all of the pre-trained weights have to be updated based on the new signal, you are likely to run into undesirable side effects such as the "catastrophic forgetting" problem. In this case the best practice is to train via "pseudo-rehearsal", i.e. to use the original model to label examples and mix them in with your fine-tuning updates.
Please see our blog post on pseudo-rehearsal to learn more about it.
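To sketch the idea in code (this is just an illustration; the file names are placeholders and the details depend on your training setup): you run raw text through the original pipeline and store its predictions as extra training data, which you then mix in with your new annotations during fine-tuning.

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("es_dep_news_trf")

# Placeholder path -- point this at a file of raw Spanish text
with open("raw_es_sentences.txt", encoding="utf8") as f:
    lines = [line.strip() for line in f if line.strip()]

# Let the original model label the raw text: its own predictions
# become "rehearsal" examples that anchor the existing behaviour
doc_bin = DocBin()
for doc in nlp.pipe(lines):
    doc_bin.add(doc)
doc_bin.to_disk("rehearsal_data.spacy")

When you fine-tune, you mix rehearsal_data.spacy in with your new fine-grained obl annotations, so that every update also reinforces the behaviour the model already has.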
If you are trying to solve a concrete problem, a better option might be to use es_dep_news_trf to pre-annotate the data, which you would then curate by re-annotating the existing obl category, e.g. using the dep.correct recipe. With this data you could train a smaller CNN spaCy pipeline with the more fine-grained obl labels.
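The invocation would look roughly like this (the dataset name, source file and label set here are placeholders for your own):

prodigy dep.correct obl_deps es_dep_news_trf ./news_es.jsonl --label obl,pl_obl,tm_obl

The model's predicted obl arcs come up pre-filled, so you only need to relabel the ones that should become pl_obl or tm_obl.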
I have been trying to add the new dependencies, and I created this code in order to do it:
import prodigy
import spacy
from prodigy.components.loaders import JSONL

@prodigy.recipe("dep.manual.simple")
def dep_manual_simple(dataset, spacy_model, source):
    nlp = spacy.load(spacy_model)

    def add_tokens(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            # Expose the model's tokenization to the UI
            eg["tokens"] = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": i}
                for i, t in enumerate(doc)
            ]
            # Start with no arcs, so every dependency is drawn by hand
            eg["arcs"] = []
            yield eg

    stream = add_tokens(JSONL(source))
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "dep",
    }
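The source is a plain JSONL file with one sentence per line, e.g. {"text": "Ayer llegué a Madrid."}.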
However, the problem I encounter is that when running this custom recipe, the interface shows the sentence unparsed, with only the dependencies I added available to annotate. What I'd like to do, if it were possible, is to mix my dependencies with the ones es_dep_news_trf already predicts, and then use pseudo-rehearsal to prevent catastrophic forgetting. Would this be possible, or should I just use the dep.correct recipe as is to train my dependencies from scratch and mix them in later?
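One thing I considered is pre-filling the arcs with the model's own predictions instead of leaving them empty, something like this (untested, and I'm not sure this is the exact arc format the dep view expects):

def add_tokens(stream):
    for eg in stream:
        doc = nlp(eg["text"])
        eg["tokens"] = [
            {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": i}
            for i, t in enumerate(doc)
        ]
        # Show the model's existing parse alongside my new labels
        eg["arcs"] = [
            {"head": t.head.i, "child": t.i, "label": t.dep_}
            for t in doc
            if t.head.i != t.i  # the root points to itself, so skip it
        ]
        yield eg

Would that be the right direction?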
P.S. When trying to use the built-in dep.manual recipe instead, I got the following error: