For example, I would like to use the POS and DEP information to change fragments like "large and medium banks" to "large banks and medium banks" so that I can annotate each of "large banks" and "medium banks" with separate sets of NE or custom tags.
It's funny that you mention this, because I've been optimistic about that idea for quite some time. I can't promise a specific date for when we'll ship it, but I want to provide an option for this in spaCy, probably some time this year. It will involve writing rules on top of the dependency parse, likely using the dependency matcher. Currently I don't have a specific set of rules for you, or I would have shipped it already.
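To give a rough idea of what a rule on top of the parse could look like, here is a minimal sketch using the `DependencyMatcher`. The pattern name `COORD_MODIFIERS` and the node names (`noun`, `first_mod`, `second_mod`) are made up for illustration. The `Doc` is annotated by hand with an assumed analysis of the phrase so the example runs without a trained pipeline; a real parser may attach the conjunction differently.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated stand-in for a parsed Doc (assumed analysis, not model output).
doc = Doc(
    Vocab(),
    words=["large", "and", "medium", "banks"],
    spaces=[True, True, True, False],
    pos=["ADJ", "CCONJ", "ADJ", "NOUN"],
    heads=[3, 0, 0, 3],  # absolute index of each token's head
    deps=["amod", "cc", "conj", "ROOT"],
)

# A noun with an adjectival modifier that is itself coordinated with
# another modifier, as in "large and medium banks".
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "first_mod",
     "RIGHT_ATTRS": {"DEP": "amod"}},
    {"LEFT_ID": "first_mod", "REL_OP": ">", "RIGHT_ID": "second_mod",
     "RIGHT_ATTRS": {"DEP": "conj"}},
]

matcher = DependencyMatcher(doc.vocab)
matcher.add("COORD_MODIFIERS", [pattern])

matches = matcher(doc)
for match_id, token_ids in matches:
    # token_ids follow the order of the nodes in the pattern.
    noun, first_mod, second_mod = (doc[i] for i in token_ids)
    print(f"{first_mod.text} {noun.text} / {second_mod.text} {noun.text}")
```

This only finds the construction. Turning the matches into rewritten text is a separate step, and a single pattern like this will miss longer lists and other attachment choices.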
One thing I've always wanted to try is using a large language model like GPT-3 for this. It seems like a good case for using prompts, because I think the model would find it easy to learn the input/output pairs you want. I haven't tried this though.
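As an untested illustration of the prompting idea, a few-shot prompt could be assembled from input/output pairs like this. The helper name `build_prompt`, the instruction wording, and the second example pair are all hypothetical; only the first pair comes from the question. The sketch just builds the prompt string and does not call any model.

```python
from typing import List, Tuple

# First pair is the example from the question; the second is made up.
EXAMPLES: List[Tuple[str, str]] = [
    ("large and medium banks", "large banks and medium banks"),
    ("red and white wines", "red wines and white wines"),
]


def build_prompt(text: str, examples: List[Tuple[str, str]] = EXAMPLES) -> str:
    """Format the example pairs plus the new input as a few-shot prompt."""
    lines = [
        "Rewrite each phrase so that every modifier gets its own copy of the noun.",
        "",
    ]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    # Leave the final output empty for the model to complete.
    lines += [f"Input: {text}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt("small and large firms")
print(prompt)
```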
In general I think the processing pipeline you want is something like this:
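from typing import Iterable
import spacy
from spacy.tokens import Doc

def correct_text(doc: Doc) -> str:
    # The rewriting rules go here; this is the part that still has to be written.
    raise NotImplementedError

def parse_corrected(nlp: spacy.Language, messy_texts: Iterable[str]) -> Iterable[Doc]:
    messy_docs = nlp.pipe(messy_texts)
    # You probably want to save these out and look at them
    corrected_texts = (correct_text(doc) for doc in messy_docs)
    corrected_docs = nlp.pipe(corrected_texts)
    yield from corrected_docs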
So the hard part is the correct_text function. I'd like to have more about this in spaCy, but we don't yet.
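As a starting point, here is a minimal sketch of one possible `correct_text` for the two-modifier case from the question. It is not a spaCy feature, and it assumes the parse attaches the first modifier to the noun as `amod` and the second modifier to the first as `conj`. The `Doc` is annotated by hand under that assumption so the example runs without a trained pipeline.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def correct_text(doc: Doc) -> str:
    """Repeat the noun after a modifier that is coordinated with another one."""
    pieces = []
    for token in doc:
        pieces.append(token.text)
        is_shared_modifier = (
            token.dep_ == "amod"
            and token.head.i > token.i  # the modified noun comes later
            and any(child.dep_ == "conj" for child in token.children)
        )
        if is_shared_modifier:
            # "large and medium banks" -> "large banks and medium banks"
            pieces.append(" " + token.head.text)
        pieces.append(token.whitespace_)
    return "".join(pieces)


# Hand-annotated stand-in for a parsed Doc (assumed analysis, not model output).
doc = Doc(
    Vocab(),
    words=["large", "and", "medium", "banks"],
    spaces=[True, True, True, False],
    heads=[3, 0, 0, 3],  # absolute index of each token's head
    deps=["amod", "cc", "conj", "ROOT"],
)
corrected = correct_text(doc)
print(corrected)
```

The rule only copies the noun after the first modifier, so lists of three or more modifiers, compound nouns, and modifiers that follow the noun would need extra handling. That's why it's worth saving the corrected texts and reading through them before parsing again.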