I would like to preprocess a document to make it less ambiguous and easier to annotate later in the pipeline.
For example, I would like to use the POS and DEP information to change fragments like "large and medium banks" to "large banks and medium banks" so that I can annotate each of "large banks" and "medium banks" with separate sets of NE or custom tags.
I would also like to be able to fix wrong POS and DEP annotations in slightly incorrect sentences in a doc, for example sentences that misuse apostrophes, as in "ABCD has it's own technology", "my friends bike", "my friend's bike is green" and "my right legs a little tender".
My question is: what is the recommended way to do this?
For example, I would like to use the POS and DEP information to change fragments like "large and medium banks" to "large banks and medium banks" so that I can annotate each of "large banks" and "medium banks" with separate sets of NE or custom tags.
It's funny that you mention this, because I've been optimistic about that idea for quite some time. I can't promise a specific date for when we'll ship it, but I want to provide an option for this in spaCy, probably some time this year. It will involve writing rules on top of the dependency parse, likely using the dependency matcher. Currently I don't have a specific set of rules for you, or I would have shipped it already.
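To make the idea of a rule over the dependency parse concrete, here is a minimal sketch for the "large and medium banks" case. It is not a spaCy feature and does not use the dependency matcher. FakeToken, build_tokens and expand_coordinated_modifiers are hypothetical names, and the stand-in tokens only mimic the text, dep_, i and head attributes of a spaCy Token so that the example runs without a trained model. The amod/cc/conj labels are an assumption about how the parser would analyse the phrase.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence, Tuple


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    dep_: str
    i: int
    head: Optional["FakeToken"] = None


def build_tokens(rows: Sequence[Tuple[str, str, int]]) -> List[FakeToken]:
    """Build tokens from (text, dependency label, head index) triples."""
    tokens = [FakeToken(text, dep, i) for i, (text, dep, _) in enumerate(rows)]
    for tok, (_, _, head_index) in zip(tokens, rows):
        tok.head = tokens[head_index]
    return tokens


def expand_coordinated_modifiers(doc: Iterable) -> str:
    """Repeat the head noun after a modifier that is coordinated with a later one."""
    tokens = list(doc)
    out = []
    for tok in tokens:
        out.append(tok.text)
        # Only consider adjectival modifiers that precede their head noun.
        if tok.dep_ != "amod" or tok.head.i <= tok.i:
            continue
        # Fire only when another modifier hangs off this one via "conj".
        if any(t.dep_ == "conj" and t.head.i == tok.i for t in tokens):
            out.append(tok.head.text)
    # Joining on single spaces is a simplification; real text needs its
    # original whitespace restored.
    return " ".join(out)


# Assumed parse of "large and medium banks".
tokens = build_tokens([
    ("large", "amod", 3),
    ("and", "cc", 0),
    ("medium", "conj", 0),
    ("banks", "ROOT", 3),
])
print(expand_coordinated_modifiers(tokens))  # large banks and medium banks
```

The rule only covers two coordinated modifiers in front of a noun. Longer lists ("large, medium and small banks"), punctuation and coordinated nouns would each need their own handling, which is why a general rule set is hard to write.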
One thing I've always wanted to try is using a large language model like GPT-3 for this. It seems like a good case for using prompts, because I think the model would find it easy to learn the input/output pairs you want. I haven't tried this though.
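As a rough illustration of the prompt idea, the sketch below only assembles a few-shot prompt from input/output pairs and does not call any model or API. build_prompt and EXAMPLES are hypothetical names, the instruction wording is invented, and the corrected outputs are an assumption about what the rewritten fragments should look like.

```python
from typing import List, Sequence, Tuple

# Input/output pairs based on the fragments in the question. The corrected
# forms are assumptions about the desired result.
EXAMPLES: List[Tuple[str, str]] = [
    ("large and medium banks", "large banks and medium banks"),
    ("ABCD has it's own technology", "ABCD has its own technology"),
    ("my friends bike", "my friend's bike"),
]


def build_prompt(text: str, examples: Sequence[Tuple[str, str]] = EXAMPLES) -> str:
    """Format the example pairs and the new input as a few-shot prompt."""
    lines = ["Rewrite each input so that it is unambiguous and correctly punctuated.", ""]
    for messy, clean in examples:
        lines += [f"Input: {messy}", f"Output: {clean}", ""]
    # The prompt ends with an open "Output:" for the model to complete.
    lines += [f"Input: {text}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt("my right legs a little tender")
print(prompt)
```

Whatever text the model returns for the final "Output:" would then need to be checked before it replaces the original, since nothing guarantees the rewrite preserves the meaning.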
In general I think the processing pipeline you want is something like this:
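    from typing import Iterable, Iterator

    from spacy.language import Language
    from spacy.tokens import Doc

    def correct_text(doc: Doc) -> str:
        ...

    def parse_corrected(nlp: Language, messy_texts: Iterable[str]) -> Iterator[Doc]:
        # First pass: parse the raw texts so correct_text can use POS and DEP.
        messy_docs = nlp.pipe(messy_texts)
        # You probably want to save these out and look at them
        corrected_texts = (correct_text(doc) for doc in messy_docs)
        # Second pass: parse the corrected texts.
        corrected_docs = nlp.pipe(corrected_texts)
        yield from corrected_docs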
So the hard part is the correct_text function. I'd like to have more support for this in spaCy, but we don't yet.