Modifying a document based on POS and DEP

I would like to preprocess a document to make it less ambiguous and easier to annotate later in the pipeline.

For example, I would like to use the POS and DEP information to change fragments like "large and medium banks" to "large banks and medium banks" so that I can annotate each of "large banks" and "medium banks" with separate sets of NE or custom tags.

Also, I would like to be able to fix wrong POS and DEP annotations in slightly incorrect sentences in a doc, for example the wrong usage of apostrophes as in "ABCD has it's own technology", "my friends bike", "my friend's bike is green", and "my right legs a little tender".

My question is: what is a recommended way to do this?

> For example, I would like to use the POS and DEP information to change fragments like "large and medium banks" to "large banks and medium banks" so that I can annotate each of "large banks" and "medium banks" with separate sets of NE or custom tags.

It's funny that you mention this, because I've been optimistic about that idea for quite some time. I can't promise a specific date for when we'll ship it, but I want to provide an option for this in spaCy, probably some time this year. It will involve writing rules on top of the dependency parse, likely using the dependency matcher. Currently I don't have a specific set of rules for you, or I would have shipped it already.
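To give a rough idea of what such a rule could look like, here is a minimal sketch with the DependencyMatcher. It only covers the single pattern from your example (two coordinated adjectives modifying one noun), and the model choice (en_core_web_sm), the function name expand_coordination and the whitespace handling are illustrative assumptions, not a finished recipe:

import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # Anchor on the noun being modified
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    # First adjective, attached to the noun as "amod"
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "adj1",
     "RIGHT_ATTRS": {"DEP": "amod", "POS": "ADJ"}},
    # Second adjective, conjoined to the first
    {"LEFT_ID": "adj1", "REL_OP": ">", "RIGHT_ID": "adj2",
     "RIGHT_ATTRS": {"DEP": "conj", "POS": "ADJ"}},
]
matcher.add("COORDINATED_ADJS", [pattern])

def expand_coordination(doc: Doc) -> str:
    matches = matcher(doc)
    if not matches:
        return doc.text
    # Token indices come back in the same order as the pattern entries
    _, (noun_i, adj1_i, adj2_i) = matches[0]
    noun, adj1, adj2 = doc[noun_i], doc[adj1_i], doc[adj2_i]
    before = doc[: adj1.i].text_with_ws
    after = doc[noun.i + 1 :].text
    rewritten = f"{adj1.text} {noun.text} and {adj2.text} {noun.text}{noun.whitespace_}"
    return before + rewritten + after

doc = nlp("They lend to large and medium banks.")
print(expand_coordination(doc))  # "They lend to large banks and medium banks."

A real rule set would need to handle more than two conjuncts, noun coordination, overlapping matches and so on, which is exactly why I don't have a ready-made set of rules to hand you yet.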

One thing I've always wanted to try is using a large language model like GPT-3 for this. It seems like a good case for using prompts, because I think the model would find it easy to learn the input/output pairs you want. I haven't tried this though.
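To illustrate, a few-shot prompt built from the examples in this thread might look something like this (just a sketch; you'd still have to pick a model and an API to send it to):

# A hypothetical few-shot prompt; the examples come from this thread.
PROMPT = """Rewrite the text so that coordinated modifiers are expanded
and apostrophes are used correctly. Keep everything else unchanged.

Text: large and medium banks
Rewritten: large banks and medium banks

Text: ABCD has it's own technology
Rewritten: ABCD has its own technology

Text: my right legs a little tender
Rewritten: my right leg's a little tender

Text: {input_text}
Rewritten:"""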

In general I think the processing pipeline you want is something like this:

from typing import Iterable

import spacy
from spacy.tokens import Doc

def correct_text(doc: Doc) -> str:
    ...

def parse_corrected(nlp: spacy.Language, messy_texts: Iterable[str]) -> Iterable[Doc]:
    messy_docs = nlp.pipe(messy_texts)
    # You probably want to save these out and look at them
    corrected_texts = (correct_text(doc) for doc in messy_docs)
    corrected_docs = nlp.pipe(corrected_texts)
    yield from corrected_docs

So the hard part is the correct_text function. I'd like spaCy to offer more support for this, but we don't have it yet.
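Once correct_text is filled in (for example with dependency-matcher rules like the sketch above), calling the pipeline would look roughly like this:

nlp = spacy.load("en_core_web_sm")
messy_texts = ["They lend to large and medium banks.", "ABCD has it's own technology."]
for doc in parse_corrected(nlp, messy_texts):
    print(doc.text)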