Here's an idea for your first question, regarding preserving information (if I understand you correctly). I had a similar issue, except mine had to do with the formatting of the document. I needed to strip some HTML formatting, process the text through a pipeline, then reconstruct the formatting from the output. I imagine it might be similar to what you're looking to do: change some letter accents, process the data, and reconstruct it back to its original form.
The approach I took was to create a custom token extension and store the original information in the extension (https://spacy.io/api/token#set_extension). Then I altered the Doc object and removed the original information. Let me give you an example of what I mean.
I changed the formatting to a token that I knew I would not find anywhere in my corpus. In your case, in the preprocessing stage, let's say you want to change "ACME Società" to "ACME society" before you run it through your NER or CAT pipeline. What I did was change it to something like "ACME society /og_word/=Società". During the tokenizing pipeline I would then have three tokens: ["ACME", "society", "/og_word/=Società"]. I used a custom token extension to store the information on the previous token. Here is some working code that illustrates what I mean.
import spacy
from spacy.tokens import Token
from spacy.language import Language

# Custom extension that will hold the original (pre-normalization) word
Token.set_extension("changed_word", default="none")

@Language.component("formatting")
def formatting(doc):
    for token in doc:
        if "/og_word/=" in token.text:
            # Store the original word on the token that precedes the marker
            text = token.text.replace("/og_word/=", "")
            doc[token.i - 1]._.changed_word = text
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("formatting", after="tok2vec")
doc = nlp("The company ACME society /og_word/=Società")
for t in doc:
    print(t.text, t._.changed_word)
Output is
The none
company none
ACME none
society Società
/og_word/=Società none
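For completeness, here is a rough sketch of what the preprocessing step that inserts the /og_word/= marker could look like. The replacements mapping and the add_og_markers helper are my own hypothetical names, just to illustrate the idea; you'd build the mapping however suits your data:

# Hypothetical preprocessing step: swap accented words for their
# normalized forms and append an /og_word/= marker holding the original
replacements = {"Società": "society"}  # assumption: your own word mapping

def add_og_markers(text):
    out = []
    for word in text.split():
        if word in replacements:
            out.append(replacements[word])
            out.append("/og_word/=" + word)
        else:
            out.append(word)
    return " ".join(out)

print(add_og_markers("The company ACME Società"))
# The company ACME society /og_word/=Società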
You've now stored the original word on the previous token. The only thing left to do is to edit the Doc object in the "formatting" component (I know editing the Doc object is frowned upon). I've borrowed the code from this GitHub gist: "A function to delete tokens from a spacy Doc object without losing associated information (PartOfSpeech, Dependance, Lemma, ...)".
Final working code:
import spacy
import numpy as np
from spacy.tokens import Token, Doc
from spacy.language import Language
from spacy.attrs import (LOWER, POS, ENT_TYPE, IS_ALPHA, DEP, LEMMA,
                         IS_PUNCT, IS_DIGIT, IS_SPACE, IS_STOP)

Token.set_extension("changed_word", default="none")

@Language.component("formatting")
def formatting(doc):
    list_attr = [LOWER, POS, ENT_TYPE, IS_ALPHA, DEP, LEMMA,
                 IS_PUNCT, IS_DIGIT, IS_SPACE, IS_STOP]
    index_to_del = []
    for token in doc:
        if "/og_word/=" in token.text:
            # Store the original word on the preceding token, then mark
            # the marker token itself for deletion
            text = token.text.replace("/og_word/=", "")
            doc[token.i - 1]._.changed_word = text
            index_to_del.append(token.i)
    np_array = doc.to_array(list_attr)
    # Boolean mask: True = keep the token (np.bool was removed from
    # newer NumPy versions, so use the builtin bool)
    mask_to_del = np.ones(len(np_array), bool)
    for index in index_to_del:
        mask_to_del[index] = 0
    np_array_2 = np_array[mask_to_del]
    # Rebuild the Doc without the marker tokens
    doc2 = Doc(doc.vocab, words=[t.text for t in doc if t.i not in index_to_del])
    doc2.from_array(list_attr, np_array_2)
    # Map surviving token indices back to their old positions
    arr = np.arange(len(doc))
    new_index_to_old = arr[mask_to_del]
    doc_offset_2_token = {tok.idx: tok.i for tok in doc}    # needed for the user data
    doc2_token_2_offset = {tok.i: tok.idx for tok in doc2}  # needed for the user data
    new_user_data = {}
    for ((prefix, ext_name, offset, x), val) in doc.user_data.items():
        old_token_index = doc_offset_2_token[offset]
        new_token_index = np.where(new_index_to_old == old_token_index)[0]
        if new_token_index.size == 0:  # case: this index was deleted
            continue
        new_char_index = doc2_token_2_offset[new_token_index[0]]
        new_user_data[(prefix, ext_name, new_char_index, x)] = val
    doc2.user_data = new_user_data
    return doc2
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("formatting", after="tok2vec")
doc = nlp("The company ACME society /og_word/=Società")
for t in doc:
    print(t.text, t._.changed_word)
Final output:
The none
company none
ACME none
society Società
Now you have the original word stored in the token extension, and the marker token is removed before the doc gets processed by the NER/CAT pipeline.
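If you later want to reconstruct the original text from the stored extension (the round-trip you mentioned in your question), a minimal sketch could look like this. The reconstruct function is my own hypothetical helper, and joining with single spaces is a simplification that ignores the original whitespace:

def reconstruct(doc):
    # Use the stored original wherever one was recorded
    return " ".join(
        t._.changed_word if t._.changed_word != "none" else t.text
        for t in doc
    )

print(reconstruct(doc))
# The company ACME Società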
Hope that helps!
*Edit: I had the original and the changed words switched.