Best Approach for My Project

Hello,

I am working on a project that involves company names from all over the world as text.
We basically want to define a pipeline in spaCy that includes cleaning steps such as removing accents, special characters and stop words, then lowercasing everything, and finally some special lemmatization. My first question is: what is the best way to do this?
What we have done so far is define some plain Python functions (lowercasing, replacing characters, etc.) and wrap them in spaCy components. The problem is that this way I don't preserve any information, so I am not able to reconstruct the original company name. How can I do that while still integrating everything with spaCy?

The second question is related to tagging the different components of each company name. We are using Prodigy for that, and our idea is to manually tag a certain amount of company names and then train an NER model. Once the model is trained, is it possible to switch to an active learning approach? By that I mean that I would only need to check and tag the company names the NER model is uncertain about, while the ones it predicts with high confidence wouldn't need any checking. Is there any other way I can speed up the process?

Thank you very much for your help, it's really appreciated and important to us.

Regards,
Mauro

Hi! Are you sure you want to do this processing as part of your pipeline and not as a post-process? Normalising text can be reasonable but removing stop words, lowercasing etc. typically isn't recommended for modern neural network models because you're destroying a lot of potentially very useful information this way. For instance, capitalisation and stop words can definitely be relevant for text classification, identifying named entities etc. So I'd recommend keeping your text the way it is, only focusing on normalisation as a pre-process and performing lowercasing etc. at the very end if you really need it.
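
For example, here's a minimal sketch of what I mean (made-up company name, and the lowercasing only happens after the pipeline has run on the raw text):

import spacy

nlp = spacy.load("en_core_web_sm")

# Keep casing, accents and stop words intact for the statistical components
doc = nlp("ACME Società per Azioni")

# Only derive lowercased / filtered forms at the very end, e.g. for bag-of-words features
bow_tokens = [t.lower_ for t in doc if not t.is_punct]
print(bow_tokens)  # ['acme', 'società', 'per', 'azioni']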

Yes, this approach sounds reasonable. I'd probably recommend collecting a small dataset of examples manually first, to get a feeling for the data. Then you can train your first model and check whether it's learning, or even run a diagnostic like Prodigy's train-curve to see if more similar examples are likely to improve the model further. You can then use workflows like ner.correct with your model in the loop to let it highlight entities for you, so you only have to correct mistakes and can see how your model does on unseen examples. You can also improve it further using a workflow like ner.teach, which will only ask you about specific entity spans with uncertain scores, which can be helpful for edge cases.
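
The commands could look roughly like this (dataset names, labels and file names are placeholders, and the exact arguments depend on your Prodigy version):

# Label a first batch manually
prodigy ner.manual company_names blank:en ./names.jsonl --label COMPANY,LEGAL_FORM

# Check whether more annotations are likely to help
prodigy train-curve --ner company_names

# Model in the loop: correct its predictions on unseen examples
prodigy ner.correct company_names_correct ./your-trained-model ./more_names.jsonl --label COMPANY,LEGAL_FORM

# Binary, uncertainty-based suggestions for edge cases
prodigy ner.teach company_names_teach ./your-trained-model ./more_names.jsonl --label COMPANY,LEGAL_FORM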


Hi @ines

Thank you for your reply, that is very useful.
I just want to be sure I got everything, and I'll also explain better what we aim to do.

What we want to achieve is to classify company names by their legal form, so let's say the input is "Apple Inc" and our model classifies it as incorporated/corporation.
To do so we basically want to use a mixed model: we want to leverage the NER model and maybe also include a bag of words. It is for the bag of words that I think the text processing is important, because we are dealing with multiple jurisdictions.
So for example we want to reduce "limited" and "ltd" to the same token, or we want the Italian "spa" and "s.p.a." to be the same.
Also, in Italy for example the word for society is written "Società", so removing accents can be important. The same goes for lowercasing, I believe, so that "Limited" and "limited" are treated as the same word.
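
Just to make it concrete, this is roughly the kind of normalisation I have in mind for the bag of words (the variant mapping is made up for illustration):

import unicodedata

# Illustrative mapping of legal-form variants to one canonical token
LEGAL_FORM_VARIANTS = {
    "limited": "ltd",
    "ltd.": "ltd",
    "s.p.a.": "spa",
    "s.p.a": "spa",
}

def normalise_for_bow(token_text):
    # Lowercase and strip accents, e.g. "Società" -> "societa"
    lowered = token_text.lower()
    no_accents = "".join(
        c for c in unicodedata.normalize("NFKD", lowered)
        if not unicodedata.combining(c)
    )
    return LEGAL_FORM_VARIANTS.get(no_accents, no_accents)

print([normalise_for_bow(t) for t in ["Limited", "Ltd.", "S.p.A.", "Società"]])
# -> ['ltd', 'ltd', 'spa', 'societa']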

Having said that, your suggestion is really useful, and please correct me if I am wrong, but the best way to approach the problem could be to:

  • take the company names as they are
  • (possibly apply some lemmatization)
  • train the NER model and tag the text
  • process the text to reduce a bit the vocabulary
  • build a bag of words
  • use the features to train a classifier

Does it sound reasonable to you?
Also, can I integrate the processing for the bag of words (removal of accents, lowercasing, etc.) into spaCy?

Thank you very much!

Here's an idea for your first question, regarding preserving information (if I understand you correctly). I had a similar issue, except mine had to do with the formatting of the document. I needed to strip some HTML formatting, process the text through a pipeline, then reconstruct the formatting in the output. I imagine it might be similar to what you're looking to do: change some letter accents, process the data, and reconstruct it back to its original form.

The approach I took was to create a custom token extension and store the original information in the extension (https://spacy.io/api/token#set_extension). Then I altered the Doc object and removed the extra tokens. Let me give you an example of what I mean.

I encoded the original formatting as a token that I knew I would not find in my corpus at all. In your case, let's say that in the preprocessing stage you want to change "ACME Società" to "ACME society" before you run it through your NER or textcat pipeline. What I did was change it to something like "ACME society /og_word/=Società". So after tokenization I would have three tokens ["ACME", "society", "/og_word/=Società"]. I then used a custom token extension to store that information on the previous token. Here is some working example code that illustrates what I mean.

import spacy
from spacy.tokens import Token
from spacy.language import Language

# Custom extension that will hold the original (pre-normalisation) word
Token.set_extension("changed_word", default="none")

@Language.component("formatting")
def formatting(doc):
    # Find the marker tokens and copy the original word onto the previous token
    for token in doc:
        if "/og_word/=" in token.text:
            text = token.text.replace("/og_word/=", "")
            doc[token.i - 1]._.changed_word = text
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("formatting", after="tok2vec")

doc = nlp("The company ACME society /og_word/=Società")
for t in doc:
    print(t.text, t._.changed_word)

Output is

The none
company none
ACME none
society Società
/og_word/=Società none

You've now stored the original form on the previous token. The only thing left to do is to remove the marker tokens from the Doc in the "formatting" component (I know it's frowned upon to edit the Doc object). I've borrowed the deletion code from this gist: A function to delete tokens from a spacy Doc object without losing associated information (PartOfSpeech, Dependance, Lemma, ...) · GitHub

Final working code:

import spacy
import numpy as np
from spacy.tokens import Token, Doc
from spacy.language import Language
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA, DEP, LEMMA, IS_PUNCT, IS_DIGIT, IS_SPACE, IS_STOP

# Custom extension that will hold the original (pre-normalisation) word
Token.set_extension("changed_word", default="none")

@Language.component("formatting")
def formatting(doc):
    # Token attributes to carry over into the rebuilt Doc
    list_attr = [LOWER, POS, ENT_TYPE, IS_ALPHA, DEP, LEMMA, IS_PUNCT, IS_DIGIT, IS_SPACE, IS_STOP]

    index_to_del = []

    # Copy the original word onto the previous token and remember the marker token's index
    for token in doc:
        if "/og_word/=" in token.text:
            text = token.text.replace("/og_word/=", "")
            doc[token.i - 1]._.changed_word = text
            index_to_del.append(token.i)

    # Export the attributes and mask out the marker tokens
    np_array = doc.to_array(list_attr)
    mask_to_del = np.ones(len(np_array), dtype=bool)
    for index in index_to_del:
        mask_to_del[index] = 0

    np_array_2 = np_array[mask_to_del]

    # Rebuild the Doc without the marker tokens and restore the attributes
    doc2 = Doc(doc.vocab, words=[t.text for t in doc if t.i not in index_to_del])
    doc2.from_array(list_attr, np_array_2)

    # Remap the user data (which holds the custom extension values) to the new character offsets
    arr = np.arange(len(doc))
    new_index_to_old = arr[mask_to_del]
    doc_offset_2_token = {tok.idx: tok.i for tok in doc}  # needed for the user data
    doc2_token_2_offset = {tok.i: tok.idx for tok in doc2}  # needed for the user data
    new_user_data = {}
    for ((prefix, ext_name, offset, x), val) in doc.user_data.items():
        old_token_index = doc_offset_2_token[offset]
        new_token_index = np.where(new_index_to_old == old_token_index)[0]
        if new_token_index.size == 0:  # this token was deleted
            continue
        new_char_index = doc2_token_2_offset[new_token_index[0]]
        new_user_data[(prefix, ext_name, new_char_index, x)] = val
    doc2.user_data = new_user_data

    return doc2

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("formatting", after="tok2vec")

doc = nlp("The company ACME society /og_word/=Società")
for t in doc:
    print(t.text, t._.changed_word)

Final output:

The none
company none
ACME none
society Società

Now you have the original form stored in the token extension before the text gets processed by the NER/textcat pipeline.
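
And since your original question was about being able to reconstruct the original company name, here's a quick sketch of how you could read it back out of the returned doc (joining on single spaces, just for illustration):

# Rebuild the original string from the processed doc
original = " ".join(
    t._.changed_word if t._.changed_word != "none" else t.text
    for t in doc
)
print(original)  # The company ACME Società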

Hope that helps!

*Edit: had the original and the changed words switched.