Categorisation in foreign languages - are word vectors enough?

Hi, thanks for the inspiring work!

I’m doing categorisation of Twitter/Facebook messages in Finnish (slightly unfortunate, I know…)

There are some precompiled sets of word vectors that should reflect this type of corpus fairly well. I have taken some of these, converted them into the format spaCy supports (with spacy init-model), and then tried using the resulting model with the custom vectors in terms.teach and textcat.teach.
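
For reference, the conversion step I mean is roughly the following (the paths and the vectors file are just placeholders from my setup):

```
python -m spacy init-model fi ./fi_vectors_model --vectors-loc finnish_vectors.txt.gz
```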

The categorisation interface works fine, but the scores feel a bit random even after a few hundred categorisations. My first impression is that the scores tend to decrease for all texts, and the model doesn’t seem to discriminate between texts consistently.

My questions:

  1. Is it sufficient to just pick up custom word vectors and start categorising for a language like Finnish? As I understand it, spaCy supports Finnish in a rudimentary way, and e.g. a tokenizer seems to be attached to the model after running spacy init-model with fi as a parameter. Does the spaCy model work for text categorisation if the more complex modelling of the language (like tag maps etc.) isn’t supported?

  2. Is there a way to debug or examine the models that run in the background of textcat.teach or that are created by textcat.batch-train? From the spaCy documentation, it seems the default architecture is an ensemble - does that mean that what happens in the background is a combination of a bag-of-words model and a neural network running on the word vectors? Would the bag-of-words part require lemmatisation to work properly, which I imagine isn’t supported out of the box by the spaCy model? And is there a way to check that the created models are actually using the custom word vectors? (The kind of check I have in mind is sketched after this list.)
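
To make that last question concrete, this is roughly the sanity check I had in mind - just a rough sketch, where fi_vectors_model is the output directory from init-model:

```python
import spacy

# load the model produced by spacy init-model (placeholder path)
nlp = spacy.load("./fi_vectors_model")

# number of vectors loaded and their dimensionality
print(nlp.vocab.vectors.shape)

# does a common Finnish word actually get a non-zero vector?
doc = nlp("hyvä")
print(doc[0].has_vector, doc[0].vector_norm)
```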

Sorry for the many questions, and many thanks for your help :slight_smile:

I think short Finnish texts are probably a pretty hard case for the current model, and it would need a lot of examples to do well. You might need to use a custom pipeline that wraps other NLP or machine learning tools. We’d like to improve support, but as I’m sure you know, Finnish poses particular difficulties.

My understanding is that in Finnish, a social media post might consist of just one or two tokens, with morphological suffixes filling in a lot of the information that would be coded into function words in a language such as English. This is bad news for NLP techniques that lean on token identity as a strong meaning cue. I suspect that your model is struggling with the sparsity of the vocabulary and the low number of tokens per text, and that these together are making your task difficult to learn.

You should probably try switching off the active learning, and focus on using Prodigy just to label the texts, initially without much support from a predictive model. A patterns file might work well as a limited form of assistance, depending on your use case.
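
If you do try patterns, the file is just JSONL with one match pattern per line, which you can pass to the recipe with --patterns. Something like this (the labels and words here are made-up examples, not a recommendation):

```
{"label": "SPORTS", "pattern": [{"lower": "jääkiekko"}]}
{"label": "POLITICS", "pattern": [{"lower": "eduskunta"}]}
```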

Once you have some data annotated, you could try out a few machine learning techniques: the ULMFiT implementation in Fast AI, or the language model pretraining that comes with spaCy v2.1 (which Prodigy will soon support too). You might also find some Finnish-specific NLP packages that are better tuned for Finnish text classification.
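
If you want to experiment with the spaCy pretraining, the command looks roughly like this (the file names are placeholders: texts.jsonl is raw text with a "text" field per line, and the vectors model could be the one you built with init-model):

```
python -m spacy pretrain texts.jsonl ./fi_vectors_model ./pretrained
```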

Another thing you can try is using a morphological analyser, such as the one provided by the Stanford NLP package. You could try lemmatizing the text and inserting some of the morphological features as tokens. This might give the textcat model an input sequence that’s a bit less sparse, which may perform better.
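
A rough sketch of that idea, assuming the stanfordnlp package with its Finnish UD models downloaded (exactly which features you fold back in is up to you):

```python
import stanfordnlp

# one-time download of the Finnish models: stanfordnlp.download('fi')
nlp = stanfordnlp.Pipeline(lang="fi")

def lemmas_plus_feats(text):
    doc = nlp(text)
    out = []
    for sentence in doc.sentences:
        for word in sentence.words:
            out.append(word.lemma)
            # append selected morphological features as extra pseudo-tokens
            if word.feats and word.feats != "_":
                out.extend(word.feats.split("|"))
    return " ".join(out)

print(lemmas_plus_feats("Kirjoitin viestin eilen."))
```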

OK, thanks for the suggestions - I think it will work fine for us to use Prodigy without active learning and with a custom model. I won’t try to understand the innards of the spaCy model, then, if other models might perform better :slight_smile: