Working with languages not yet supported by Spacy

textcat
spacy
solved

(Espen Jütte) #1

I guess this is a mixed spaCy/Prodigy crossover question, but I’m looking into working with Norwegian language models, specifically for text classification tasks. As far as I can see, support for this in spaCy is pretty sparse (it briefly appeared in version 1.9, I think?).

What do I have to add to spaCy to make a minimally working Norwegian model for text classification in Prodigy? I recently saw that you could import FastText vectors, which might take care of a bit of the work.

(There is also the polyglot project, which does support Norwegian for most NLP tasks. Is there a way to use this as a base for a spaCy model? http://polyglot.readthedocs.io )


(Ines Montani) #2

We’ve actually had some pretty good contributions to the Norwegian language data, thanks to the community – so tokenization should work quite well out-of-the-box. Any model you save out from spaCy can be used directly with Prodigy – so you can easily get a blank Norwegian model, and then use Prodigy to add and train the TextCategorizer component:

from spacy.lang.nb import Norwegian
nlp = Norwegian()  # alternatively: nlp = spacy.blank('nb')
nlp.to_disk('/path/to/nb-model')
prodigy textcat.teach nb_dataset /path/to/nb-model my_data.jsonl --label SOME_LABEL

Vectors are always nice, though. You can either import the FastText ones, or train your own. If you have a corpus with lots of Norwegian text, you can use the terms.train-vectors recipe (see here for details). This will let you bootstrap lists of seed terms from your vectors using terms.teach, to speed up the annotation process and get over the “cold start” problem.

Integrating Polyglot into spaCy will take a little more work, because you’ll have to dig into the internals. (The new custom pipeline components and attributes might help a lot, though!) You could also use Polyglot to pre-process your text or extract examples for annotation. Here’s a simple dummy recipe that shows a function extracting sentiment scores and wrapping a stream of annotation tasks:

import prodigy
from prodigy.components.loaders import JSONL

def add_sentiment(stream):
    for eg in stream:
        sent_score = get_sent_score_from_polyglot(eg['text'])  # extract a score for the text
        # add it to the task – you'll likely want to do this more elegantly ;)
        eg['label'] = 'POSITIVE' if sent_score > 0.5 else 'NEGATIVE'
        yield eg

@prodigy.recipe('sentiment-analysis')
def sentiment_analysis(dataset, source):
    stream = JSONL(source)  # load the examples from the source file
    return {
        'dataset': dataset,  # add annotations to this dataset
        'stream': add_sentiment(stream),  # add labels to stream based on sentiment
        'view_id': 'classification'  # annotate in classification mode
    }

The data produced by the recipe can then be used to train your spaCy model.


(Espen Jütte) #3

Thanks for the extensive reply, very helpful.

I did in fact try this out using fastText vectors, using a variation on the script posted in the FastText thread here. Basically: https://gist.github.com/anonymous/4b0ee04cef7f703beb2f539d5577299f

I’ve also verified that this works as expected in spaCy using some test words.

I then try to launch a prodigy-session with some seed-words:
prodigy terms.teach fruits_dataset norsk.model/ --seeds banan,eple
(with norsk being a freshly created db)

When i then start the prodigy-session in my browser it just says “loading”. If i turn on logging the last thing i see is “00:01:45 - CONTROLLER: Iterating over stream”.

As far as I understand, I should be able to use terms.teach to teach the model which terms are related to the label I’m looking for, using the seed terms to create an initial list and the vectors from fastText to help the model understand which words are related. So I was expecting Prodigy to give me words related to my seed words. Am I just misunderstanding terms.teach and the vectors, or did I do something wrong in setting up the model?


(Matthew Honnibal) #4

My first guess was that the spaCy model you’re loading in hasn’t loaded the vectors correctly. If the vectors aren’t loaded, the similarity would be 0.0 for all words. Currently I think this would result in it fruitlessly looping over the vocabulary. However, your script looks correct.

I then had another look at the terms.teach recipe. It looks like it checks the vocabulary entries for is_lower and is_alpha. Could you check whether these attributes are set correctly in your vocabulary?

You should also be able to add some print statements in the recipe to help figure out what’s going wrong. If you want to see how the recipe is supposed to work, try it with some English terms using the en_core_web_lg model.
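
As a rough sketch of the kind of check that could go into such a print statement: the helper below collects the vocabulary entries that pass an is_alpha / is_lower filter. The name candidate_terms is made up for illustration, and it assumes the recipe filters the vocabulary roughly this way; it only relies on each entry having text, is_alpha and is_lower attributes, as spaCy lexemes do.

```python
def candidate_terms(vocab):
    """Return the texts of vocab entries that are alphabetic and lowercase."""
    # Iterating over the vocab only visits entries that were actually added to it.
    return [lex.text for lex in vocab if lex.is_alpha and lex.is_lower]


# Hypothetical usage inside a recipe, after loading the model:
# terms = candidate_terms(nlp.vocab)
# print(len(terms), terms[:10])
```

If the printed count is zero or very small, the recipe has nothing to suggest, no matter how good the vectors are.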


(Espen Jütte) #5

Could be, just to be sure i tested the following after my initial gist:
text = 'banan eple'
doc = nlp(text)
print(text, doc[0].similarity(doc[1]))

This gives me: banan eple 0.342116

So at least in spacy the model seems to be giving distances when asked, at least in some way. But i think looping over the vocabulary fruitlessly seems to be correct, as it seems to be using a full core until interrupted.


(Matthew Honnibal) #6

Sorry, was editing my post while you replied — have a look for further suggestions :slight_smile:


(Espen Jütte) #7

nlp.vocab.__getitem__("banan").is_lower
True

nlp.vocab.__getitem__("Banan").is_lower
False

nlp.vocab.__getitem__("banan").is_alpha
True

This seems to be working correctly as far as i can see.

(This is while working with the vocab in spaCy before using to_disk, not from the Prodigy recipe. So the error could come from saving and loading?)

EDIT:
Just tested in a recepie.py file:
import spacy
nlp = spacy.load('norsk.model')
print(nlp.vocab.__getitem__("Tid").is_lower)

This correctly prints False. So model loading/saving seems to be okay in this regard.


(Matthew Honnibal) #8

And you checked similarity after loading your model too? Hmm.

Btw, you should check a value that’s True for the lexical attributes after loading — the default will be false, so checking False values doesn’t guarantee the attributes were set.


(Espen Jütte) #9

Just to be perfectly safe, i did some extra tests:

Similarity works fine in the recepie.py file (i.e. after loading in Prodigy), and I did test a value that returned True. So indeed: similarity works after loading the model, and is_lower returns True as expected after loading.

Also some odd behavior after further testing:

  1. If I generate a new model from the fastText vectors,
  2. save the model,
  3. create a new database,
  4. create a new Prodigy session with terms.teach,
    then it just keeps loading forever.

Odd behavior:

  1. If I generate a new model from the fastText vectors,
  2. do a similarity between two words to test it,
  3. save the model,
  4. create a new database,
  5. create a new Prodigy session with terms.teach,
    then I get the two words I used to test similarity as suggestions in Prodigy. It does seem to ignore the seed words and any other potential words.

I’m using terms.teach with --seeds. It seems that doing a .similarity somehow “initialized” the vectors, and after I save, these seem to be the only vectors the model can access.

Code used for the similarity test:
text = 'banan eple'
doc = nlp(text)
print(text, doc[0].similarity(doc[1]))

Maybe this helps?


(Espen Jütte) #10

I’ve been digging through the problems a bit, and I think I’m getting pretty close, but this requires some more in-depth spaCy experience again @honnibal.

The original code i used to import the FastText vectors is a slightly modified version of this:

This uses set_vector to set the vectors of certain words extracted from the FastText .vec file. What I find is that the word vector does get added, so I can use get_vector(), but the word does not show up when you iterate through the vocabulary. That basically renders this code from terms.teach useless:

lexemes = [lex for lex in nlp.vocab if lex.is_alpha and lex.is_lower]

Check out the replicable test-code: https://gist.github.com/anonymous/33d091c12fddb5dd10cc41dbb9d728e6

So my guess is that the original code used to import the FastText vectors only sets the vectors, but somehow fails to add the words as proper entries in the model’s vocabulary. So I’m guessing one should use one function to add the word and then another to set the vectors.

Any thoughts?


(Matthew Honnibal) #11

Thanks for the analysis!

I think the problem is that nlp.vocab.set_vector should be adding the word to the vocab, but isn’t. This is a bug in spaCy.

For now, you can work around the problem by adding the word to the vocab explicitly. Adding the line lex = nlp.vocab[orth] before setting the vector should take care of this.
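
A minimal sketch of that workaround, assuming the vectors come from a fastText .vec text file (one header line, then one "word v1 v2 ..." line per word). The helper names parse_vec_lines and add_vectors and the to_array parameter are made up for illustration; vocab is meant to be nlp.vocab, and with spaCy the values would need to be converted to a numpy array before being passed to set_vector.

```python
def parse_vec_lines(lines):
    """Yield (word, list_of_floats) pairs from the lines of a .vec file."""
    lines = iter(lines)
    next(lines, None)  # skip the "n_words n_dims" header line
    for line in lines:
        pieces = line.rstrip().split(" ")
        if len(pieces) < 2:
            continue  # skip empty or malformed lines
        yield pieces[0], [float(value) for value in pieces[1:]]


def add_vectors(vocab, entries, to_array=list):
    """Add each word to the vocab explicitly, then set its vector."""
    count = 0
    for word, values in entries:
        _ = vocab[word]  # the workaround: looking the word up creates its vocab entry
        vocab.set_vector(word, to_array(values))
        count += 1
    return count


# Hypothetical usage with spaCy and numpy installed:
# with open('/path/to/vectors.vec', encoding='utf8') as f:
#     add_vectors(nlp.vocab, parse_vec_lines(f),
#                 to_array=lambda v: numpy.asarray(v, dtype='float32'))
```

Once the words are real vocab entries, iterating over nlp.vocab should include them.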


(Espen Jütte) #12

And it works, it fetches and presents similar words as expected! very nice :>

Thanks for the help and patience, I’ve learned a lot about spaCy and Prodigy internals, which I’m sure will come in handy. You should probably update the example code so that the next person does not have a similar issue.


(Andrely) #13

To piggyback on this for a moment.

We have Norwegian spacy support created with “spacy package”.

We will be creating NER models for Norwegian and are looking at prodigy for this purpose.

  • What would be the steps to integrate such packages with prodigy?
  • Does Prodigy use any part of spaCy beyond the tokenizer?
  • I assume Prodigy only supports spaCy 2 (we are due to upgrade our 1.9 models soon)?

-andré


(Ines Montani) #14

Ah, cool – so many Norwegians :blush: (Norwegian was actually one of the first non-English languages we tested Prodigy with, as part of a project we worked on with a Norwegian company.)

That’s very simple and will work out-of-the-box – just install your model in the same environment, and you’ll be able to load it into Prodigy using its name. For example:

prodigy ner.teach nb_dataset your_norwegian_model data.jsonl

This depends on the recipe and what you’re looking to do – some recipes support splitting long documents into sentences and use spaCy to accomplish that. If you’re training an NER model or text classification model based on a spaCy model, the respective spaCy components will be used.

That said, you can also use Prodigy without spaCy and plug in any other models or libraries via custom recipes. Recipes are simple Python scripts, so everything you can load in Python can be used with Prodigy. All you need is a function that takes a stream and returns sortable (score, example) tuples, and an update() function that takes annotated examples and updates your model. (See the custom recipes workflow for more details and examples).
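
To make the shape of those two pieces a bit more concrete, here is a toy sketch with no spaCy involved. KeywordModel is a made-up stand-in for a real model, and the scoring rule is purely illustrative; the only parts taken from the description above are that calling the model on a stream yields (score, example) tuples and that update() receives annotated examples. It assumes each annotated example carries an "answer" of "accept", "reject" or "ignore".

```python
class KeywordModel:
    """Toy model: scores a text by the learned weights of its words."""

    def __init__(self):
        self.weights = {}

    def __call__(self, stream):
        # Yield a (score, example) tuple for every incoming task.
        for eg in stream:
            words = eg["text"].lower().split()
            total = sum(self.weights.get(word, 0.0) for word in words)
            score = min(max(0.5 + total, 0.0), 1.0)  # clamp to [0, 1]
            yield (score, eg)

    def update(self, answers):
        # Nudge word weights up for accepted and down for rejected examples.
        for eg in answers:
            if eg.get("answer") == "accept":
                delta = 0.1
            elif eg.get("answer") == "reject":
                delta = -0.1
            else:
                continue  # ignored examples leave the weights unchanged
            for word in eg["text"].lower().split():
                self.weights[word] = self.weights.get(word, 0.0) + delta
```

In a custom recipe, the scored stream would then go through a sorter and model.update would be returned as the update callback; how exactly that is wired up is covered in the custom recipes workflow.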

Yes, Prodigy uses neural network models, so it requires spaCy v2.0+. The new version makes it easy to update existing models at runtime, even with small numbers of examples. This feature is very important to make Prodigy work as efficiently as it does. (Btw, in case you haven’t seen it already: the spacy v2.0 migration guide will hopefully make upgrading easier. We’ve also made a lot of improvements to the training process.)


(Andrely) #15

We got the new Norwegian 2.0 model, which is looking very good and a lot better than the 1.9 model we had earlier.

Also got a prodigy license and are creating a NER dataset.

We had to create a custom recipe since we use a package model and custom Language class due to the licensing of our lemmatizer data. That config.view_id tripped me up for a moment but now your very nice annotation interface is up and running :slight_smile:

Looking forward to seeing this product expanded even further. Personally I need an audio annotation tool but I guess that’s not on the roadmap :wink: Keep up the good work!


(Ines Montani) #16

@andrely Yay, thanks for updating, that’s nice to hear!

I’m actually quite intrigued by audio annotation, but you’re right, it’s definitely not part of the immediate roadmap. We first want to finish the text and image-based annotation workflows.

That said, if you don’t need to do manual audio segmentation and just want to collect annotations with audio data (like, whether the quality is good, whether an automated transcription is correct), you can already achieve this with a custom HTML interface ('view_id': 'html'). For example, given input tasks like this:

{"audio_file": "/path/to/audio.mp3", "text": "This is a transcript"}

… you could use a html_template that looks like this and implements the text and a simple HTML5 audio player:

<p>{{text}}</p>
<audio controls><source src="{{audio_file}}" type="audio/mpeg"></audio>

(Kristen Sheppard) #17

Hi Ines!

Something strange happens when I try to implement the audio example above. I’m using prodigy version 1.5.1. My view_id is set to ‘html’ in the recipe.py file.

My audio.jsonl looks like this:
{"audio_file": "~/path/to/file/from/home/directory/241743_left_8khz.wav", "text": "some simple text here"}

My html_template in my prodigy.json file looks like this:
"instructions": "instructions.html",
"choice_auto_accept": true,
"html_template": "

<p>{{text}}</p>
<audio controls><source src={{audio_file}} type=audio/wav></audio>

"

This renders the template with audio controls but no audio plays (i.e. when you press the play button nothing happens). I call it using this recipe:

prodigy mark AudioTranscription audio.jsonl --view-id html

If I add the quotation marks like so:

<audio controls><source src="{{audio_file}}" type="audio/wav"></audio>

Then prodigy throws an error in the terminal - ValueError: Unexpected character in found when decoding object value

I’ve tried several different ways to get the audio to import. Any ideas why the audio may not be importing as expected or why the quotation marks might be causing the error I mentioned above (see below)?

Appreciate any pointers and thanks for a great product!


(Ines Montani) #18

@kshepp Hi! So the HTML template definitely looks good and the fact that it renders as expected is a good sign :slight_smile:

I think the problem might be in the file path: If you want to load a local file in your browser, you usually have to prefix it with file:/// (with 3 forward slashes). You should be able to test this and find the correct path by opening the audio file in your browser.
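
If it helps, here is a small sketch for producing such a path from Python instead of by hand. The function name to_file_uri is made up for illustration; it only uses the standard library and also expands a leading ~, which a browser will not do for you.

```python
from pathlib import Path


def to_file_uri(path):
    """Expand ~ and return an absolute file:/// URI for a local path."""
    # as_uri() needs an absolute path, so expand and resolve first.
    return Path(path).expanduser().resolve().as_uri()


# Hypothetical usage when writing the tasks for the .jsonl file:
# task = {"audio_file": to_file_uri("~/path/to/audio.wav"), "text": "This is a transcript"}
```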

About the error / quotes: The “Unexpected character” error usually means that the JSON reader came across invalid JSON – here, this happens when Prodigy tries to load your prodigy.json and it’s because JSON expects quotation marks around the values. So in your HTML, you’d either have to use single quotes in the HTML, or escape the quotes, like this:

"html_template": "<audio controls><source src=\"{{audio_file}}\" type=\"audio/wav\"></audio>" 
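
One way to avoid escaping by hand is to let Python's json module write the value. This is just a sketch: build_config is a made-up helper, and it only produces the html_template entry, so the result would still need to be merged with the other settings in prodigy.json.

```python
import json


def build_config(template):
    """Serialize the template as valid JSON, escaping the quotes."""
    return json.dumps({"html_template": template}, indent=2)


# The HTML can be written with normal double quotes here.
template = '<audio controls><source src="{{audio_file}}" type="audio/wav"></audio>'
config_text = build_config(template)
print(config_text)
```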

(Kristen Sheppard) #19

Thanks Ines! With those changes everything renders correctly.

I still had issues with audio playback, but that has to do with Chrome/Firefox (not Prodigy). If anyone else has this issue, check out this thread, as it solved my problem with rendering audio tasks on my localhost for testing -

Thanks again!