Trouble training for Portuguese

I'm trying to train spaCy to recognize people's names in Portuguese-language legal documents. I have a file (paragrafos.txt) with about 2000 paragraphs from legal contracts. I started with the standard Portuguese model and have been training from there. I annotated using:

prodigy ner.teach ner_nome pt_core_news_sm paragrafos.txt --label PER

After about 6000 annotations, I run batch-train and still only get 37% accuracy!

prodigy ner.batch-train ner_nome pt_core_news_sm --output /model --eval-split 0.5 --label PER --batch-size 2

This is the result:

> Using 1 labels: PER
> 
> Loaded model pt_core_news_sm
> Using 50% of accept/reject examples (1217) for evaluation
> Using 100% of remaining examples (1516) for training
> Dropout: 0.2  Batch size: 2  Iterations: 10
> 
> 
> BEFORE     0.061
> Correct    40
> Incorrect  617
> Entities   5345
> Unknown    530
> 
> 
> #          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
> 01         7881.630   86         571        32921      0          0.131
> 02         5894.873   205        452        38230      0          0.312
> 03         5179.047   121        536        32562      0          0.184
> 04         4532.524   196        461        34669      0          0.298
> 05         4177.853   207        450        30510      0          0.315
> 06         3960.036   224        433        30300      0          0.341
> 07         3740.007   222        435        30841      0          0.338
> 08         4074.908   239        418        34695      0          0.364
> 09         3990.998   238        419        35026      0          0.362
> 10         3866.311   244        413        31125      0          0.371
> 
> Correct    244
> Incorrect  413
> Baseline   0.061
> Accuracy   0.371
> 
> Model: C:\model
> Training data: C:\model\training.jsonl
> Evaluation data: C:\model\evaluation.jsonl

When I annotate, Prodigy still suggests things like periods, parentheses and numbers as names! Am I doing something wrong? What can I do to improve my results?

As a separate question, I annotated in several sessions, and it seems that each session starts from the beginning of the paragrafos.txt file again, since I keep seeing the same sentences. Do I have to annotate everything in a single session?

Thank you!

Unfortunately the Portuguese NER model we provide in spaCy probably just isn't very good on your data. It was trained on Wikipedia texts, using a semi-automatic procedure based on the linked entities. For some problems this works surprisingly well, but for others it's really not much better than starting from a blank model. I think your problem might be one of those. I also notice that you've only got 40 correct entities in the data, so there's not much for batch-train to learn from.

I suggest you start by annotating 1000 paragraphs or so with ner.manual. Then you can run ner.batch-train with the --no-missing flag, which tells it that the annotations are complete, i.e. that any entity not covered by them is not a true entity. This gives it much more information than the binary annotations, which helps training.
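
For instance, the workflow could look roughly like this (ner_nome_gold is just a placeholder name for the new manual dataset; adjust names and paths to your setup):

prodigy ner.manual ner_nome_gold pt_core_news_sm paragrafos.txt --label PER
prodigy ner.batch-train ner_nome_gold pt_core_news_sm --output /model --label PER --no-missing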

To resume annotation without being asked about the same examples again, try using the --exclude argument with the name of your dataset.
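
For example, with your current dataset:

prodigy ner.teach ner_nome pt_core_news_sm paragrafos.txt --label PER --exclude ner_nome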

Thank you Matthew. I don't understand why it says there are only 40 correct. I must have annotated over 1000 correct names. I'll try ner.manual and update here!

Oh, right you are — I misread. It means the model only predicted 40 correctly initially. Hmm.

Maybe you should try starting from a model with just word vectors for the batch-train? You could download some vectors from here: https://fasttext.cc/docs/en/crawl-vectors.html

To convert the vectors into a spaCy model, run:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pt.300.bin.gz
python -m spacy init-model ./pt_vectors_web_lg --vectors cc.pt.300.bin.gz

This should save a directory ./pt_vectors_web_lg with a vocab/ folder containing the FastText vectors. You can then pass this path to ner.batch-train.

** Edited to include latest updates

I think I’m going to need a more powerful machine to run this!

ValueError: 2400000051 exceeds max_bin_len(2147483647)

UPDATE1:

@honnibal, so MessagePack has a limit of 2GB per message, and the vectors file I generated from FastText is 2.4GB, hence the error message. I saw in your other post that you suggested modifying the script to pass vocab=False, but I have no idea how to do this. What should I do to make Prodigy run batch-train on the large vectors file?

UPDATE2:

This is driving me crazy! Since my last update, I decided to create a VM in GCloud and install Prodigy there. I thought having a more powerful machine would help…

Anyway, I again imported the FastText model, converted it, imported the annotations DB from my local machine, and tried to run ner.batch-train. Now I’m getting a KeyError!

python -m prodigy ner.batch-train ner_pt pt_vectors_web_lg --output /model --eval-split 0.5 --label PER
Using 1 labels: PER
Loaded model pt_vectors_web_lg
Using 50% of accept/reject examples (1374) for evaluation
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rogerio_bromfman/env/lib/python3.5/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/rogerio_bromfman/env/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/rogerio_bromfman/env/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/rogerio_bromfman/env/lib/python3.5/site-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 143, in prodigy.components.preprocess._add_tokens
KeyError: 12

Please help!

Looking at the post below, I'm getting exactly the same error. Initially I thought it had to do with the limitations of my machine, but now it seems to be a limitation in msgpack.

In that post, you suggested passing vocab=False to model.to_disk().

I looked at the batch-train recipe and also at the Prodigy Python source code, but couldn't figure out what I needed to change.

Any help would be much appreciated!

Thanks!

The complete error log I get is the following:

python -m prodigy ner.batch-train ner_nome pt_vectors_web_lg --output /model --eval-split 0.5 --label PER --batch-size 1
Using 1 labels: PER

Loaded model pt_vectors_web_lg
Using 50% of accept/reject examples (1374) for evaluation
Traceback (most recent call last):
  File "C:\Users\Rogerio\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Rogerio\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\prodigy\__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\prodigy\recipes\ner.py", line 411, in batch_train
    model = EntityRecognizer(nlp, label=label, no_missing=no_missing)
  File "cython_src\prodigy\models\ner.pyx", line 165, in prodigy.models.ner.EntityRecognizer.__init__
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 274, in _reconstruct
    y = func(*args)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 273, in <genexpr>
    args = (deepcopy(arg, memo) for arg in args)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "C:\Users\Rogerio\Anaconda3\lib\copy.py", line 274, in _reconstruct
    y = func(*args)
  File "vectors.pyx", line 24, in spacy.vectors.unpickle_vectors
  File "vectors.pyx", line 428, in spacy.vectors.Vectors.from_bytes
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\spacy\util.py", line 490, in from_bytes
    msg = msgpack.loads(bytes_data, raw=False)
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "msgpack\_unpacker.pyx", line 200, in msgpack._unpacker.unpackb
ValueError: 2400000051 exceeds max_bin_len(2147483647)

Hi Rogerio,

Sorry for the trouble. We need to figure out another solution for serialising the vectors, due to that msgpack limit. In the meantime, perhaps you could try adding this argument to the init-model command:

python -m spacy init-model ./pt_vectors_web_md --vectors cc.pt.300.bin.gz --prune-vectors 20000

This will limit the vectors to 20,000 rows, so that it’s much smaller. Words outside of the 20,000 you’re retaining will be mapped to their nearest vector.

Hi Matt,

First of all, the suggestion to limit the number of vectors to 20k did work, at least in the sense that I can now train the FastText model for NER. However, I'm running into all sorts of other issues, and I'm pretty certain I'm doing something wrong…

So when I first ran ner.teach, after recreating the model with fewer vectors and adding the NER pipe (as per another post by Ines), I got the following message:

?  ERROR: Can't find label 'PER' in model pt_vectors_web_md
ner.teach will only show entities with one of the specified labels. If a
label is not available in the model, Prodigy won't be able to propose
entities for annotation. To add a new label, you can specify a patterns file
containing examples of the new entity as the --patterns argument or
pre-train your model with examples of the new entity and load it back in.

This made sense, since it was a new model and my annotations had previously been done on a different one. So I ran ner.manual first to add the labels and manually annotate a number of sentences. I did quite a few (around 400) manual annotations and then ran ner.batch-train. The accuracy came out at just around 20%, so clearly I needed more annotations.

I ran ner.teach and annotated another 1000 sentences. I was discouraged to see that the suggestions were really way off. Only about 4-5% of the suggestions were relevant; the rest were things like punctuation marks and numbers suggested as people's names with a score of 1.00! I ran ner.manual again to see if I could improve the predictions. After a few iterations, and after adding a few more labels for things like street names, dates and company names, I finally managed to get to an accuracy of 45%. This was my last output:

C:\Users\Rogerio\Python\Python36>python -m prodigy ner.batch-train nomes_ner pt_vectors_web_md --output nomes_model --label "PER, RUA, CID, DAT, COM" --eval-split 0.2 --batch-size 3 --n-iter 6

Using 5 labels: PER, RUA, CID, DAT, COM

Loaded model pt_vectors_web_md
Using 20% of accept/reject examples (283) for evaluation
Using 100% of remaining examples (1692) for training
Dropout: 0.2  Batch size: 3  Iterations: 6


BEFORE     0.022
Correct    4
Incorrect  182
Entities   3705
Unknown    3701


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         2551.374   33         153        1106       0          0.177
02         1497.026   62         124        1187       0          0.333
03         1394.799   57         129        1358       0          0.306
04         1269.312   69         117        1262       0          0.371
05         1198.393   69         117        1200       0          0.371
06         1178.639   83         103        1171       0          0.446

Correct    83
Incorrect  103
Baseline   0.022
Accuracy   0.446

Model: C:\Users\Rogerio\Python\Python36\nomes_model
Training data: C:\Users\Rogerio\Python\Python36\nomes_model\training.jsonl
Evaluation data: C:\Users\Rogerio\Python\Python36\nomes_model\evaluation.jsonl

Immediately after that I ran the following:

python -m prodigy ner.teach nomes_ner pt_vectors_web_md paragrafos2.txt --label "PER, RUA, CID, DAT, COM" --exclude nomes_ner

And I got the same error message as before, saying that it can't find label 'PER'!! This is driving me crazy!!

Hopefully you can help me with a few questions:

Firstly, why do I get this message even after I just annotated and trained successfully?

Second, why is my precision so low even after 3000+ annotations? For context, the training sentences I'm using were pulled from a batch of 600 legal contracts that were in PDF image format. I used Google's Cloud Vision API to convert them to text, filtered some of the paragraphs away, separated the rest into sentences and saved them as a text file (paragrafos2.txt). Is it possible that the language in these contracts is too complex and I'm never going to get good precision with spaCy? Is it possible that the inaccuracies of the OCR (which I think did a fair job, but not a great one) are getting in the way?

Any suggestions of what I can do to improve my results?

Thank you so much!!

Did you load in the right model when you ran ner.teach? From the command you posted, it looks like you used the old pt_vectors_web_md model and not the nomes_model you trained and updated? (see output of the training command)

1692 unique examples (excluding the data held back for evaluation) is still a fairly low number, especially for training 4 new categories in a very specific domain. So that's definitely a possible explanation. Another question to ask is how well the examples generalise, and how easy it is for the model to learn the labels based on the surrounding context. PER and COM (I assume that's company / organization?) seem pretty straightforward, but I'm not sure about the other ones, what they mean and how they're used in the data.

Ines,

I can’t thank you and Matt enough for the patience you’re having with me!

As you pointed out, I was batch-training one model and then running ner.teach with the untrained one afterwards. I'm embarrassed! Obviously training the correct model made all the difference! I just made a further 500 annotations, focused solely on PER (people's names). I used the following command:

python -m prodigy ner.teach nomes_ner nomes_model paragrafos2.txt --label PER --exclude nomes_ner

Now a lot more of the suggestions were relevant. There were some things I didn't know the right way to handle: when it suggested just part of a name, I accepted; when it suggested a name together with some extra words, I ignored it. I didn't want to reject, as the selected span did contain a name, but I also didn't want the model picking up more than the name…

Anyway, after 500 annotations on PER I ran a batch-train again. The results were pretty much the same as before unfortunately…

python -m prodigy ner.batch-train nomes_ner pt_vectors_web_md --output nomes_model --label PER --eval-split 0.2 --batch-size 3 --n-iter 6

    Using 1 labels: PER

    Loaded model pt_vectors_web_md
    Using 20% of accept/reject examples (298) for evaluation
    Using 100% of remaining examples (1753) for training
    Dropout: 0.2  Batch size: 3  Iterations: 6


    BEFORE     0.006
    Correct    1
    Incorrect  164
    Entities   4975
    Unknown    829


    #          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
    01         2157.141   53         112        1230       0          0.321
    02         1553.194   74         91         1226       0          0.448
    03         1397.151   68         97         1464       0          0.412
    04         1398.000   70         95         1563       0          0.424
    05         1320.851   67         98         1404       0          0.406
    06         1294.425   60         105        1610       0          0.364

    Correct    74
    Incorrect  91
    Baseline   0.006
    Accuracy   0.448

    Model: C:\Users\Rogerio\Python\Python36\nomes_model
    Training data: C:\Users\Rogerio\Python\Python36\nomes_model\training.jsonl
    Evaluation data: C:\Users\Rogerio\Python\Python36\nomes_model\evaluation.jsonl

I'll keep on annotating and hopefully will be able to improve… Question: in this case, should I have run batch-train on nomes_model, or trained it from scratch from pt_vectors_web_md like I did?

Regarding the last part of your response, the labels I was using were:

PER - people's names
RUA - street names
CID - cities
DAT - dates
COM - company names

I think one of the big problems I'll have is that streets in Brazil are usually named after people, so the model will have to learn to distinguish whether a name refers to a person or a street. It will have to understand that street names usually start with the word for "street" or "avenue", and are typically followed by a comma and a number, since house numbers here come after the street name.

Anyway, I will do a lot more annotations on PER and try to get the accuracy to 70%+. Then I’ll worry about the other categories.

Again, thanks a lot for responding!

PS: I didn't mention that I keep getting the RuntimeWarning: cymem.cymem.Pool size changed... warning, but I see that other people are getting this too, so I'm just ignoring it.

Yay, glad it works now!

Okay, this probably explains a lot! You should definitely reject incomplete entities and any other suggestions that are wrong (even if it's sometimes painful when the model almost got it right :wink:). My comment on this thread explains the reasoning behind this in more detail:

So if you've been annotating differently, I'd definitely suggest converting your existing annotations to gold-standard, pre-training your model from that, and then trying the binary workflow again with a fresh dataset. You could also try adding some patterns when you run ner.teach, to make sure the model sees enough positive examples during annotation. For example, some street names or abstract patterns of possible street names could work well (e.g. any token + - + any token + "avenue").
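
A patterns file is just JSONL, with one entry per line. As a rough illustration only (you'd want to adapt the tokens and attributes to what actually appears in your contracts), street-name patterns saved as e.g. patterns.jsonl and passed to ner.teach via --patterns patterns.jsonl could look like this:

{"label": "RUA", "pattern": [{"lower": "rua"}, {"is_title": true}]}
{"label": "RUA", "pattern": [{"lower": "avenida"}, {"is_title": true}]}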

This is interesting and definitely something I'd keep an eye on! (Also a nice example of why it's always super important to reason about the data and be familiar with both the language and the domain!)

I created a new dataset and started annotating from scratch. Annotating as per your suggestion seems to have done the trick. I’m now at 72% accuracy after 2500 annotations! Thank you!

@honnibal -- just wanted to let you know that I am also having this issue while working with large vectors (200D). Should I raise an issue on the GitHub repository for spaCy or Prodigy?

@mikeross It would be a spaCy issue, so yes raising it there would be good.

I think this might mitigate the issue: in the ner.batch-train recipe, you can change it so that it doesn't try to serialize the vectors into the message. The vectors and vocab are static; we only need to save out the NER model state after each epoch. So in the recipe, instead of model.nlp.to_bytes(), try this:

            if output_model is not None:
                # model.nlp.to_bytes() uses a tonne of memory if we have vectors.
                # Instead, just serialize the NER component's weights.
                with model.nlp.use_params(model.optimizer.averages):
                    model_to_bytes = model.nlp.get_pipe('ner').to_bytes(vocab=False)
                best = (stats['acc'], stats, model_to_bytes)

Then after training, load back the best model with model.nlp.get_pipe('ner').from_bytes(best_model). The recipe source should be in your virtual environment, or you can clone the recipes repo here: https://github.com/explosion/prodigy-recipes
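
The end of the recipe could then look roughly like this (just a sketch, reusing the best and output_model variables from the change above):

    # After the training loop: restore the best NER weights into the pipeline
    best_acc, best_stats, best_model_bytes = best
    model.nlp.get_pipe('ner').from_bytes(best_model_bytes)
    if output_model is not None:
        # Save the full pipeline to the output directory as before
        model.nlp.to_disk(output_model)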

I’ve applied this change to Prodigy master already, so it should be out with the next version.

After about 10000 annotations, and after trying three different word embedding models (FastText, Word2Vec CBOW and Skip-Gram), I'm back to around 40% accuracy. Worse, with ner.teach about a quarter of the suggestions are single periods (.) and commas, and another sizeable portion are numbers. It only gets one right guess every 150-200 annotations. I want to write a pattern saying that names should be composed of at least two words, with no punctuation or numbers, but since I can't use RegEx in the patterns, I don't know how to write it. Could you help me write this pattern, please?

I'm not sure this would be a good fit for patterns, since those should really be abstract descriptions of potential entity candidates. You could definitely write patterns using attributes like "is_alpha": true, "is_punct": false etc., but this would suggest any combination of tokens that matches those descriptions, which is not what you want.
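
Just to illustrate why: a purely hypothetical pattern like the one below describes "two consecutive capitalised, alphabetic tokens", which would match people's names, but also street names, city names, company names and any two capitalised words at the start of a sentence, so it would flood you with false candidates:

{"label": "PER", "pattern": [{"is_title": true, "is_alpha": true}, {"is_title": true, "is_alpha": true}]}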

If you've collected this many annotations and experimented with different vectors, I think what makes much more sense is to take a step back and think about what this result actually means, and adjust your strategy accordingly.

If I read your comments correctly, you got pretty promising results in the beginning and it got worse with more annotations? When you restarted the annotation session, did you keep updating the model you trained in the previous step with ner.batch-train? (If you started all of your sessions with a blank model, this might explain why the quality of suggestions didn't get better over time.)

It's also possible that the model just doesn't converge and can't get over the cold-start problem. Maybe the data just isn't suitable for the semi-automatic approach, or maybe there's something about Portuguese that makes it harder. Maybe there's just no way around collecting a set of gold-standard manual annotations for your use case, even if that takes a bit longer and is less convenient. A useful next experiment would definitely be to run a recipe like ner.manual and label a bunch of examples from scratch with all labels. Then run ner.batch-train again with the --no-missing flag (to take advantage of the fact that you know your data is gold-standard) and check out the results.
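
Roughly, that experiment could look like this (nomes_gold and nomes_model_gold are just placeholder names; adjust the labels, model and paths to your setup):

python -m prodigy ner.manual nomes_gold pt_vectors_web_md paragrafos2.txt --label "PER,RUA,CID,DAT,COM"
python -m prodigy ner.batch-train nomes_gold pt_vectors_web_md --output nomes_model_gold --label "PER,RUA,CID,DAT,COM" --eval-split 0.2 --no-missing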