Train POS tagger after custom tokenization

mariastefan · July 17, 2020, 4:48pm

Hello, I am new to spaCy and I am struggling to train the POS tagger.

I am trying to train the POS tagger after customizing the tokenizer.

For example, the tokenization of the text
Il est culotté celui-là.
is now
['Il', 'est', 'culotté', 'celui-là', '.']

rather than the original one:
['Il', 'est', 'culotté', 'celui', '-', 'là', '.']
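To make the difference concrete, here is a minimal pure-Python sketch of the two behaviours. It is only an illustration: the regular expressions and the function names `split_hyphens` and `keep_hyphens` are assumptions made for this example, not spaCy's tokenizer rules and not the code from the gist.

```python
import re

def split_hyphens(text):
    """Default-like behaviour: a hyphen becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def keep_hyphens(text):
    """Custom-like behaviour: hyphen-joined words stay one token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

sentence = "Il est culotté celui-là."
default_tokens = split_hyphens(sentence)
custom_tokens = keep_hyphens(sentence)

print(default_tokens)
print(custom_tokens)
```

The tag list in the training data has to match whichever of these two token sequences the training step actually sees.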

My problem is that nlp.update() doesn't seem to use my customized tokenizer: I can't annotate 'celui-là' as one token, only as three:
TRAIN_DATA = [
('celui-là', {'tags': ['PRON','PUNCT', 'PRON']})
]

But it should be:

TRAIN_DATA = [
   ('celui-là', {'tags': ['PRON']})
]
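One way to narrow this down is to check, before training, whether each example has as many tags as the tokenizer produces tokens. The sketch below is an assumption-laden illustration: `find_misaligned` and `whitespace_tokenize` are hypothetical helpers written for this example, and the whitespace tokenizer is only a stand-in for the real one.

```python
def find_misaligned(tokenize, train_data):
    """Return (text, tokens, tags) for every example whose number of
    tags differs from the number of tokens produced by `tokenize`."""
    problems = []
    for text, annotations in train_data:
        tokens = tokenize(text)
        tags = annotations["tags"]
        if len(tokens) != len(tags):
            problems.append((text, tokens, tags))
    return problems

def whitespace_tokenize(text):
    # Stand-in tokenizer for the demo (hypothetical, not spaCy's).
    return text.split()

TRAIN_DATA = [
    ("celui-là", {"tags": ["PRON"]}),
    ("celui-là", {"tags": ["PRON", "PUNCT", "PRON"]}),
]

problems = find_misaligned(whitespace_tokenize, TRAIN_DATA)
for text, tokens, tags in problems:
    print(f"{text!r}: {len(tokens)} token(s) {tokens} vs {len(tags)} tag(s) {tags}")
```

With spaCy, `tokenize` could be something like `lambda text: [t.text for t in nlp.make_doc(text)]`. `make_doc` runs only the tokenizer, not the pipeline components, so if the merging is done in a pipeline component rather than in the tokenizer itself, this check would still show three tokens for 'celui-là'.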

However, the output shows that the customized tokenizer is applied, so I conclude that the tagger is being trained before the custom tokenizer runs.

Here are the code and output:

https://gist.github.com/mariastefan/e664b279e735916d8e196467769874b5

1. train_new_tagger.py

import random
from pathlib import Path
import spacy
import sys
import os
sys.path.append('.')
from resolution_coreferences_pronominales.custom_model_training.custom_tokenizer import nlp_loader

output_dir = os.path.abspath(os.path.dirname(__file__)) + '/customPOS/'
# base_model = 'fr_core_news_sm'

This file has been truncated.

2. custom_tokenizer.py

import fr_core_news_sm
import os
from spacy.matcher import Matcher
import json
from spacy.language import Language
from spacy.tokens import Doc
import spacy

json_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "..")) + \
            '/custom_model_training/custom_model_params/compound_words.json'

This file has been truncated.

3. Output 1

# This is the output when I tag 'celui-là' with 'PRON', 'PUNCT', 'PRON' (rather than what I want to achieve: just 'PRON')

Saved model to /home/maria/Documents/resolution-des-coreferences-pronominales/resolution_coreferences_pronominales/custom_model_training/customTokenizerModel/
{'parser': 0.0, 'tagger': 0.0, 'ner': 0.0}
{'ner': 0.0, 'parser': 0.0, 'tagger': 0.0}
{'parser': 0.0, 'ner': 0.0, 'tagger': 0.0}
{'tagger': 0.0, 'parser': 0.0, 'ner': 0.0}
{'tagger': 0.0, 'parser': 0.0, 'ner': 0.0}
{'tagger': 0.0, 'parser': 0.0, 'ner': 0.0}
{'parser': 0.0, 'ner': 0.0, 'tagger': 0.0}

This file has been truncated.

There are more than three files in the gist.

Do you know how I can apply my tokenizer modifications before training the tagger, so that I can train it on the right tokens?

Thank you.

mariastefan · July 23, 2020, 2:22pm

Any suggestions? I am really stuck.
Any ideas would be very much appreciated.
Thank you.
