spaCy NER - tokenizer for camembert-base

Hello,
I am trying to fine-tune a NER model based on camembert-base, so I specified it in the following config file:

[paths]
train = tmp/activity/train.spacy
dev = tmp/activity/dev.spacy
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "fr"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "camembert-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 10000
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

However, during the annotation phase in Prodigy I do not explicitly use the tokenizer that comes with the Hugging Face model, namely:

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

But from what I understand of the documentation (transformers-tokenizers), I should.

But I don't understand where to find the files:

  1. --tokenizer-vocab
  2. -F transformers_tokenizers.py

How do I find them?

The docs that you mention refer to this custom recipe found in a file called transformers_tokenizers.py on GitHub. Is this the recipe you were looking for? Let me know if that's not the case. This recipe is meant for users who want to train their own Hugging Face models without spaCy in the mix.

Note that if you're interested in using spaCy with transformers, you can also just use the standard ner.manual recipe. I wrote a bit more context about that here:

In case it's relevant: more information on the use of transformer models in spaCy can be found in this GitHub FAQ.

My goal is to use spaCy to fine-tune the camembert-base model from Hugging Face, but I am not very comfortable with these concepts.

The training command I use is the following:

python -m spacy train tmp/train_for_spacy/functions/config.cfg --gpu-id 0

config.cfg being the config file above, which refers to the Hugging Face camembert-base model.

Am I wrong in thinking I can fine-tune camembert-base using spaCy?

My goal is to use spaCy to fine-tune the camembert-base model from Hugging Face.

Is it your goal to use camembert inside of spaCy? That is to say: you want to use it as a component inside of a spaCy pipeline and train other components, like NER, on top of it? If that is the case, you can just use ner.manual and spaCy will handle all the token conversions for you.
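
For example, with something along these lines (the dataset name, file path, and labels are just placeholders for your own setup):

python -m prodigy ner.manual caller_names blank:fr data/conversations.jsonl --label CALLER,SIGNATORY

The tokens you annotate don't need to match camembert's subword tokens; spaCy aligns the two during training.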

Or do you want to fine-tune camembert via Hugging Face directly? In that case you may want to use this custom recipe on GitHub instead, because then you won't be using spaCy to handle the token alignment.

The training command I use is the following:

Are you experiencing an error message while training? If so, feel free to share a small subset of the data and I'll try to reproduce it.

Is it your goal to use camembert inside of spaCy?

Yes, and it works well. I don't have any bugs.

But what I notice is that as I increase the amount of training data (from 500 to 1500 examples), my validation score increases (from 50% to 54%) but my test score decreases (from 60% to 50%).
I'm looking for leads, which is why I was wondering about tokenization. Any ideas?

as I increase the amount of training data (from 500 to 1500 examples), my validation score increases (from 50% to 54%) but my test score decreases (from 60% to 50%).

There can be a number of reasons for this. One reason that I've encountered is that there may be statistical differences between the train/test/dev sets as you're annotating. If these sets aren't sampled randomly, it could be that your test set contains more "hard" examples while your train/dev sets contain more "easy" examples. This is speculation, but it's something that might be happening in your dataset. The only way to know for sure is to dive deeper into your data. Are there specific entities that your pipeline gets wrong?

Food for Thought

I figured it might be nice to share this PyData video because it might offer some inspiration. It's about a spaCy project that I did a few years ago where I tried to detect programming languages in text. It turned out that my setup was slightly biased and that I was judging my model on easy languages (like Python and JavaScript) while ignoring harder languages (like Go).

As the video helps explain, I learned a lot by looking at the mistakes that the model makes. Then, by iterating on both the model and the data, I was able to make my annotation process much more reliable.


Yes, typically for the detection of the first and last name of the caller, and the first and last name of the signatory.

  • The caller is the person on the phone with the agent.
  • The signatory is the person who will sign the contract, and who is not necessarily the person on the phone. It is often the person responsible for the person on the phone.
  • Other people can be mentioned in the conversation.

Context:

  • The telephone transcription is not of good quality, and often words or parts of sentences are missing. Some words can be confused with others.
    In addition, a person's name will be written differently within the same conversation depending on what the speech-to-text understood.
    All this adds to the difficulty for the model.

Here are the model errors I am facing:

  1. Extraction of several names for the caller
    Ex: caller first and last name: [mister perrache, damien perrache, perrache, mister perrot, daniel perrache, madame bernard, madame floriane]. How do I return only one name, since in the end there is only one caller? In this case I imagined returning the most frequently mentioned name, i.e. "perrache". What would you have done in this case?

  2. The same name is found as both caller and signatory
    Ex: signatory first and last name: [Mr. perrache, Mrs. florence]. In this case I would imagine keeping Mr. Perrache. The caller can be the signatory, but it is rare, and it is wrong in this case because no signatory is explicitly mentioned.

  3. For the signatory I have a very low precision of 0.20, but there is a big imbalance in my training set because the signatory's name appears in only 20% of the conversations. So I don't see any other solution than to re-balance the classes in my training set. But that means spending a lot of time identifying the conversations where the signatory's name appears (by text pattern via Prodigy) in order to label them. Is there any other method than having to add more and more examples to the training set?

And more generally, when the results are not what I expect, what improvements should I make?

  • Systematically adding examples to the training set (this is often slow and costly in my case).
  • Using data augmentation. How do you do that for NER? Are there tools that can generate examples from a corpus of examples?
  • Playing with the model's parameters? If so, which ones, and how do I select them?
  • Other tips?

I'll respond with some ideas that come to mind.

The telephone transcription is not of good quality, and often words or parts of sentences are missing.

What I'm about to suggest is "a trick". I've heard people use it, but the only way to know for sure if it works is to actually try it. I've heard people take the audio segment and speed it up/slow it down to create multiple audio tracks based on the same origin. Next, you can feed it to the same transcription service. You'd effectively end up with multiple transcriptions from the same audio track which allows you to figure out which parts of the transcriptions are reliable and which ones aren't.

Again, you'd have to try it out to see if it helps, but it might be worth the exercise if the transcriptions are really bad.
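
As a rough, untested sketch of that idea (it assumes the librosa and soundfile packages and a hypothetical input file; any audio toolkit would do):

# Sketch: create speed variants of one call so each can be transcribed separately.
import librosa
import soundfile as sf

y, sr = librosa.load("call_0001.wav", sr=None)            # hypothetical recording
for rate in (0.9, 1.0, 1.1):                              # slow down / original / speed up
    variant = librosa.effects.time_stretch(y, rate=rate)
    sf.write(f"call_0001_x{rate}.wav", variant, sr)
# Send each variant through the same transcription service and compare the outputs
# to see which parts of the transcription are stable.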

Extraction of several names for the caller

Have you considered making a two-step system? If the NER model detects mister perrache instead of perrache then it seems like it'll be easy to take the name entity and to pass it through a rule-based system that removes the title that comes before the name.
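
As a minimal sketch of that second, rule-based step (the title list is just an example and would need to match your data):

# Sketch: strip courtesy titles from a detected name entity before further processing.
TITLES = {"mister", "monsieur", "madame", "mademoiselle", "mr", "mrs", "m", "mme"}

def strip_title(entity_text: str) -> str:
    words = entity_text.split()
    while words and words[0].lower().rstrip(".") in TITLES:
        words = words[1:]
    return " ".join(words)

print(strip_title("mister perrache"))   # -> "perrache"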

There may also be other places where a two-step approach could help. I can imagine that detecting a name first and detecting the role after is easier than doing both at the same time.

For the signatory I have a very low precision of 0.20, but there is a big imbalance in my training set because the signatory's name appears in only 20% of the conversations

You mention that low precision is an issue; does it have good recall?

I agree that it would help to get more examples. You could try to re-use your trained NER pipeline for this: apply it to your unlabelled set to find the subset where a signatory is likely to appear. Those examples might deserve priority. Using patterns would also be a good idea here.
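
A rough sketch of that idea (the model path and label name are assumptions about your setup):

# Sketch: run the trained pipeline over unlabelled texts and keep the ones where a
# signatory-like entity is predicted, so those get annotated first.
import spacy

nlp = spacy.load("output/model-best")                      # hypothetical path to your trained pipeline
texts = [line.strip() for line in open("unlabelled.txt", encoding="utf8")]

candidates = [
    doc.text
    for doc in nlp.pipe(texts, batch_size=32)
    if any(ent.label_ == "SIGNATORY" for ent in doc.ents)  # hypothetical label name
]
print(f"{len(candidates)} texts likely mention a signatory")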

Systematically adding examples to the training set (this is often slow and costly in my case).

It can be a lot of work for sure. But machine learning algorithms really need a high quality dataset to be reliable.

Using data augmentation. How do you do that for NER? Are there tools that can generate examples from a corpus of examples?

In my experience it's very hard to make data augmentation do the thing you want it to do. Yes, you might get more training data, but it might be training data that's unlike the data that you'll see in production. For example, you could introduce typos to make your model more robust against that. But if the transcription service never produces typos, you might skew the model in a direction that won't help.

Playing with the model's parameters? If so, which ones, and how do I select them?

I would delay hyperparameter search until you're confident that your dataset isn't the issue anymore. I usually like to ask myself: "what would improve the model more: if I spend an hour trying more settings or if I annotate more data in an hour?".

To help with this, you might want to check out the train-curve recipe if you haven't seen it already.
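
Something like this, for example (the dataset name is a placeholder; check python -m prodigy train-curve --help for the exact arguments in your Prodigy version):

python -m prodigy train-curve --ner caller_names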

Other tips?

Thinking about two-step approaches might be a good investment. They might make the ML task easier and they allow you to re-use domain knowledge to make the task more reliable.

I also think it can be a good exercise to think about ML tricks to find a subset of interest. Patterns and trained models can really help you find examples that you want to include in your labelled dataset.

Here are the stats I get:

  "p":0.2058823529,
  "r":0.5833333333,
  "f":0.3043478261

OK, so as a first step I detect the person's name. In that case, can't I directly use the base model without fine-tuning it?
The question is: wouldn't I get better name detection with the base model, even if the type of input is very different from what I have in my training set? Or is it more important to have a model that was trained on inputs that really match what I will have as input?
And then I train a classifier with 3 classes:

  • caller
  • signatory
  • other

Another question:
As the conversations are long (15,000 words on average) and context is needed (for example, knowing that the agent asked for the signatory's name), I split the conversation into blocks of 200 words. If I reduce the size of each example to the strict minimum, would I get better predictions, even though in that case the size of the text blocks would differ between training and inference?

Here are the stats I get:

Ah yeah. The recall isn't great, but at least it's much higher than the precision.

The question is: wouldn't I get better name detection with the base model, even if the type of input is very different from what I have in my training set? Or is it more important to have a model that was trained on inputs that really match what I will have as input?

Ah, I should've been clearer. From your response I had gathered that you had an entity label covering both the first name and the surname. In that case I would recommend the two-step approach where step 1 is "detect the full name" and step 2 is "split the first name from the surname".

You can use a base spaCy model for this, but you can also take the base spaCy model and fine-tune its "person" label on your dataset.
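
As a small sketch of that first step with a pretrained French pipeline (the model name and example sentence are just illustrations):

# Sketch: detect full person names with a pretrained French pipeline; splitting the
# title / first name / surname can then be done with rules in a second step.
import spacy

nlp = spacy.load("fr_core_news_md")     # any pretrained French pipeline with an NER component
doc = nlp("Bonjour, je suis monsieur Damien Perrache, j'appelle au sujet du contrat.")
for ent in doc.ents:
    if ent.label_ == "PER":             # the French models use PER for person names
        print(ent.text)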

And then I train a classifier with 3 classes:

Yep! That's certainly an approach that I might try.
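
If it helps, the annotation side of that could look roughly like this (dataset and file names are placeholders; the snippets you feed it would be the contexts around each detected name):

python -m prodigy textcat.manual caller_roles data/name_contexts.jsonl --label CALLER,SIGNATORY,OTHER --exclusive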

If I reduce the size of each example to the strict minimum, would I get better predictions, even though in that case the size of the text blocks would differ between training and inference?

This is very hard to say upfront. You'd always have to check by running an experiment. In general though, I'd try to have your evaluation data mimic the real life use-case as closely as possible. You're allowed to try many different techniques on your train data. But you want to be careful that you don't change the test data just to get a better accuracy number while accidentally making it less like reality.

I guess a final hint that came into my mind: is it possible to worry about just a single entity for now? I'm not sure what your application is, but sometimes you can choose to solve a smaller problem first.

Do you really need all the names right away to provide value? What about just making a model that can confirm if there is a signatory? Would that be sufficient for a business case?

I'm asking these questions mainly to spark some inspiration, it's certainly possible that I'm skipping over something here. But sometimes you can choose to ignore a label. If only for the time being.

Perfect, I can increase my score by splitting the processing into two steps. Thanks.

But I have a problem with the spancat. My score stays at 0 with more than 585 training docs and 194 evaluation docs. I tried with the "standard" model and with "camembert-base".

Whereas in the video tutorial:

he already manages to train a model with a score of about 60% with only about 20 samples. And if I reproduce what he does with my data, the only difference being that the language is French instead of English, I still get 0.

Here are the 2 commands I tested when annotating:

python3 -m prodigy spans.manual train blank:fr data/train.jsonl --label "Intentions exprimées"
python3 -m prodigy spans.manual train fr_dep_news_trf data/train.jsonl --label "Intentions exprimées"

Here is what I get after analyzing the data via "debug-data":

============================ Data file validation ============================
proxies= None
token= None
f= vocab_file file_path= https://huggingface.co/camembert-base/resolve/main/sentencepiece.bpe.model
f= tokenizer_file file_path= https://huggingface.co/camembert-base/resolve/main/tokenizer.json
f= added_tokens_file file_path= https://huggingface.co/camembert-base/resolve/main/added_tokens.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
f= special_tokens_map_file file_path= https://huggingface.co/camembert-base/resolve/main/special_tokens_map.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
f= tokenizer_config_file file_path= https://huggingface.co/camembert-base/resolve/main/tokenizer_config.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
unres= ['added_tokens_file', 'special_tokens_map_file', 'tokenizer_config_file']
files= {'vocab_file': '/home/cache/transformers/dbcb433aefd8b1a136d029fe2205a5c58a6336f8d3ba20e6c010f4d962174f5f.160b145acd37d2b3fd7c3694afcf4c805c2da5fd4ed4c9e4a23985e3c52ee452', 'tokenizer_file': '/home/cache/transformers/84c442cc6020fc04ce266072af54b040f770850f629dd86c5951dbc23ac4c0dd.8fd2f10f70e05e6bf043e8a6947f6cdf9bb5dc937df6f9210a5c0ba8ee48e959'}
Some weights of the model checkpoint at camembert-base were not used when initializing CamembertModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: fr
Training pipeline: transformer, spancat
585 training docs
194 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (585)

============================== Vocab & Vectors ==============================
ℹ 105167 total word(s) in the data (5067 unique)
⚠ 859 misaligned tokens in the training data
⚠ 292 misaligned tokens in the dev data
ℹ No word vectors present in the package

============================ Span Categorization ============================

Spans Key   Labels
---------   ------------------------
sc          {'Intentions exprimées'}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
⚠ No examples for texts WITHOUT new label 'Intentions exprimées'
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ℹ Span characteristics for spans_key 'sc'
ℹ SD = Span Distinctiveness, BD = Boundary Distinctiveness

Span Type              Length   SD     BD
--------------------   ------   ----   ----
Intentions exprimées   9.84     0.66   0.87
--------------------   ------   ----   ----
Wgt. Average           9.84     0.66   0.87

ℹ Over 90% of spans have lengths of 1 -- 21 (min=1, max=57). The most
common span lengths are: 2 (1.47%), 3 (2.01%), 4 (4.28%), 5 (6.02%), 6 (7.62%),
7 (7.35%), 8 (7.09%), 9 (7.35%), 10 (7.62%), 11 (7.35%), 12 (5.61%), 13 (4.68%),
14 (4.68%), 15 (2.67%), 16 (2.67%), 17 (2.54%), 18 (3.21%), 19 (2.27%), 20
(1.87%), 21 (2.54%). If you are using the n-gram suggester, note that omitting
infrequent n-gram lengths can greatly improve speed and memory usage.
⚠ Spans may not be distinct from the rest of the corpus
⚠ Boundary tokens are not distinct from the rest of the corpus
✔ Good amount of examples for all labels

================================== Summary ==================================
✔ 4 checks passed
⚠ 6 warnings

My results with camembert-base:

=========================== Initializing pipeline ===========================
proxies= None
token= None
f= vocab_file file_path= https://huggingface.co/camembert-base/resolve/main/sentencepiece.bpe.model
f= tokenizer_file file_path= https://huggingface.co/camembert-base/resolve/main/tokenizer.json
f= added_tokens_file file_path= https://huggingface.co/camembert-base/resolve/main/added_tokens.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
f= special_tokens_map_file file_path= https://huggingface.co/camembert-base/resolve/main/special_tokens_map.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
f= tokenizer_config_file file_path= https://huggingface.co/camembert-base/resolve/main/tokenizer_config.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
unres= ['added_tokens_file', 'special_tokens_map_file', 'tokenizer_config_file']
files= {'vocab_file': '/home/cache/transformers/dbcb433aefd8b1a136d029fe2205a5c58a6336f8d3ba20e6c010f4d962174f5f.160b145acd37d2b3fd7c3694afcf4c805c2da5fd4ed4c9e4a23985e3c52ee452', 'tokenizer_file': '/home/cache/trans
formers/84c442cc6020fc04ce266072af54b040f770850f629dd86c5951dbc23ac4c0dd.8fd2f10f70e05e6bf043e8a6947f6cdf9bb5dc937df6f9210a5c0ba8ee48e959'}
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'spancat']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  -------------  ------------  ----------  ----------  ----------  ------
  0       0        1131.02        147.16        0.07        0.04        4.28    0.00
  3     200      114324.40      16067.41        0.00        0.00        0.00    0.00
  6     400           0.00         97.01        0.00        0.00        0.00    0.00
  9     600           0.00         98.00        0.00        0.00        0.00    0.00
 13     800           0.00        100.00        0.00        0.00        0.00    0.00
 16    1000           0.01        106.01        0.00        0.00        0.00    0.00
 19    1200           0.00         93.00        0.00        0.00        0.00    0.00
 23    1400           0.00        106.00        0.00        0.00        0.00    0.00
 26    1600           0.00        103.99        0.00        0.00        0.00    0.00
✔ Saved pipeline to output directory

And my config file:


[paths]
train = tmp/train_for_spacy/train.spacy
dev = tmp/train_for_spacy/dev.spacy
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "fr"
pipeline = ["transformer","spancat"]
batch_size = 32
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "camembert-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

I confess I don't understand where my problem comes from, because my case is quite close to the tutorial's. It is even simpler, because I have only one label.
Moreover, the spans are often easy to identify because they often start with:
"Je souhaiterais ..." ("I would like ...")
"J'aimerais ..." ("I would like ...")
"Je voudrais ..." ("I would like ...")

I don't understand why the model can't capture that.
I can send a sample of data if it helps (just tell me where). But right now I admit to being pretty desperate.

That's indeed pretty strange. I'll gladly have a look though.

Could you share maybe 20 examples here? You don't have to send private data, but any dataset that you are able to generate that has the same behavior will help me to try and reproduce the error on my machine. It feels like this might be an error on the spaCy side of things, but the only way to know for sure is to give this a spin myself.

That said. This line is making me wonder if there's perhaps a configuration gone awry:

Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

I tested several things and I can get a result. It's not great, but it's progressing :). Here is what I did:

  • Limit the size of the annotated sentences: I cap them at 10 words maximum, keeping them as short as possible.
  • Increase the number of samples. Of everything I tested, this is what works best.

Here are the results I managed to get. I include two runs to show the progression.
Here are the results with 300 examples:

Here are the results with 600 examples:

=========================== Initializing pipeline ===========================
proxies= None
token= None
f= vocab_file file_path= https://huggingface.co/camembert-base/resolve/main/sentencepiece.bpe.model
f= tokenizer_file file_path= https://huggingface.co/camembert-base/resolve/main/tokenizer.json
f= added_tokens_file file_path= https://huggingface.co/camembert-base/resolve/main/added_tokens.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
f= special_tokens_map_file file_path= https://huggingface.co/camembert-base/resolve/main/special_tokens_map.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
f= tokenizer_config_file file_path= https://huggingface.co/camembert-base/resolve/main/tokenizer_config.json
Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
unres= ['added_tokens_file', 'special_tokens_map_file', 'tokenizer_config_file']
files= {'vocab_file': '/home/cache/transformers/dbcb433aefd8b1a136d029fe2205a5c58a6336f8d3ba20e6c010f4d962174f5f.160b145acd37d2b3fd7c3694afcf4c805c2da5fd4ed4c9e4a23985e3c52ee452', 'tokenizer_file': '/home/cache/transformers/84c442cc6020fc04ce266072af54b040f770850f629dd86c5951dbc23ac4c0dd.8fd2f10f70e05e6bf043e8a6947f6cdf9bb5dc937df6f9210a5c0ba8ee48e959'}
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'spancat']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  -------------  ------------  ----------  ----------  ----------  ------
  0       0        8632.71       1236.75        0.13        0.06        6.83    0.00
  4     200      112424.17      16173.88        0.00        0.00        0.00    0.00
  8     400           0.00        556.94        0.00        0.00        0.00    0.00
 13     600           8.96        535.38        0.00        0.00        0.00    0.00
 17     800          72.50        516.99        0.00        0.00        0.00    0.00
 22    1000          76.54        337.19        6.93       30.77        3.90    0.07
 26    1200          21.92        226.70        2.83       42.86        1.46    0.03
 31    1400          17.84        198.35        6.54       77.78        3.41    0.07
 35    1600           4.95        174.61        5.61       66.67        2.93    0.06
 40    1800           0.04        171.06        6.48       63.64        3.41    0.06
 44    2000           0.00        169.02        6.48       63.64        3.41    0.06
 49    2200           0.00        173.00        6.48       63.64        3.41    0.06
 54    2400           1.57        177.17        8.14       56.25        4.39    0.08
 58    2600           9.14        167.11       10.79       36.11        6.34    0.11
 63    2800           9.40        176.34        4.63       45.45        2.44    0.05
 67    3000          14.23        170.97        3.72       40.00        1.95    0.04
 72    3200          23.28        164.27        1.88       25.00        0.98    0.02
 76    3400           5.68        163.99        4.65       50.00        2.44    0.05
 81    3600          14.72        157.71        5.53       50.00        2.93    0.06
 85    3800           1.66        150.67        7.14       42.11        3.90    0.07
 90    4000           2.70        140.81        5.43       37.50        2.93    0.05
 94    4200           1.42        142.18        3.76       50.00        1.95    0.04
 99    4400           0.01        150.01        3.67       30.77        1.95    0.04
103    4600           0.02        139.02        3.69       33.33        1.95    0.04
108    4800           0.00        148.00        3.69       33.33        1.95    0.04
112    5000           0.00        140.00        3.72       40.00        1.95    0.04
117    5200           0.00        142.99        3.72       40.00        1.95    0.04
121    5400           0.00        144.00        4.57       35.71        2.44    0.05
126    5600           0.00        143.98        4.57       35.71        2.44    0.05
130    5800           0.01        143.94        5.38       33.33        2.93    0.05

The drawback here is that my recall is very bad. This is due to the large variety of possible annotated sentences. Given the progression of the recall, I'm afraid it would take several thousand examples to get a good score.

I think that adding rules with spaCy would allow me to limit the number of bad examples. I have to dig into this aspect; do you have a tutorial to recommend?

To finish, here is a file with 20 samples. Could you please delete the file once you've downloaded it?
for_help_prodigy.jsonl (252.1 KB)

I figured I'd glance over some of your examples, and I may have found something that helps explain your issue.

Example A

This is from a screenshot in Prodigy:

Example B

This is from a screenshot in Prodigy:

Comparison

When I translate the highlighted elements of both examples, it seems that example A highlights "j'aimerais donc du vote par correspondance et du electronique", meaning "I would therefore like postal and electronic voting", and that example B highlights "vote par correspondance et du electronique", meaning "postal and electronic voting".

While the translations might be slightly off, it does seem clear that example A includes the "I would like" part of the text while example B skips it ("j'aimerais" is not highlighted). It's possible that I'm zooming in on one example pair here, but this inconsistency might help explain the results that you're seeing.

Figured I'd double-check and ask: when should the highlighted span include the "I would like" part?

It's great that you point out this example, because this is one of the things I corrected by reducing the size of the annotated sentences, and that's what allowed me to get a score other than 0. The data in the file is from when I was consistently getting a score of 0.

I think that the beginning of the sentence ("I would like", "I wish", "I want") is not important information to extract; it's what comes after that really carries the information. On the other hand, it is important to make the spancat understand that it is often the words that follow these phrases that are interesting. But in order to limit the size of the annotated sentence, I deliberately excluded all the parts of the sentences that did not contain relevant information. This works well for precision, but not for recall.
Hence my idea to add rules, because I'm afraid that otherwise the number of examples needed to train the spancat would be too large.

What do you think about it?

I also asked myself the following question:
Knowing that I have only one label, and therefore no overlapping spans, wouldn't it be relevant to use DEFAULT_SPANCAT_SINGLELABEL_MODEL as defined here:

The problem is that I don't see how to add it to my config file.

Hence my idea to add rules, because I'm afraid that otherwise the number of examples needed to train the spancat would be too large.

My primary concern would be inconsistent labels; as long as those are in there, you'll have a hard time training models because there is no solid definition of the ground truth.

Generally, I fear that there's no way around needing a large enough, high-quality dataset. But ... there is a trick that came to mind during my morning walk today that might help. Have you considered investigating the noun chunks in your text? There may be an opportunity to re-use a trick I've mentioned here:

The example listed there is for Chinese, but I imagine it could work for French too. You could first make a dataset that contains all the relevant grammatical chunks (mainly noun chunks may suffice; only the data can tell) and then you might be able to annotate the ones that are an "intention". The goal would be to annotate examples to help populate a patterns file, and I can imagine that there aren't that many intentions that people are asking for. You might be able to enumerate a lot of them.
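
A rough sketch of that idea (assuming a French pipeline with a parser, so that doc.noun_chunks is available; the file names are placeholders):

# Sketch: collect frequent noun chunks from the transcripts as candidate "intention"
# phrases, which can then be reviewed and turned into a patterns file.
from collections import Counter
import spacy

nlp = spacy.load("fr_core_news_md")                        # parser needed for noun_chunks
texts = [line.strip() for line in open("transcripts.txt", encoding="utf8")]

counts = Counter()
for doc in nlp.pipe(texts):
    for chunk in doc.noun_chunks:
        counts[chunk.text.lower()] += 1

for phrase, n in counts.most_common(50):
    print(n, phrase)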

I'm just mentioning this technique because it might help, but the only way to know for sure is to try it out. It's helped me in the past, although that was for more NER-type models.

The problem is that I don't see how to add it to my config file.

I'll gladly answer any question that you might have on Prodigy, but it would be fair to say that my knowledge of spaCy details is somewhat limited. So just to mention: have you seen our spaCy discussion forum? It's where the spaCy team members hang out, and they usually give very in-depth answers on spaCy, in more detail than we might provide here. For example, here's a bunch of questions related to spancat. If you have a detailed spancat question that's unrelated to Prodigy, it would make sense to ask it there.