spaCy NER - tokeniser for camembert-base

Hence my idea to add rules, because I am afraid that otherwise the number of examples needed to train the spancat would be too large.

My primary concern would be inconsistent labels; as long as those are in there, you'll have a bad time training models because there is no solid definition of ground truth.

Generally, I fear that there's no way around needing a sufficiently large, high-quality dataset. But ... there is a trick that came to mind during my morning walk today that might help. Have you considered investigating the noun chunks in your text? There may be an opportunity to re-use a trick I've mentioned here:

The example listed there is for Chinese, but I imagine it could work for French too. You could first make a dataset that contains all the relevant grammatical chunks (noun chunks alone may suffice, but only the data can tell) and then you might be able to annotate the ones that express an "intention". The goal would be to annotate examples to help populate a patterns file, and I can imagine that there aren't that many intentions that people are asking for. You might be able to enumerate a lot of them.
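To make that concrete, here's a rough sketch of the extraction step. The model name (`fr_core_news_sm`), the example texts and the `INTENTION` label are my own placeholders, so treat this as a starting point rather than a recipe:

```python
import json
import spacy

# A rough sketch of the chunk-extraction idea. The model name, the
# example texts and the INTENTION label are all assumptions; any
# French pipeline with a parser (needed for noun chunks) will do.
nlp = spacy.load("fr_core_news_sm")

texts = [
    "Je voudrais réserver une table pour deux personnes.",
    "Est-ce que je peux annuler ma commande ?",
]

# Collect the noun chunks so they can be reviewed by hand; the ones
# that turn out to describe an "intention" can seed a patterns file.
chunks = sorted(
    {chunk.text.lower() for doc in nlp.pipe(texts) for chunk in doc.noun_chunks}
)

# Prodigy patterns files are JSONL with one {"label", "pattern"}
# object per line; string patterns are matched via the PhraseMatcher.
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for chunk in chunks:
        line = json.dumps({"label": "INTENTION", "pattern": chunk}, ensure_ascii=False)
        f.write(line + "\n")
```

From there you could curate `patterns.jsonl` by hand and pass it to a recipe that accepts `--patterns` to pre-highlight candidates during annotation.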

I'm just mentioning this technique because it might help, but the only way to know for sure is to try it out. It's helped me in the past, although that was for more NER-type models.

The problem is that I don't see how to add it to my config file.

I'll gladly answer any questions that you might have on Prodigy, but it would be fair to say that my knowledge of spaCy internals is somewhat limited. So, just to mention: have you seen our spaCy discussion forum? It's where the spaCy team members hang out, and they usually give very in-depth answers on spaCy in more detail than we might provide here. For example, here's a bunch of questions related to spancat. If you have a detailed spancat question that's unrelated to Prodigy, it would make sense to ask it there.
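That said, since the config question might just need a pointer: with spacy-transformers, the Hugging Face model (and its tokenizer) is usually selected via the `name` setting of the transformer component's model. A minimal sketch, assuming spacy-transformers is installed; in a training config the same settings would live under `[components.transformer.model]`:

```python
import spacy

# A minimal sketch, assuming spacy-transformers is installed.
# In a training config these settings live under
# [components.transformer.model]; here they're passed inline.
nlp = spacy.blank("fr")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            # Any Hugging Face model name; its tokenizer is loaded with it.
            "name": "camembert-base",
            "tokenizer_config": {"use_fast": True},
        }
    },
)
nlp.initialize()  # downloads/loads the camembert-base weights
```

But for the details of wiring that transformer into a spancat pipeline, the spaCy forum really is the better venue.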