NER detection and comma (,)

Hello,

I understood a lot of things today, thanks for that.
But I have another problem.
Why does the NER system detect the comma (,) as an entity with the original model (fr)?
Sometimes it's LOC, sometimes ORG or MISC.

I didn't have this problem before with spaCy 2.0.5.
It's really strange behavior.

I'm using text copy-pasted from Wikipedia, formatted as UTF-8 without BOM.

Thanks for your time.

Does this happen in spaCy when you look at doc.ents, or when you use Prodigy and a recipe like ner.teach?

spaCy's detection of entities shouldn't have changed between v2.0.5 and the latest version. The French entity recognizer was also trained on data from Wikipedia – so if you use Wikipedia text, it should perform okay.

When you use Prodigy and an active learning recipe, you don't always see the "best" entity analyses. Instead, ner.teach will usually ask you about entities that the model is most unsure about. Prodigy will get all possible analyses for the text, and ask you about the entities with a prediction closest to 0.5.

For example, let's say spaCy produces the following analysis for the phrase "Merci, Google.". The numbers are the confidence of the entity predictions. (Disclaimer: This is just an example, so the numbers are not real. Also, sorry for my stupid example – I wish my French was better :wink:)

Merci, Google. --> ['Merci', ',', 'Google', '.']
                      0.2    0.5     0.9
                      LOC    MISC    ORG

If you only run spaCy, doc.ents will probably just return one entity: (Google, ORG). Prodigy, on the other hand, may skip this example, because the confidence of 0.9 is very high. Instead, it will ask you about the prediction with a 0.5 confidence, the comma. Those predictions are also the ones that need your "human feedback" the most – usually more than the entities the model is already very sure about.
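To make the selection logic concrete, here's a toy sketch of that kind of uncertainty sampling (just an illustration, not Prodigy's actual implementation):

# Toy illustration of uncertainty sampling – not Prodigy's real code.
# Candidate entity analyses as (text, label, confidence) tuples:
candidates = [("Merci", "LOC", 0.2), (",", "MISC", 0.5), ("Google", "ORG", 0.9)]

# Ask about the prediction whose confidence is closest to 0.5:
most_uncertain = min(candidates, key=lambda c: abs(c[2] - 0.5))
print(most_uncertain)  # (',', 'MISC', 0.5) – the comma is queued for annotation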

If you just want to run a model over your text and correct its predictions (i.e. spaCy's doc.ents) with no active learning, you can also use the ner.make-gold recipe.

Hello,

It happens when I copy-paste text from Wikipedia into a text file (Notepad++ for example) and use doc.ents with spaCy.
If I put the text in a variable like this:

var = "texte here, …
… other sentence "

spaCy works well.
But the same text read from a text file with sentence = open(test_text, "r").read()
gives me slightly different results with doc.ents (very close, but a little bit different, and commas are detected as LOC / ORG / MISC, etc.)

Example from the text file:
Mot :
, Entity : ORG
Mot :
, Entity : LOC
Mot : GBR, Entity : MISC
Mot :
, Entity : MISC
Mot :
, Entity : ORG
Mot : 3D de bâtiments, Entity : MISC
Mot : Antiquité, Entity : MISC
Mot : Rome antique, Entity : LOC
Mot :
, Entity : LOC
Mot :
, Entity : LOC
Mot : Cyclades, Entity : MISC
Mot : Péloponnèse, Entity : LOC
Mot :
, Entity : LOC
Mot : Pausanias, Entity : PER
Mot : Byzès de Naxos, Entity : PER
Mot : pin de Macédoine, Entity : LOC
Mot : Péloponnèse, Entity : LOC
Mot : Sicyone, Entity : LOC
Mot : Égypte, Entity : LOC
Mot :
, Entity : LOC
Mot :
, Entity : ORG
Mot :
, Entity : ORG
Mot : Périclès, Entity : PER
Mot : Plutarque, Entity : PER
Mot : France, Entity : LOC
Mot : Encyclopédique de Roret, Entity : LOC
Mot : tuile de, Entity : LOC
Mot : tuile d’Altkirch, Entity : LOC
Mot :

The same text in a variable:

Mot : GBR, Entity : MISC
Mot : Antiquité, Entity : MISC
Mot : Rome antique, Entity : LOC
Mot : Cyclades, Entity : MISC
Mot : Péloponnèse, Entity : LOC
Mot : Pausanias, Entity : PER
Mot : Byzès, Entity : PER
Mot : Naxos, Entity : LOC
Mot : Macédoine, Entity : LOC
Mot : Péloponnèse, Entity : LOC
Mot : Sicyone, Entity : LOC
Mot : Égypte, Entity : LOC
Mot : Périclès, Entity : PER
Mot : Plutarque, Entity : PER
Mot : France, Entity : LOC
Mot : Encyclopédique de Roret, Entity : LOC
Mot : tuile de, Entity : LOC
Mot : tuile d’Altkirch, Entity : LOC

Maybe open().read() is not best practice.

spaCy's entity recognizer is sensitive to whitespace characters – it assumes that if whitespace is provided, it might be important. That might be why you see different results when you copy-paste the text and add one sentence per line.
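
If you want to keep reading the text from a file, one workaround is to normalise the whitespace before passing it to spaCy, so the model sees the same input as your hard-coded string. A minimal sketch (the file name is just an example):

import re
import spacy

nlp = spacy.load("fr_core_news_sm")

with open("test_text.txt", encoding="utf8") as f:
    text = f.read()

# Collapse newlines and runs of whitespace into single spaces,
# so stray line breaks don't end up inside entity spans.
text = re.sub(r"\s+", " ", text).strip()

doc = nlp(text)
for ent in doc.ents:
    print("Mot :", ent.text, ", Entity :", ent.label_)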

The small French model is a good starting point – but it's never going to be perfect. You'll always need to improve and tune it on your specific data. Sourcing good training data is difficult – this is also why we developed Prodigy :wink:

An easy way to improve the model on your data is to use ner.teach. You can find more details and examples in the named entity recognition workflow.

prodigy ner.teach your_dataset fr_core_news_sm /path/to/your_data.jsonl

This will ask you about the predictions that the model is most uncertain about, and you can accept or reject them.

You can also use ner.make-gold, which will only show you the entities available in doc.ents. You can then correct them manually. This lets you remove the wrong entities for commas etc., and add entities that the model missed:

prodigy ner.make-gold your_dataset fr_core_news_sm /path/to/your_data.jsonl --label ORG
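
Both recipes expect the source file to be newline-delimited JSON (JSONL), with one object per line and the raw text under a "text" key – roughly like this (the sentences are just placeholders):

{"text": "Le Péloponnèse est une péninsule."}
{"text": "Merci, Google."}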

I don't really know what's happening.
I removed all whitespace between punctuation and the problem is still the same.
Sometimes spaCy and Prodigy want me to annotate just a : or just the - in J-C.
In this case, how can I create training data and revision data manually?

Hello,

OK, sorry, I just understood what you meant.
I will try this recipe.

Thanks
