Custom entity list separator when using LLM

Hello!

I would like to use ner.llm.fetch, but my entities can contain a comma: for example, "John Smith, PhD" gets split into two entities, "John Smith" and "PhD".
Is there a way to customize the recipe to use a different separator, such as a semicolon?
(I did see a parameter to set a custom prompt, but without also changing the code that does the parsing, I guess that won't work.)
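To illustrate what I mean, here's a minimal Python sketch (my guess at the behaviour, not the recipe's actual code) of why a comma delimiter breaks such entities, and why a semicolon would not:

# Hypothetical sketch of comma-based parsing of one LLM response line
# (not the actual ner.llm.fetch code, just to illustrate the problem).
response_line = "PERSON: John Smith, PhD, Jane Doe"

# Splitting on commas treats the suffix as a separate entity:
comma_entities = [e.strip() for e in response_line.split(":", 1)[1].split(",")]
print(comma_entities)   # ['John Smith', 'PhD', 'Jane Doe']

# With a semicolon delimiter the full name survives intact:
response_line_semi = "PERSON: John Smith, PhD; Jane Doe"
semi_entities = [e.strip() for e in response_line_semi.split(":", 1)[1].split(";")]
print(semi_entities)    # ['John Smith, PhD', 'Jane Doe']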

Hi Didier!

Interesting. Could you share the LLM provider that you're using and your spaCy config file?

My initial feeling towards a fix would be to add label definitions and make it explicit that commas can appear inside an entity. But it also seems very likely that our parser is to blame here.
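For example (just a sketch, the label name and wording here are made up), something along these lines in the config:

[components.llm.task.label_definitions]
PERSON = "Full names of people, including any titles or suffixes such as 'PhD' that may follow a comma."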

I'll notify some colleagues in the meantime, but if you could share the extra info that would be helpful.

Hi Vincent,

Thank you for the quick reply.
I'm using GPT-3.5. My spacy-llm config looks like this:

[nlp]
lang = "fr"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = [...]

[components.llm.task.label_definitions]
...

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10

I couldn't help but notice that you're dealing with a French use case. Out of curiosity, are you happy with the performance? My impression is that OpenAI works best with English, but I've seen it do all right for Dutch as well.

The spaCy team has "delimiter choice" on the roadmap, but in the meantime you might be able to unblock yourself by writing a custom task. The spaCy docs for this can be found here:

It does involve writing your own code, but you should be able to take inspiration from the NER example on GitHub. You can find the Jinja template for this task here.
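As a very rough sketch (untested, based on the spacy-llm task protocol; the class name, registry name, and prompt here are simplified assumptions, not the built-in NER task), a custom task that asks for semicolon-separated entities could look roughly like this:

from typing import Iterable

from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy_llm.registry import registry


# Hypothetical registry name; "labels" is assumed to be a comma-separated string,
# as in the spacy-llm docs example for custom tasks.
@registry.llm_tasks("my.SemicolonNER.v1")
def make_semicolon_ner(labels: str):
    return SemicolonNERTask(labels=[label.strip() for label in labels.split(",")])


class SemicolonNERTask:
    """Toy NER task that asks the LLM for semicolon-separated entities."""

    def __init__(self, labels):
        self._labels = labels

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # One prompt per doc, explicitly requesting a semicolon delimiter.
        for doc in docs:
            yield (
                f"Extract entities with the labels {', '.join(self._labels)} "
                "from the text below.\n"
                "Answer with one line per label, formatted as LABEL: ent1; ent2; ent3\n\n"
                f"Text:\n{doc.text}"
            )

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            spans = []
            for line in response.strip().splitlines():
                if ":" not in line:
                    continue
                label, _, ents = line.partition(":")
                label = label.strip()
                if label not in self._labels:
                    continue
                # Split on the semicolon delimiter, so commas stay inside entities.
                for ent in (e.strip() for e in ents.split(";") if e.strip()):
                    start = doc.text.find(ent)
                    if start == -1:
                        continue
                    span = doc.char_span(
                        start, start + len(ent), label=label, alignment_mode="contract"
                    )
                    if span is not None:
                        spans.append(span)
            doc.ents = filter_spans(spans)
            yield doc

You would then point the config at it, e.g. @llm_tasks = "my.SemicolonNER.v1" instead of "spacy.NER.v2" (again, the registry name above is just a placeholder).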

Thanks for the pointer, I'll check out how to write a custom task then.

As for my use case, it seems to work fine even though the prompt, the entities, and their definitions are in English while the dataset is in French.
Individual differences in performance (between entity types) seem to come down to how clear and precise my definitions were. For example, it had no trouble picking up person names, while it struggled with more abstract concepts like the name of a (legal) authority.
