Custom entity list separator when using LLM

Hello!

I would like to use ner.llm.fetch, but my entities can contain a comma: for example, "John Smith, PhD" gets split into two entities, "John Smith" and "PhD".
Is there a way to customize the recipe to use a different separator, such as a semicolon?
(I did see a parameter to set a custom prompt, but without also changing the code that does the parsing, I guess that won't work.)
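To illustrate what I mean, here's a minimal Python sketch (my guess at the behaviour, not the recipe's actual code) of why a comma delimiter breaks such entities, and why a semicolon would not:

# Hypothetical sketch of comma-based parsing of one LLM response line
# (not the actual ner.llm.fetch code, just to illustrate the problem).
response_line = "PERSON: John Smith, PhD, Jane Doe"

# Splitting on commas treats the suffix as a separate entity:
comma_entities = [e.strip() for e in response_line.split(":", 1)[1].split(",")]
print(comma_entities)   # ['John Smith', 'PhD', 'Jane Doe']

# With a semicolon delimiter the full name survives intact:
response_line_semi = "PERSON: John Smith, PhD; Jane Doe"
semi_entities = [e.strip() for e in response_line_semi.split(":", 1)[1].split(";")]
print(semi_entities)    # ['John Smith, PhD', 'Jane Doe']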

Hi Didier!

Interesting. Could you share the LLM provider that you're using and your spaCy config file?

My initial feeling towards a fix would be to add label definitions and make it explicit that commas can appear inside an entity. But it also seems very likely that our parser is to blame here.
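For example (just a sketch, the label name and wording here are made up), something along these lines in the config:

[components.llm.task.label_definitions]
PERSON = "Full names of people, including any titles or suffixes such as 'PhD' that may follow a comma."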

I'll notify some colleagues in the meantime, but if you could share the extra info that would be helpful.

Hi Vincent,

Thank you for the quick reply.
I'm using GPT-3.5. My spacy-llm config looks like this:

[nlp]
lang = "fr"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = [...]

[components.llm.task.label_definitions]
...

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10

I couldn't help but notice that you're dealing with a French use case. Out of curiosity, are you happy with the performance? My impression is that OpenAI works best with English, but I've seen it do all right for Dutch as well.

The spaCy team has "delimiter choice" on the roadmap, but in the meantime you might be able to unblock yourself by writing a custom task. The spaCy docs for this can be found here:

It does involve writing your own code, but you should be able to take inspiration from the NER example on GitHub. You can find the Jinja template for this task here.
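As a very rough sketch (untested, based on the spacy-llm task protocol; the class name, registry name, and prompt here are simplified assumptions, not the built-in NER task), a custom task that asks for semicolon-separated entities could look roughly like this:

from typing import Iterable

from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy_llm.registry import registry


# Hypothetical registry name; "labels" is assumed to be a comma-separated string,
# as in the spacy-llm docs example for custom tasks.
@registry.llm_tasks("my.SemicolonNER.v1")
def make_semicolon_ner(labels: str):
    return SemicolonNERTask(labels=[label.strip() for label in labels.split(",")])


class SemicolonNERTask:
    """Toy NER task that asks the LLM for semicolon-separated entities."""

    def __init__(self, labels):
        self._labels = labels

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # One prompt per doc, explicitly requesting a semicolon delimiter.
        for doc in docs:
            yield (
                f"Extract entities with the labels {', '.join(self._labels)} "
                "from the text below.\n"
                "Answer with one line per label, formatted as LABEL: ent1; ent2; ent3\n\n"
                f"Text:\n{doc.text}"
            )

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            spans = []
            for line in response.strip().splitlines():
                if ":" not in line:
                    continue
                label, _, ents = line.partition(":")
                label = label.strip()
                if label not in self._labels:
                    continue
                # Split on the semicolon delimiter, so commas stay inside entities.
                for ent in (e.strip() for e in ents.split(";") if e.strip()):
                    start = doc.text.find(ent)
                    if start == -1:
                        continue
                    span = doc.char_span(
                        start, start + len(ent), label=label, alignment_mode="contract"
                    )
                    if span is not None:
                        spans.append(span)
            doc.ents = filter_spans(spans)
            yield doc

You would then point the config at it, e.g. @llm_tasks = "my.SemicolonNER.v1" instead of "spacy.NER.v2" (again, the registry name above is just a placeholder).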

Thanks for the pointer, I'll check out how to write a custom task then.

As for my use case, it seems to work fine even though the prompt, the entities, and their definitions are in English while the dataset is in French.
Individual differences in performance (between entity types) seem to come down to how clear and precise my definitions were. For example, it had no trouble picking up person names, while it struggled with more abstract concepts like the name of a (legal) authority.
