spancat annotation best practices


I need to train a model to extract some information from a transcribed conversation between a client and an agent. The conversation can last up to 20 minutes. I'm thinking of using the spancategorizer model because I have nested labels.
Conversations are often broken down like this:
Agent : Hello, what can I do for you?
Customer : Hello, I want information about a product
Agent: Yes, what product is it?
Customer: It's the new smartphone of the brand XXX
From what I understand, it is better to limit the size of the samples to be annotated. Is there a recommended limit on the sample size, for example a certain number of characters or words (100, 200 or 500 words)?

Is there a document that lists the best practices for annotation in spaCy? I saw the INES FAQ #1: Tips & tricks for NLP, annotation & training with Prodigy and spaCy.
Are there any other documents?

Thanks for your help

One of the reasons why we often recommend limiting the size is that it can be annoying to scroll while annotating. However, in your case, I can imagine that it's not just spans that you're interested in, but also perhaps relationships between spans. That might be a reason to keep the entire conversation on display. That said, 20 minutes does feel impractically long.

Best practices differ a lot depending on the application. My best advice is to try annotating a few examples and then to reflect on what might make life easier. Although unrelated to your use-case, you may appreciate this tutorial where I go through many iterations of a labelling interface as I'm working on a deduplication problem.

If you can share more details about your use-case I might be able to think along and share some more ideas. Let me know!

Hello Vincent,

Thanks a lot for your help. The video is very interesting, I didn't think it was possible to customize Prodigy so much.

Concerning my case, I completely agree with you: 20 minutes is long and tedious, especially since there are 30 possible labels to identify.
The question is how to cut the conversation without losing information that is necessary for the overall understanding, so as not to make mistakes in the annotation.
I'll think about it. It should be possible to find rules for making these cuts, and then I'll test them on about fifty examples to see if the approach is relevant.
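One possible rule for cutting, shown here only as a minimal sketch: split the transcript into speaker turns and group them into overlapping windows, so that each annotation task keeps a little of the preceding context. The function names, the window size and the overlap are assumptions made for illustration, not Prodigy or spaCy settings, and the speaker prefixes are assumed to be "Agent" and "Customer" as in the sample above.

```python
import re

# Assumption: every turn starts with "Agent:" or "Customer:" (optional spaces
# around the colon), as in the sample conversation.
TURN_RE = re.compile(r"^\s*(Agent|Customer)\s*:\s*(.*)$")


def parse_turns(transcript):
    """Return a list of (speaker, text) tuples, one per conversational turn."""
    turns = []
    for line in transcript.splitlines():
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif turns and line.strip():
            # A line without a speaker prefix continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line.strip())
    return turns


def window_turns(turns, size=6, overlap=2):
    """Group turns into windows of `size` turns sharing `overlap` turns."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(turns), step):
        windows.append(turns[start:start + size])
        if start + size >= len(turns):
            break  # the last window already reaches the end of the conversation
    return windows


transcript = """Agent : Hello, what can I do for you?
Customer : Hello, I want information about a product
Agent: Yes, what product is it?
Customer: It's the new smartphone of the brand XXX"""

turns = parse_turns(transcript)
windows = window_turns(turns, size=3, overlap=1)
for window in windows:
    print("\n".join(f"{speaker}: {text}" for speaker, text in window))
    print("---")
```

Whether a fixed number of turns is the right unit is something only a test on real conversations can show. The overlap is there so that a reply such as "the first one is good" is not separated from the turn it refers to.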

Here are also some examples where I am unsure how to annotate.

Example 1:
Agent: I have two phone numbers to reach you: 06 XX XX XX XX and 01 XX XX XX XX XX
Customer: then the first one is good but not the second one.

I have a label that represents the customer's phone number.

Does it make sense to annotate only the first phone number, because in this conversation we can see that it is the only valid one? This is easily understandable for a human. Or would it be a mistake not to annotate the second one, since it is also a phone number (by format), even if it is not valid?

In other words, does the attention mechanism of spancat reach far enough into the conversation to connect the second phone number with the fact that it is not valid, which is why we would not annotate it?

Example 2:
Agent: I have three products to offer you: product A, product B and product C
Customer: I will take product A
Agent: Sorry, I didn't hear you, could you say that again?
Customer: I'll take product A

The product label represents the product chosen by the customer. Do I have to annotate the product twice?

Example 3:
Agent: What is your email please?
Customer : my email is david
Agent : yes
Customer : . tourneau
Agent: ok
Customer : @gmail .com
Agent : ok so david dot tourno at

In this case we are not sure about the email. Unfortunately, this is a very common case. Is it better to annotate each of the parts said by the customer, the string repeated by the agent, or both? The spelling of both is potentially wrong because of the transcription.

Thanks for your help

You can always use the output of a model as input for another system. In the case of detecting a phone number, I can totally see how you'd first have a system that detects all the phone numbers in a conversational turn (maybe using just some regexes to start with) and then have a second system determine which of these phone numbers is valid, if any. This also seems like an easier system to build and maintain because there is a clear separation of concerns. I may be glossing over a detail though.
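To make the first stage concrete, here is a rough sketch using only a regex. It assumes numbers written as five pairs of digits starting with 0, as in the example; the pattern, the helper names and the digits in the sample text are made up for illustration. The second stage is deliberately left open: each candidate is simply paired with the next turn, which a later rule set or classifier could use to decide validity.

```python
import re

# Assumption: numbers look like "06 12 34 56 78" (five digit pairs, leading 0),
# optionally separated by spaces or dots. Adjust for your own transcripts.
PHONE_RE = re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b")


def find_phone_numbers(text):
    """Stage 1: return (start, end, matched_text) for every candidate number."""
    return [(m.start(), m.end(), m.group()) for m in PHONE_RE.finditer(text)]


def build_candidates(turns):
    """Pair each detected number with the following turn.

    The 'reply' field is the context a second system could inspect to decide
    which numbers the customer confirmed. That decision is not made here.
    """
    candidates = []
    for i, (speaker, text) in enumerate(turns):
        reply = turns[i + 1][1] if i + 1 < len(turns) else ""
        for start, end, number in find_phone_numbers(text):
            candidates.append(
                {"turn": i, "start": start, "end": end,
                 "number": number, "reply": reply}
            )
    return candidates


# Placeholder digits, invented for this sketch.
turns = [
    ("Agent", "I have two phone numbers to reach you: "
              "06 12 34 56 78 and 01 98 76 54 32"),
    ("Customer", "then the first one is good but not the second one."),
]

candidates = build_candidates(turns)
for candidate in candidates:
    print(candidate["number"], "| reply:", candidate["reply"])
```

With this split, the annotation question becomes simpler: every string that looks like a phone number is a candidate, and validity is handled separately.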

I think the same logic might apply to your 2nd example, but to your third example ... I wonder if there's a label we might introduce like "start of email dialogue" and "end of email dialogue".

It's worth mentioning that many of these suggestions are just ideas. They might work out well, but the only way to know for sure is to try them and to reflect while you're annotating.

Feel free to keep the conversation going though if you're interested in diving deeper into some more examples.

If I understand correctly, in the case of phone numbers you recommend annotating all phone numbers, and then using a separate processing step to determine whether each number is valid by analyzing the customer's answer?
In my case, that would be the fact that the client says the first one is good and the second one is not.

I thought that spancat could directly determine this from the client's response through the attention mechanism. In any case, I have to test it to see what it does.

Another question, in my case, I have 500 annotated conversations. Some labels are complex like determining the client request. There are about fifty different types of requests:

  • ask for information about a product
  • buy a product
  • return a product
  • etc...

Will only experience tell me how many examples the model needs to succeed at this task, or are there known orders of magnitude? For example, that you need at least 10 different formulations for each type of customer request.

Back when I worked at Rasa I recall recommending folks to have at least 10 examples per intent. This is a rough guesstimate, though. The most important thing isn't the number of examples, but rather that you use actual examples of text that your users give you. If you have 100 examples per intent, but they are unlike the text that your users generate, then they will still have a bad time interacting with your system.
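A quick way to see which request types fall below such a threshold is to count the annotated examples per label. This is only a sketch: the function name, the label names and the example texts are invented, and the default of 10 is the rough guesstimate above, not a guarantee of model quality.

```python
from collections import Counter


def underrepresented(examples, labels, minimum=10):
    """Return {label: count} for every label with fewer than `minimum` examples.

    `examples` is a list of (text, label) pairs. `labels` is the full list of
    expected labels, so labels with zero examples are reported too.
    """
    counts = Counter(label for _text, label in examples)
    return {label: counts[label] for label in labels if counts[label] < minimum}


# Invented examples and label names, for illustration only.
examples = [
    ("I want information about a product", "ask_information"),
    ("Can you tell me more about the new smartphone?", "ask_information"),
    ("I will take product A", "buy_product"),
]
labels = ["ask_information", "buy_product", "return_product"]

too_few = underrepresented(examples, labels, minimum=2)
print(too_few)
```

A count like this says nothing about whether the examples resemble what real users say, which is the more important point.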

I can imagine that starting with 50 intents is a lot, though. So I wonder, do you really need them all? Could you make an assistant with fewer features first before adding them all? I've seen some projects fail because folks tried to make an assistant that tries to do everything instead of starting with an assistant that can do a few things very well.

Another "tip" that I might offer: an FAQ intent can sometimes be handled by a two-step approach as well. The first step would be recognising that it's an FAQ question and the second step is figuring out if an answer for the question exists in a pre-defined list.

It's funny that you mention Rasa, because I had just started to use it, and especially the DIETClassifier. What moved me towards spaCy is nested annotations, something that I don't think is possible with Rasa.

I will follow your advice and try to limit the number of intents. I think it is possible to group them by family.

Thanks for your advice.

You can use the DIETClassifier together with a spaCy component and have the spaCy component handle all the entity detection. I think there are a lot of benefits to doing it this way, mainly that you're also able to use the pattern matching tools to detect entities. Here's an old post of mine on the Rasa blog that explains it in more detail:

Another tip: you might enjoy using the LogisticRegressionClassifier instead of DIET when starting out. It is much more lightweight than DIET and still tends to perform pretty well.

Thanks a lot Vincent, I'm going to test all this.

One last question about the labeling: is it relevant to pass the model examples without labels? There are long passages in the conversation that do not carry any relevant information. Is it useful to include these examples so that the model learns that there is nothing relevant in them, or not?

I ran a test and did not find that adding these examples made things any better, but I still wanted confirmation in case the model could learn something from them.

There can be good reasons to ignore some examples. I recall an instance where a Rasa assistant would sometimes get a user who wrote messages the length of long emails. This caused any number of interesting issues, because chatbots aren't designed to handle pages of text, but examples like this are so rare and so out of the ordinary that it can be fine to ignore them.

Then again, if many users had shown this behavior, then we couldn't ignore it. But some of this isn't solved by the ML part of the system, rather by introducing a better UI for the user that makes it clear what kind of messages the bot does/doesn't understand.