Hello Vincent,
Thanks a lot for your help. The video is very interesting; I didn't think it was possible to customize Prodigy so much.
As for my case, I completely agree with you: 20 minutes is long and tedious, especially since there are 30 possible labels to identify.
The question is how to cut the conversation without losing information that is needed to understand it as a whole, since losing that context would lead to annotation mistakes.
I'll think about it; we should be able to find rules for making these cuts. Then I'll test them on about fifty examples to see whether the approach is relevant.
Here are also some examples where I am unsure how to annotate:
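To make the cutting idea concrete, here is a minimal sketch of one possible rule: fixed-size windows of turns that overlap by a few turns, so each chunk keeps some of the preceding context. It assumes the transcript is already a list of (speaker, text) pairs; the function name `split_turns` and the `window` and `overlap` values are placeholders I made up, not something taken from Prodigy.

```python
from typing import Iterator, List, Tuple

Turn = Tuple[str, str]  # (speaker, text); this layout is an assumption


def split_turns(
    turns: List[Turn], window: int = 6, overlap: int = 2
) -> Iterator[List[Turn]]:
    """Yield chunks of `window` turns; consecutive chunks share `overlap` turns."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    for start in range(0, len(turns), step):
        yield turns[start:start + window]
        # Stop once a chunk has reached the end of the conversation,
        # so we do not emit a trailing chunk made only of repeated turns.
        if start + window >= len(turns):
            break


# Hypothetical 10-turn conversation, alternating speakers.
conversation = [
    ("Agent" if i % 2 == 0 else "Customer", f"turn {i}") for i in range(10)
]
chunks = list(split_turns(conversation, window=4, overlap=1))
```

The overlap is what I would tune on the fifty test examples: too small and the context is lost, too large and the same spans get annotated many times.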
Example 1:
Agent: I have two phone numbers to reach you: 06 XX XX XX XX and 01 XX XX XX XX XX
Customer: then the first one is good but not the second one.
I have a label that represents the customer's phone number.
Does it make sense to annotate only the first phone number, since the conversation shows that it is the only valid one? This is easy for a human to understand. Or would it be a mistake not to annotate the second one, since it also has the format of a phone number, even though it is not valid?
In other words, does the attention mechanism of spancat reach far enough into the conversation to connect the second phone number with the fact that it is not valid, which is why we would not annotate it?
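To show the two options side by side, here is a small sketch of how each would look as an annotated example with character offsets (`start`, `end`, `label`), which is the span layout Prodigy tasks use as far as I understand it. The label name `PHONE` and the helper `make_span` are only illustrative, not my real label scheme.

```python
from typing import Dict


def make_span(text: str, phrase: str, label: str) -> Dict[str, object]:
    """Build a character-offset span for the first occurrence of `phrase`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return {"start": start, "end": start + len(phrase), "label": label}


text = (
    "Agent: I have two phone numbers to reach you: "
    "06 XX XX XX XX and 01 XX XX XX XX XX\n"
    "Customer: then the first one is good but not the second one."
)
first_number = "06 XX XX XX XX"
second_number = "01 XX XX XX XX XX"

# Option A: annotate only the number the customer confirms as valid.
option_a = {"text": text, "spans": [make_span(text, first_number, "PHONE")]}

# Option B: annotate every string that has the format of a phone number.
option_b = {
    "text": text,
    "spans": [
        make_span(text, first_number, "PHONE"),
        make_span(text, second_number, "PHONE"),
    ],
}
```

With option A the model has to use the customer's reply to tell the two numbers apart; with option B it only has to recognize the format, and validity would need to be handled in a later step.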
Example 2:
Agent: I have three products to offer you: product A, product B and product C
Customer: I will take product A
Agent: sorry, I didn't hear you, say that again
Customer: I'll take product A
The product label represents the product chosen by the customer. Do I have to annotate the product twice?
Example 3:
Agent: What is your email please?
Customer: my email is david
Agent: yes
Customer: . tourneau
Agent: ok
Customer: @gmail .com
Agent: ok so david dot tourno at gmail.com
In this case we cannot be sure of the email address. Unfortunately, this is a very common case. Is it better to annotate each of the parts said by the customer, the string repeated by the agent, or both? The spelling of either may be wrong because of the transcription.
Thanks for your help