A great article on how The Guardian is using spaCy and Prodigy to build a model to identify quotes in news articles. Highlights include: why it's so important to iterate on your data and carefully develop custom annotation schemes to deal with ambiguity in language, plus a very pretty custom UI theme for Prodigy!
The main challenge in building the training dataset was navigating the ambiguity of different journalistic styles. For several days, we discussed dozens of cases where it was difficult to make the right choice.
How should we treat song lyrics or poems? What about messages on placards? What if someone quotes their thoughts, something that has not been said aloud?
The first batch of our annotations turned out to be quite noisy and inconsistent, but we were getting better and better with each iteration.
Collectively we experienced the same teaching process we were putting our model through. The more examples we looked at, the better we became at recognising different cases. Yet the question remained – if it is difficult for a human to make these decisions, can we teach a machine to cope with this task?