How The Guardian is using spaCy and Prodigy to identify quotes in text

A great article on how The Guardian is using spaCy and Prodigy to build a model to identify quotes in news articles :sparkles: Highlights include: why it's so important to iterate on your data and carefully develop custom annotation schemes to deal with ambiguity in language, plus a very pretty custom UI theme for Prodigy!

The main challenge in building the training dataset was navigating the ambiguity of different journalistic styles. For several days, we discussed dozens of cases where it was difficult to make the right choice.

How should we treat song lyrics or poems? What about messages on placards? What if someone quotes their thoughts, something that has not been said aloud?

The first batch of our annotations turned out to be quite noisy and inconsistent but we were getting better and better with each iteration.

Collectively we experienced the same teaching process we were putting our model through. The more examples we looked at, the better we became at recognising different cases. Yet the question remained – if it is difficult for a human to make these decisions, can we teach a machine to cope with this task?


Love that they also customised the UI to the brand :smile:


Excellent. The article has given me a better idea of what to do with the project I am working on. Our quotations will be numerous and perhaps long, and they appear inside court decisions:

  1. Long passages from other court decisions (precedents), that is, decisions considered as authority for deciding subsequent cases involving identical or similar facts.
  2. Short passages of doctrine from legal authors.
  3. Short passages quoting articles of a particular law.

We are building an AI Legal Assistant on Constitutional Law and decided to combine OpenAI GPT-3 and Prodigy.


Awesome!

Congratulations to the team on your excellent tools (both spaCy and Prodigy), glad to see them get the attention they deserve!

If you have any other similar articles or mentions of Prodigy in the industry, please share them as well.
Keep up the great work!
Cheers
