Watch the video on YouTube
Episode 3 – ‘Trust me, I’ve got the data’
Summary
Open the news lately and it’s a buffet of statistics. Everyone has got data. Politicians with bar charts, campaigners with pie charts and influencers on TikTok with more graphs than a GCSE math revision week. You can’t move for claims that the data proves it.
“Listen to the science”, “We know that…”, “There are studies that show…” But half the time, it’s about as trustworthy as a horoscope written by ChatGPT after three pints.
Now don’t worry, let me just make it clear, I’m not about to go political. We’re not naming sides or flags or colours here. But it’s impossible to miss that politics everywhere, from Westminster to Washington, has discovered the intoxicating power of a well-timed statistic, especially one that sounds good in ten words or less. It’s a bit like nationalism’s new best friend: the graph that says, look how great we are, look how wonderful our country is, it’s the best in the world, or we’re going to make it the best in the world. Conveniently cropped before the part that isn’t actually that great or is actually impossible to achieve.
And while we’re at it, a quick shout out to the other data experts of our time, the Internet’s most persistent theorists. Do you know the ones that I’m talking about? The ones that know and can provide all sorts of facts and proof and evidence that the Earth is flat, or the Moon is fake, or the clouds are laced with mind-control perfume, or even better, birds are not real. They’re absolute pioneers in creative correlation.
Listen, I’m going to poke a little bit of fun sometimes, but I want to be clear, I’m not taking any sides. We’re talking facts today. We’re talking about how dodgy data happens, how to spot it, and why it matters whether you’re reading a political manifesto or maybe a Facebook post shared by your weird Uncle Barry.
So if you were to go and do a bit of a web search through your favorite web browser or your search engine, and you searched for something called funny statistical correlations, you’ll come up with a ton. There’s one of my favorites, I will link it in the show notes, it is called, I’m just pulling it up, tylervigen.com. It’s absolutely hilarious. From the front screen, it shows, quite clearly, that Associate degrees awarded in performing arts correlate almost exactly with people Google searching for zombies. Or even further on down, you can see that the annual US household spending on eggs correlates pretty well with Amazon Electric Co’s stock price. And if you didn’t know more about it, if you didn’t know it was making fun, you might look at it and say, well, what have Amazon Electric got to do with the price of eggs? Or even the popularity of the first name Tyler compared to the amount of petrol used in Italy. The fewer people who are born and named Tyler in the world, the less petrol Italy uses.
These things are called spurious correlations. They are correlations that look, on the surface, as if there is a definite link between one variable and another variable. It’s a masterclass in how numbers can actually tell you absolutely nothing but still look incredibly convincing. What I’m talking about here is something called correlation versus causation. And in a nutshell, what that means is that just because two things look like they move together doesn’t mean that one causes the other. Otherwise we’d have to start calling people Tyler again just to keep the Italian automotive industry going.
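For anyone who likes to see it in code, here is a minimal Python sketch of how a spurious correlation can arise. The two lists are invented purely for illustration (they are not the figures from tylervigen.com), and the `pearson` helper is a hand-written version of the standard Pearson correlation coefficient. The only thing the two series share is that both happen to fall over time.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term and the two spread terms (all left unnormalised,
    # because the 1/n factors cancel in the final ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(spread_x * spread_y)


# INVENTED numbers, one value per year, for illustration only.
babies_named_tyler = [100, 92, 85, 80, 71, 66, 58, 50]
italian_petrol_use = [40.0, 38.5, 36.0, 35.5, 32.0, 31.0, 28.5, 26.0]

r = pearson(babies_named_tyler, italian_petrol_use)
print(f"correlation between Tylers and petrol: {r:.2f}")
```

The coefficient comes out very close to 1, yet nothing in the calculation knows or cares whether one thing causes the other. Any two series that drift in the same direction will score highly.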
So let’s zoom out, let’s talk about the media, because do you know what? Sometimes data gets dodgy because of mistakes like bad sampling or missing context. Sometimes it’s cherry-picked figures or deliberate framing. For example, “90% of people support our policy.” Sounds really impressive until you find out that they actually only asked 10 people and they were all from the same office, one of whom thought they were being asked if they wanted a cup of tea.
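To put a rough number on why ten people is not enough, here is a small Python sketch. It uses the Wilson score interval, which is my choice of method rather than anything from the episode, to estimate a 95% range around "9 out of 10 said yes". It assumes the people asked were a random sample, which an office full of colleagues clearly is not, so treat the result as a best case.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    z=1.96 is the usual normal-distribution value for 95% confidence.
    """
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# The "90% support our policy" claim, first with 10 people, then with 1,000.
small_low, small_high = wilson_interval(9, 10)
large_low, large_high = wilson_interval(900, 1000)

print(f"10 people:    {small_low:.0%} to {small_high:.0%}")
print(f"1,000 people: {large_low:.0%} to {large_high:.0%}")
```

With ten respondents the plausible range runs from roughly 60% to 98%, so the headline "90%" is carrying far more confidence than the data supports. With a thousand respondents the range tightens to within a couple of points of 90%. Neither calculation fixes a biased sample.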
Even simple design choices can mislead. What we call a truncated y-axis on a chart can turn a 1% rise into a skyscraper. Add a bit of 3D shading and some patriotic music and you’ve got yourself an emergency news graphic. The key thing, the one thing that I want you to come away with, dear listener, is this: if the statistic that you’re looking at makes you go, wow, it’s worth asking, hmm…
Now again, I’m going to reiterate, just in case anybody takes me out of context: I’m not taking sides. I’m not going to take any sides whatsoever. But politics is a gold mine for this stuff. And I mean, let’s be clear, let’s be balanced: it’s on all sides. Politicians of whatever colour, whatever side, whatever history, they all do this. They all do this.
So let me give you a couple of examples. Do you remember in the UK those Brexit charts? One was on the side of a bus, in fact, saying that when Brexit happened, we would save £350 million a week. Turns out that that figure was a creative accounting exercise. But equally, just for the sake of balance, the counterclaims about instant economic collapse were, let’s say, somewhat premature. When you look at messages like that coming out of political campaigns, or actually around any issue that is going to impact the general populace in one way or another, you’ll often get people turning round and saying, this side is lying, no, that side is lying, or this side is being spurious with their data, or something or other.
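Going back to the truncated y-axis for a moment, the arithmetic behind the trick fits in a few lines of Python. This is only a sketch: the function name `visual_ratio` and the values 100, 101 and 99 are made up for illustration. It compares how tall two bars are drawn depending on where the y-axis starts.

```python
def visual_ratio(old_value, new_value, axis_start):
    """How many times taller the new bar is drawn than the old bar
    when the y-axis begins at axis_start instead of zero."""
    return (new_value - axis_start) / (old_value - axis_start)


# A 1% rise, from 100 to 101 (illustrative values).
honest = visual_ratio(100, 101, axis_start=0)      # axis starts at zero
truncated = visual_ratio(100, 101, axis_start=99)  # axis starts at 99

print(f"axis from 0:  new bar is {honest:.2f}x the old bar")
print(f"axis from 99: new bar is {truncated:.2f}x the old bar")
```

With the axis starting at zero the second bar is 1.01 times the height of the first, which matches the real 1% rise. Start the axis at 99 and the second bar is drawn twice as tall, even though the underlying numbers have not changed.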
Now what I’m going to say might sound a little bit controversial, but please stick with me. Data is very rarely outright false. It’s not the data that is actually wrong. It’s just how it’s been dressed. Just like when somebody says, unemployment is down, but forgets to mention that the only reason the number is down is because the definition of what unemployment means was changed the previous year. That’s what I mean about framing. A statistic can be perfectly true and still completely misleading if you pick your timescale or your sample or your comparison cleverly enough. And this is something that all political parties on all sides do. And actually, let’s be clear, it’s not just politicians and governments that do this, just in case anyone thinks I’m calling for anarchy or something along those lines. Corporations do this as well. Businesses do this all the time. They’ll put together some facts and some figures for the simple purpose of supporting and reporting the message that they want to give.
So that’s what we’re talking about. Correlation, causation, and also framing. The data is not wrong. It’s how it is translated and how it is communicated that can be the issue. And in today’s world of things like tweets and reels and memes, nuance doesn’t really trend; simple emotional claims do. So the complex data often gets squashed into black and white statements. I’ll give a classic example. I said I wasn’t going to talk about anything specific here, but recently, in the news, you can’t get away from it. If you listen to this quite soon after I’ve recorded it, it’s in the news at the moment: a study that shows, without any shadow of a doubt, that a particular medicine, if taken while pregnant, had a very high correlation with neurodivergence in children. And if you listen to one side talking about it, they will talk about how “studies have shown” and claims have been made, and they’ll give you numbers and they’ll give you figures. And then you listen to the other side, the people who are representing opposition to this point, and they’ll also provide numbers and they’ll also provide figures.
Who’s right? Who’s to say? I think it’s one of those issues that is still very much up in the air. But what I can tell you, when I look at the arguments of both sides, is that both of them use exactly the same trick: putting their message across in such a way as to make the quick reader, the headline reader, believe their point of view. So again, let me just put this out there before anyone reaches for the pitchforks: this isn’t really about politics. Every side does it. Corporations, charities, campaigners. Do you know what? Even your mate tries to do this when they justify why pineapple belongs on pizza.
Spoiler alert, it does. But everyone wants the data to back them up. And that’s why, dear friends, learning to spot dodgy visuals and missing context is becoming a bit of a survival skill. Because no one’s immune, not governments, not companies, and definitely not those algorithms of social media sites or websites or web searches, some of which may even have brought you guys to me. So, how do we start building trust in data again? Whether you’re running a business, or you’re just a person sitting at home on the sofa scrolling through your feed, the same rules apply. The general view is about ethical data handling. So be transparent. If you show a number, say where you’ve got the number from. Just actually point out where it’s come from. Don’t just say “some studies have shown”. Don’t just say, “oh, there was this article that basically proved such and such”. Where has it come from? Just be completely transparent. Also, be reproducible. If somebody else was to follow your line of thinking, would they get exactly the same result? And the other thing as well is that you should document everything.
You should write everything down. If you actually dig into some of these studies and some of these papers, some of which I’ve actually referred to on one of my recent blogs, um… Let me just rewind a little bit, because I did put a blog out on my website last week that talks about an MIT study. So MIT, the famous educational institution, has published a paper that suggests that overuse of AI can actually damage your brain. Now, if I was to say that as a headline, everyone goes, right, “I knew that ChatGPT was evil”, “I knew that Gemini was going to come and kill us one day.” And that’s the headline. But you would actually have to go and read the entire paper, or at least the majority of it, to understand exactly what it is trying to say. And the good news is, folks, if you’re still using ChatGPT to go and get your biscuit recipes, honestly, you’re fine. Your brain’s not going to go and shrivel away and die just yet. So anyway, that’s been linked on my website. Go and have a look at the blog, datawithduke.com. You will see it there. “Is AI killing our brains” is the name of the blog. Go and have a read of it.
Anyway, back to the point. So I was talking about being transparent, being reproducible, and documenting everything, in order to hit all of those cornerstones of ethical data communication. In industry, this is part of what we call data governance. And data governance basically covers everything that data touches. It’s why analysts keep version control, engineers use audit trails, and organizations follow frameworks like, well, in the UK, we have the Data Protection Act 2018, which was built upon the General Data Pro- Oh, I’ve forgotten it. The General Data Protection Regulation, which, yes, does actually give you as a person rights over your own personal data.
Listen, I’m going to come back to that in a future episode. We’re going to dive properly into those. But here’s the short version: if you’re in the UK, or if you were in an EU member state when GDPR took effect in 2018, you have a right to request what data companies hold on you. You can also ask them to delete it, and you can even challenge them on how it is used. Right? Now you might not be able to stop the algorithm showing you dog memes, but at least you can understand why they’re doing it. So just have a think about that. The Data Protection Act and the General Data Protection Regulation work side by side. In fact, the Data Protection Act is GDPR encapsulated into UK law. So they do sort of work together. But yeah, they give you rights as a data subject. More on that in a future episode.
So, what I’m talking about is that correlation does not equal causation. Just because the numbers look like they are saying something doesn’t mean that they are. What I’m suggesting that you do, ladies, gentlemen and legends, is to start questioning where those numbers came from. Actually, where did they come from? And with a little bit of additional Googling skills or web search skills, whatever your preferred search engine is, you can normally find out where the numbers actually come from. Of course, the difficulty is finding the time to be able to do it, which is why so many of us do rely on the media. What I’m asking you, ladies and gentlemen, is where is the media getting their numbers from? What message are they trying to give? So have a think about that over this next week or so. Next time you see a stat in the news or on social media or even potentially coming out of a politician’s mouth, just take a second. Just take a couple of seconds to ask yourself three questions.
- Number one, where does this data come from?
- Number two, what’s missing from what they are telling us?
- And number three, does it actually even make sense? Is there context that is needed to understand this information?
If you want to have some real fun, see if you can share the most ridiculous one that you find, whether it’s a conspiracy meme, a chart shaped like Mount Doom, or a headline that confidently states something completely impossible. Stick it in the comments on YouTube. Send me an email through the website, datawithduke.com. And I’ll share funny examples in a future episode. Because, listen folks, at the end of the day, this isn’t about arguing politics. This isn’t about arguing as to which organization is better or which side is better than the other. It’s about being fair and it’s about being curious. And it’s about making sure that we do not get played by the numbers. After all, data doesn’t lie. But the people using it definitely can. So have a think about that. I look forward to hearing from you folks. Next time, we’re going to talk about this term, data steward. My next one is going to be called “Everyone is a data steward”: how every job, from a barista to a CEO, involves handling data responsibly. Until then, folks, question everything, especially anyone who says, “Trust me, I’ve got the data.”
