#274: Real Talk About Synthetic Data with Winston Li

Update: 2025-06-24

Description

Synthetic data: it’s a fascinating topic that sounds like science fiction but is rapidly becoming a practical tool in the data landscape. From machine learning applications to safeguarding privacy, synthetic data offers a compelling alternative to real-world datasets that might be incomplete or unwieldy. With the help of Winston Li, founder of Arima, a startup specializing in synthetic data and marketing mix modelling, we explore how this artificial data is generated, where its strengths truly lie, and the potential pitfalls to watch out for!

Articles, Events, and a Paper Mentioned in the Show

(Article) How to Pick a College Major in an AI-First World by Cassie Kozyrkov

(Event) The ANA Measurement & Analytics Conference

(Event) Chicago MeasureCamp 2025

(Paper) Large Language Models are as persuasive as humans, but how? by Carlos Carrasco-Farre

Photo by Anton Shuvalov on Unsplash

Episode Transcript

00:00:05.75 [Announcer]: Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language.

00:00:15.13 [Michael Helbling]: Hey everyone, welcome. It’s the Analytics Power Hour, and this is Episode 274. Today we’re diving into a topic that sounds like maybe it came from a sci-fi script, but it’s actually very much part of the real-world data landscape. That’s right, synthetic data. You know, whether you’re using it for machine learning, protecting privacy, or just giving your dashboard something to chew on when the real data won’t play nice, it’s definitely having a moment in our industry. And unlike original data, it won’t ghost you with missing values or weird outliers or those inexplicable rows that look like someone fell asleep on the keyboard. We’ll talk about how it’s made, where it shines, and where it might fall short. So whether you’re deep in data science or just data curious, I think this podcast will be for you. And it’ll be kind of like synthetic data, hopefully generated with purpose and surprisingly useful. But first, let me introduce my co-hosts. Val Kroll, how are you going? Or how are you doing? I’m so used to introducing you.

00:01:14.75 [Val Kroll]: Yeah, I’m not Moe.

00:01:16.15 [Michael Helbling]: I know. How are you?

00:01:18.64 [Val Kroll]: I’m doing good, Michael.

00:01:20.70 [Michael Helbling]: Happy to be here. And also joined by Julie Hoyer. Julie, welcome. How are you doing?

00:01:27.03 [Julie Hoyer]: I’m doing great. I cannot wait to talk about this topic.

00:01:29.98 [Michael Helbling]: I know. I’m excited as well. And I’m Michael Helbling. And so for this show, we absolutely needed a guest. Winston Li is the founder of Arima, a startup specializing in synthetic data and marketing mix modeling. Prior to that, he led data science teams at PwC Canada and Omnicom Media Group. He’s also a lecturer at Northeastern University and sits on their program advisory committee for the Masters in Analytics. And today he is our guest. Welcome to the show, Winston.

00:01:57.83 [Winston Li]: Thank you. Thank you. Great to meet you. Great to meet everyone. And thanks, Michael.

00:02:01.98 [Michael Helbling]: Awesome. Well, I think a great place to start on this topic of synthetic data is really just to talk about what it is. So, you know, how would you define synthetic data and what makes it fundamentally different from anonymized or sample data, let’s say?

00:02:22.28 [Winston Li]: Yeah, very good question. So synthetic data, to put it simply, is data sets that are generated by an algorithm as opposed to being collected from some sort of real event. So specific to consumer data, which is the space that we work in, we’re not going out to conduct surveys. We’re not tracking people. We’re not asking people to provide anything to sign up. None of that. Synthetic data simply means we develop a computer algorithm where we could generate data in a way that mimics real data, so to say. So we’re not just randomly spreading out numbers, we’re generating data based on certain patterns. Obviously, there are two things that we should note. One, we’re not making up data. A lot of people, when they say synthetic data, they think of the word fake. They think it’s fake data. It’s not. And the algorithms that we use to generate synthetic data are indeed trained on real data. So it is based on learnings of patterns from real data that we generate synthetic data. So the first point is that it’s not fake. It’s just like real data. And it is very useful for statistical analysis. We’ll get to the whole discussion of privacy a little later on, which is the main motivation of synthetic data. But from a utility standpoint, in theory, it should be as useful as real data. The other thing I also want to point out is there’s actually a lot of synthetic data in our day-to-day lives that we simply don’t realize is synthetic data. We call them something different, but they are, in fact, along the same lines, with the same motivation, so to say. If you imagine something like Midjourney, you know, we’re generating synthetic images. If you consider an image to be data, then synthetically generated images like synthetic faces or, you know, synthetic pictures of, you know, different places, animals, sceneries, whatever, that’s a form of synthetic data too.
We’re simply generating synthetic pixels, so to say, but based on the patterns so that they look like a picture in the end. Same thing with, you know, even ChatGPT or Gemini. Again, if you consider words to be data, that too is a form of synthetic data. So the fact that we use computers to generate some form of information, let’s say, based on learnings from real information is much more common than we see and is much more common than we recognize. And broadly speaking, that is synthetic data.

00:05:14.51 [Val Kroll]: I was 100%, I’ll admit, in the camp of thinking synthetic data was just, like, something materialized out of thin air. But I have to say that I did have the benefit of getting to see you present, Winston, at MeasureCamp New York slash New Jersey a couple of months ago, which was a fantastic presentation. I learned a ton about it. I think one of the other questions I’d love to ask you as we’re kicking this off is, like, what are some of the really common use cases that, you know, people within our field are using synthetic data for? Like, what problems is it really solving?

00:05:51.62 [Winston Li]: Yeah. In some ways, synthetic data does not pertain to new use cases, so to say. It’s not like there is something synthetic data can do that real data cannot do. People consider synthetic data more as a way to, let’s say, be able to do things that, you know, privacy laws don’t otherwise allow them to do. So in some sense, you know, people are doing certain things with synthetic data because trying to do the same thing with real data, while technically possible, is from a procurement or from a legal standpoint very, very difficult. So there’s actually nothing special about synthetic data other than the fact that it is.

00:06:36.86 [Val Kroll]: So should we just wrap it up here?

00:06:42.09 [Winston Li]: I don’t want people to think synthetic data is fundamentally different from real data in a way where they have to pick one or the other. The best analogy I can think of is to think of it like a photocopied document, if you will. You have an original document. You don’t want to use it. You’re scared to lose it for various reasons. You make a photocopy of it. That photocopied version will bring enough use, bring enough utility, just like the original document. It’s much safer to work with the photocopied version because you can write on it. You’re not afraid that you’re going to lose it. In a data scenario, obviously, you don’t have to worry about people suing you

Michael Helbling, Tim Wilson, Moe Kiss, Val Kroll, and Julie Hoyer
