Episode 65: Scaling Data Pipelines with Nick Schrock, Founder/CTO of Dagster Labs
Intro
Mike Schwartz: Hello and welcome to Open Source Underdogs! I’m your host Mike Schwartz, and this is episode 65 with Nick Schrock, Founder and CTO of Dagster, a platform that helps companies create data pipelines, which are critical for transforming and updating data in order to make it useful, for example, to generate reports, content, or other actionable information.
Dagster might not be a blueprint you can emulate. As with all start-ups, there was some hard-to-replicate serendipity that enabled Nick and his team to build this amazing company. But as Machiavelli says, “Great leaders need both – fortune and virtue.” In other words, you need to be good at what you do, i.e. virtue, but you also need some good old-fashioned luck.
But what separates really successful founders, like Nick, is the ability to harness fortune and virtue, combine them with some deep insights about the market, and turn that into a profitable and fast-growing venture, which is not easy to do.
So, with that said, let’s cut to the interview, and let Nick tell you, in his own words, how Dagster evolved.
Early Career
Mike: Nick, thanks for joining us today.
Nick Schrock: Great to be with you.
Mike: Can I just go back a little bit and ask you to share some of your story about how you ended up going from computer science at the University of Michigan to working at Facebook? So, that early period – how that happened?
Nick: Oh, I wasn’t expecting to talk about the pre-Facebook days. I’ll do the quick version of that. I graduated from Michigan in 2003, and I actually went to work at Microsoft, right out of school. And Microsoft’s a great company, and they treated me well, but…
And actually, the division I was in was the developer division. And I thought that they were just extraordinarily talented, but at that time of my life, that wasn’t for me, in terms of working at a big company.
I wasn’t actually sure if I wanted to do software anymore, so I went to the London School of Economics for a year, because I thought I might want to go more into finance, or even government service – you know, I was a young man kind of searching around.
But I ended up getting back into software. I worked for a healthcare start-up out of Ann Arbor, which is where Michigan is, for what – 2 and a half years.
And then, I went to Chicago to try to do a start-up. That was very quickly spun down because me and a friend, who had worked in the finance industry, we wanted to do it, but then, it was about 6 months before the financial crisis.
So, that was incredibly poor timing. I spun that down, and actually, it turns out a friend of mine, who I knew from Microsoft, kind of heard that I was on the open market, and he just reached out and was like, “Hey, I’m working at Facebook, it’s really a special place. You should consider looking at it.”
And I was looking at staying in finance in the Chicago area. And I flew out to Facebook, and it’s just the vibe difference between a place like Facebook and a hedge fund in Chicago cannot be overstated.
You know, everyone at Facebook was young, super excited, idealistic, the office was incredible – there was just all this energy versus all these miserable people working in the hedge fund. So, the choice was obvious from there. And then, off to the races after that.
Why was Facebook so innovative in 2009-2015?
Mike: So, what was it about Facebook in 2009 that made it such a hotbed of innovation? Like, what new problems were they trying to solve?
Nick: The engineering-driven culture there, combined with the actual product that was being built. So, the product grew at unprecedented rates, it was used in unprecedented ways and was data intensive also, in kind of an unprecedented way.
We were forced to kind of do a lot of innovation on the fly in incredibly constrained environments actually, both in terms of resources, timing – you know, we had to get stuff to work. And I think that it is true that those constraints do breed innovation.
And that period of time was interesting because in 2009 – how to put this – we weren’t really taken seriously as an engineering organization, I felt. And then, fast forward say 4 to 6 years, and we were taken very seriously as an engineering organization.
It was really cool to participate in that. And in the end, if you look at the output from that eng org at that time, it really is pretty extraordinary in terms of what systems were built internally as well as what was open-sourced.
Technical Origin
Mike: So, a few years back in 2018, after being at Facebook for, I guess, maybe 8 or 9 years, you decide to start a company called Elementl, which becomes Dagster Labs. Can you talk a little bit about how that came about?
Nick: Near the beginning of my tenure at Facebook, I helped create this team called Product Infrastructure, whose mission was to make our application developers more efficient and productive. So, concretely what that meant is that we built internal frameworks and abstractions for the engineers who actually built the site and the mobile apps to build product.
That team did a lot of great work, and we ended up externalizing a bunch of that work in the form of open source. So, React came out of that group – I had nothing to do with React, but kind of the people across the hall from me, so to speak, produced React. And that obviously went on to be an extremely successful open-source framework. And then, what I’m personally more affiliated with is, I’m one of the co-creators of GraphQL.
I’ve lived and breathed developer tools for a long time and also seen the impact that open-source adoption at scale can have. So, that was definitely on my mind when I left Facebook in 2017 and was figuring out what to do next.
And in fact, I was going around the Valley and talking to companies, both inside and outside the Valley actually, about what their biggest technical liabilities were.
And this notion of data and ML infrastructure kept on coming up over and over and over. And I decided to dig into this, and very quickly I discovered that this area kind of pattern matched to what I care about and the types of problems I want to work on. Typically, the things I like to work on share a bunch of properties.
One is just engineers in pain. Like their dev workflow is broken, they have bad abstractions, they’re not productive, and purely because of tooling and abstraction reasons – that actually kind of makes me angry and frustrated on their behalf. And on a personal level, I feel that is really motivating.
The second involves finding what I like to call “a problem that matters”. I like working on really broad horizontal problems that could potentially have impact on millions of developers, kind of core essential problems that matter.
I was data engineering adjacent at Facebook, I wasn’t a practitioner. Data pipelining is extraordinarily important actually. People like to dismiss it as data cleaning, or as kind of data janitor work, but when I looked at it from kind of a fresh perspective and I really thought about it, I was like, listen, data pipelines, they produce these assets, these data assets that drive all analytics, all the dashboards that you work with, all the ML models.
And if you really think about it, these data assets drive a huge proportion of human decision-making and automated decision-making in our entire society. Who gets mortgages or not, how do we price health care, what kind of news do you see – these are fundamental essential things, and it needs to be built on solid foundations.
And the fact that – in my opinion – it was not built on the appropriate tools and processes, and everyone felt it was like chaotic and out of control all the time, was deeply disturbing. So, things were fundamentally, and still, in some ways, are fundamentally broken in data and ML engineering. So, that’s really motivating.
Another thing, another property is that I like working on technologies that are sort of a strategic point of leverage in an organization. GraphQL fits that bill. Because if you can kind of intermediate all client-server interactions with a common software layer that has rich schema information and stuff like that, it’s like an enormous point of leverage for tooling.
And in the data space, I quickly gravitated towards the orchestration layer because I felt it had the same properties. You know, orchestration orchestrates data pipelines. That means it invokes every single runtime, and it touches every single storage system as a result. And then, likewise, any practitioner that wants to put a data asset or pipeline into production has to interact with the orchestrator in some way, shape, or form. So, a strategic point of leverage, I thought that was super, super interesting.
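To make the orchestration idea concrete, here is a minimal sketch in Python of what an orchestrator does at its core: it knows which steps depend on which, and it invokes each step only after its upstream steps have run. This is an illustration only, not Dagster's API; the function `run_pipeline` and the step names (`raw_orders`, `cleaned_orders`, `daily_revenue`) are hypothetical, and the data is made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, deps):
    """Run every task once, after all of the tasks it depends on.

    tasks: maps a step name to a callable (hypothetical steps).
    deps:  maps a step name to the list of upstream step names.
    """
    results = {}
    # static_order() yields each step after all of its upstream steps.
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        upstream = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](*upstream)
    return order, results


# Hypothetical three-step pipeline: ingest, clean, aggregate.
tasks = {
    "raw_orders": lambda: [12.0, 8.5, 30.0],
    "cleaned_orders": lambda raw: [x for x in raw if x >= 10],
    "daily_revenue": lambda cleaned: sum(cleaned),
}
deps = {
    "raw_orders": [],
    "cleaned_orders": ["raw_orders"],
    "daily_revenue": ["cleaned_orders"],
}

order, results = run_pipeline(tasks, deps)
print(order)
print(results["daily_revenue"])
```

Because every step passes through `run_pipeline`, that single layer sees every computation and every intermediate result, which is the "point of leverage" property described above. A production orchestrator adds scheduling, retries, and storage on top of this basic idea.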
And then last, like some feeling that you have a technical insight that’s novel and interesting, and that’s kind of how we got to this notion of – at the beginning we called it Software Structure Data Sets, but now we call it Software-defined Assets in data pipelines.
And the basic idea is that instead of just writing a bunch of imperative tasks to string stuff together, you instead think about it, you write a software representation of the data asset that you end up wanting to ship to produc