Linear Digressions

Author: Ben Jaffe and Katie Malone

Subscribed: 5,440
Played: 93,363

Description

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.
249 Episodes
Kalman Runners

2019-10-13 00:15:59

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. IMPORTANT NON-DATA SCIENCE CHICAGO MARATHON RACE RESULT FROM KATIE: My finish time was 3:20:17! It was the closest I may ever come to having the perfect run. That’s a 34-minute personal record and a qualifying time for the Boston Marathon, so… guess I gotta go do that now.
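To make the idea concrete, here is a minimal sketch of a one-dimensional Kalman filter smoothing a runner's noisy per-mile pace readings. It is an illustration, not something taken from the episode: the split times, the random-walk model of the true pace, and the noise variances are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model (assumed)."""
    estimate: float          # current best guess of the hidden state
    variance: float          # uncertainty of that guess
    process_var: float       # how much the true state may drift per step
    measurement_var: float   # noise variance of each measurement

    def update(self, measurement: float) -> float:
        # Predict: the state is assumed to stay put, but uncertainty grows.
        prior_var = self.variance + self.process_var
        # Update: the gain weighs the new measurement against the prediction.
        gain = prior_var / (prior_var + self.measurement_var)
        self.estimate += gain * (measurement - self.estimate)
        self.variance = (1.0 - gain) * prior_var
        return self.estimate


def filter_series(measurements, initial_estimate, initial_variance=1.0,
                  process_var=0.01, measurement_var=0.25):
    """Run the filter over a sequence and return the smoothed estimates."""
    kf = ScalarKalman(initial_estimate, initial_variance,
                      process_var, measurement_var)
    return [kf.update(m) for m in measurements]


# Hypothetical noisy pace readings, in minutes per mile.
splits = [7.9, 7.4, 7.8, 7.5, 7.7, 7.6]
smoothed = filter_series(splits, initial_estimate=splits[0])
```

Each estimate is a weighted blend of the previous estimate and the newest reading, so the smoothed series swings less than the raw splits. Raising `measurement_var` makes the filter trust the readings less; raising `process_var` lets it follow real pace changes faster.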
Feature engineering is ubiquitous but gets surprisingly difficult surprisingly fast. What could be so complicated about just keeping track of what data you have, and how you made it? A lot, as it turns out—most data science platforms at this point include explicit features (in the product sense, not the data sense) just for keeping track of and sharing features (in the data sense, not the product sense). Just like a good library needs a catalogue, a city needs a map, and a home chef needs a cookbook to stay organized, modern data scientists need feature libraries, data dictionaries, and a general discipline around generating and caring for their datasets.
If you’re a data scientist or data engineer thinking about how to store data for analytics uses, one of the early choices you’ll have to make (or live with, if someone else made it) is how to lay out the data in your data warehouse. There are a couple common organizational schemes that you’ll likely encounter, and that we cover in this episode: first is the famous star schema, followed by the also-famous snowflake schema.
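As a rough illustration of the star layout, the sketch below builds a tiny in-memory warehouse with Python's built-in `sqlite3` module: one central fact table of race results, surrounded by two dimension tables. The table names, columns, and rows are hypothetical and chosen only for this example.

```python
import sqlite3


def build_star_schema():
    """Create a small star schema in memory (all names and rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables: descriptive attributes, one row per entity.
        CREATE TABLE dim_runner (
            runner_id INTEGER PRIMARY KEY, name TEXT, age_group TEXT);
        CREATE TABLE dim_race (
            race_id INTEGER PRIMARY KEY, race_name TEXT, city TEXT);
        -- Fact table: one row per measured event, keyed to the dimensions.
        CREATE TABLE fact_result (
            runner_id INTEGER REFERENCES dim_runner(runner_id),
            race_id INTEGER REFERENCES dim_race(race_id),
            finish_seconds INTEGER);
        INSERT INTO dim_runner VALUES (1, 'Runner A', '30-34'),
                                      (2, 'Runner B', '35-39');
        INSERT INTO dim_race VALUES (1, 'Race One', 'Chicago'),
                                    (2, 'Race Two', 'Boston');
        INSERT INTO fact_result VALUES (1, 1, 12000), (2, 1, 13000),
                                       (1, 2, 12600);
    """)
    return conn


def average_finish_by_city(conn):
    """Typical analytics query: join the fact table to one dimension."""
    rows = conn.execute("""
        SELECT r.city, AVG(f.finish_seconds)
        FROM fact_result AS f
        JOIN dim_race AS r ON f.race_id = r.race_id
        GROUP BY r.city
        ORDER BY r.city
    """).fetchall()
    return dict(rows)


conn = build_star_schema()
averages = average_finish_by_city(conn)
```

A snowflake variant of the same data would normalize the dimensions further, for example by moving `city` out of `dim_race` into its own table, at the cost of an extra join per query.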
Data scientists and software engineers both work with databases, but they use them for different purposes. So if you’re a data scientist thinking about the best way to store and access data for your analytics, you’ll likely come up with a very different set of requirements than a software engineer looking to power an application. Hence the split between analytics and transactional databases—certain technologies are designed for one or the other, but no single type of database is perfect for both use cases. In this episode we’ll talk about the differences between transactional and analytics databases, so no matter whether you’re an analytics person or more of a classical software engineer, you can understand the needs of your colleagues on the other side.
There are a few things that seem to be very popular in discussions of machine learning algorithms these days. First is the role that algorithms play now, or might play in the future, when it comes to manipulating public opinion, for example with fake news. Second is the impressive success of generative adversarial networks, and similar algorithms. Third is making state-of-the-art natural language processing algorithms and naming them after muppets. We get all three this week: GROVER is an algorithm for generating, and detecting, fake news. It’s quite successful at both tasks, which raises an interesting question: is it safer to embargo the model (like GPT-2, the algorithm that was “too dangerous to release”), or release it as the best detector and antidote for its own fake news?
Relevant links:
https://grover.allenai.org/
https://arxiv.org/abs/1905.12616
When a big, established company is thinking about their data science strategy, chances are good that whatever they come up with, it’ll be somewhat at odds with the company’s current structure and processes. Which makes sense, right? If you’re a many-decades-old company trying to defend a successful and long-lived legacy and market share, you won’t have the advantage that many upstart competitors have of being able to bake data analytics and science into the core structure of the organization. Instead, you have to retrofit. If you’re the data scientist working in this environment, tasked with being on the front lines of a data transformation, you may be grappling with some real institutional challenges in this setup, and this episode is for you. We’ll unpack the reason data innovation is necessarily challenging, the different ways to innovate and some of their tradeoffs, and some of the hardest but most critical phases in the innovation process.
Relevant links:
https://www.amazon.com/Innovators-Dilemma-Revolutionary-Change-Business/dp/0062060244
https://www.amazon.com/Other-Side-Innovation-Execution-Challenge/dp/1422166961
This is a re-release of an episode that originally aired on July 29, 2018.
The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one are really tough to verify, because even if the data suggests that people wearing the shoe are faster, that might be because of correlation, not causation, so I loved reading this article that went through an analysis of thousands of runners' data in 4 different ways. Each way has a great explanation with pros and cons (as well as results, of course), so be sure to read the article after you check out this episode!
Relevant links:
https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html
When data science is hard, sometimes it’s because the algorithms aren’t converging or the data is messy, and sometimes it’s because of organizational or business issues: the data scientists aren’t positioned correctly to bring value to their organization. Maybe they don’t know what problems to work on, or they build solutions to those problems but nobody uses what they build. A lot of this can be traced back to the way the team is organized, and (relatedly) how it interacts with the rest of the organization, which is what we tackle in this issue. There are lots of options about how to organize your data science team, each of which has strengths and weaknesses, and Pardis Noorzad wrote a great blog post recently that got us talking.
Relevant links:
https://medium.com/swlh/models-for-integrating-data-science-teams-within-organizations-7c5afa032ebd
Data Shapley

2019-08-19 00:16:55

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorithm for thinking about which examples in a dataset are most important. It makes a lot of intuitive sense: data that’s just repeating examples that you’ve already seen, or that’s noisy or an extreme outlier, might not be that valuable for training a machine learning model. But some data is very valuable: it’s disproportionately useful for the algorithm figuring out what the most important trends are, and Data Shapley is explicitly designed to help machine learning researchers spend their time understanding which data points are most valuable and why.
Relevant links:
http://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf
https://blog.acolyer.org/2019/07/15/data-shapley/
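One way to picture the idea is a small Monte Carlo sketch: shuffle the training examples many times, add them one at a time, and credit each example with the change in model performance it causes on arrival. The code below is an illustrative approximation under stated assumptions, not the paper's implementation: the 1-nearest-neighbor "model", the toy data (including one deliberately mislabeled point), and all function names are hypothetical.

```python
import random


def knn_accuracy(train, validation):
    """Utility: accuracy of a 1-nearest-neighbor classifier on 1-D points.

    Returns 0.0 when there is no training data (an assumed convention).
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in validation:
        _, predicted = min(train, key=lambda point: abs(point[0] - x))
        correct += predicted == y
    return correct / len(validation)


def monte_carlo_data_shapley(train, validation, utility,
                             num_permutations=200, seed=0):
    """Estimate each training example's value by averaging its marginal
    contribution to the utility over random orderings of the data."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    indices = list(range(len(train)))
    for _ in range(num_permutations):
        rng.shuffle(indices)
        subset = []
        previous = utility(subset, validation)
        for i in indices:
            subset.append(train[i])
            current = utility(subset, validation)
            values[i] += current - previous   # marginal contribution of i
            previous = current
    return [v / num_permutations for v in values]


# Toy data: (feature, label). The last training point is mislabeled on purpose.
train = [(0.0, 0), (0.1, 0), (1.0, 1), (1.1, 1), (0.05, 1)]
validation = [(0.02, 0), (0.12, 0), (0.95, 1), (1.2, 1)]
values = monte_carlo_data_shapley(train, validation, knn_accuracy)
```

By construction the values sum to the utility of the full training set minus the utility of the empty set, so they split the model's total performance among the examples. Points with low or negative values are candidates for a closer look as noisy or mislabeled data.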
This is a re-release of an episode that first ran on April 9, 2017.
In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert. Lidar? You betcha! Drive-by-wire? Of course! Probabilistic terrain reconstruction? Absolutely! All this and more this week on Linear Digressions.
Comments (4)

Li Lae

Thx so much, both. Please keep up the good work!

Mar 28th

Nisarg Shah

Katie and Ben, you both have transformed my journey to learn about machine learning, which seemed impossible before. thanks for taking the time to share your knowledge and providing a fun path to beginners (can only speak for myself :))! I hope you continue this endeavor! we truly appreciate it!

Jul 2nd

Vikram Kulkarni

Katie should do it by herself, the stupid co host is annoying.

Mar 27th