DiscoverLinear Digressions
Linear Digressions
Claim Ownership

Linear Digressions

Author: Ben Jaffe and Katie Malone

Subscribed: 5,234Played: 81,380
Share

Description

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.
241 Episodes
Reverse
Data Shapley

Data Shapley

2019-08-1900:16:553

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorithm for thinking about which examples in a dataset are most important. It makes a lot of intuitive sense: data that’s just repeating examples that you’ve already seen, or that’s noisy or an extreme outlier, might not be that valuable for using to train a machine learning model. But some data is very valuable, it’s disproportionately useful for the algorithm figuring out what the most important trends are, and Data Shapley is explicitly designed to help machine learning researchers spend their time understanding which data points are most valuable and why.Relevant links:http://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdfhttps://blog.acolyer.org/2019/07/15/data-shapley/
This is a re-release of an episode that first ran on April 9, 2017.In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert. Lidar? You betcha! Drive-by-wire? Of course! Probabilistic terrain reconstruction? Absolutely! All this and more this week on Linear Digressions.
In October 2005, 23 cars lined up in the desert for a 140 mile race. Not one of those cars had a driver. This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could do it the fastest); the winning car, Stanley, now sits in the Smithsonian Museum in Washington DC as arguably the world's first real self-driving car. In this episode (part one of a two-parter), we'll revisit the DARPA grand challenge from 2005 and the rules and constraints of what it took for Stanley to win the competition. Next week, we'll do a deep dive into Stanley's control systems and overall operation and what the key systems were that allowed Stanley to win the race.Relevant links:http://isl.ecst.csuchico.edu/DOCS/darpa2005/DARPA%202005%20Stanley.pdf
The modern scientific method is one of the greatest (perhaps the greatest?) system we have for discovering knowledge about the world. It’s no surprise then that many data scientists have found their skills in high demand in the business world, where knowing more about a market, or industry, or type of user becomes a competitive advantage. But the scientific method is built upon certain processes, and is disciplined about following them, in a way that can get swept aside in the rush to get something out the door—not the least of which is the fact that in science, sometimes a result simply doesn’t materialize, or sometimes a relationship simply isn’t there. This makes data science different than operations, or software engineering, or product design in an important way: a data scientist needs to be comfortable with finding nothing in the data for certain types of searches, and needs to be even more comfortable telling his or her boss, or boss’s boss, that an attempt to build a model or find a causal link has turned up nothing. It’s a result that often disappointing and tough to communicate, but it’s crucial to the overall credibility of the field.
Interleaving

Interleaving

2019-07-2200:16:54

If you’re Google or Netflix, and you have a recommendation or search system as part of your bread and butter, what’s the best way to test improvements to your algorithm? A/B testing is the canonical answer for testing how users respond to software changes, but it gets tricky really fast to think about what an A/B test means in the context of an algorithm that returns a ranked list. That’s why we’re talking about interleaving this week—it’s a simple modification to A/B testing that makes it much easier to race two algorithms against each other and find the winner, and it allows you to do it with much less data than a traditional A/B test.Relevant links:https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55https://www.microsoft.com/en-us/research/publication/predicting-search-satisfaction-metrics-with-interleaved-comparisons/https://www.cs.cornell.edu/people/tj/publications/joachims_02b.pdf
Federated Learning

Federated Learning

2019-07-1400:15:03

This is a re-release of an episode first released in May 2017.As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we do machine learning when the training dataset is distributed across many devices, imbalanced, and the usage associated with any one user needs to be obscured somewhat to protect the privacy of that user? Enter Federated Learning, a set of related algorithms from Google that are designed to help out in exactly this scenario. If you've used keyboard shortcuts or autocomplete on an Android phone, chances are you've encountered Federated Learning even if you didn't know it.
This is a re-release of an episode first released in February 2017.Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed. In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging.So, what to do? Well, at least in the case of studying Tea Party protest effectiveness, researchers have used rainfall, of all things, to understand the impact of a big protest. In other words, rainfall is the instrumental variable in this analysis that cracks the scientific case open. What does rainfall have to do with protests? Do protests actually matter? What do we mean when we talk about endogenous and instrumental variables? We wouldn't be very good podcasters if we answered all those questions here--you gotta listen to this episode to find out.
Deepfakes

Deepfakes

2019-07-0100:15:081

Generative adversarial networks (GANs) are producing some of the most realistic artificial videos we’ve ever seen. These videos are usually called “deepfakes”. Even to an experienced eye, it can be a challenge to distinguish a fabricated video from a real one, which is an extraordinary challenge in an era when the truth of what you see on the news or especially on social media is worthy of skepticism. And just in case that wasn’t unsettling enough, the algorithms just keep getting better and more accessible—which means it just keeps getting easier to make completely fake, but real-looking, videos of celebrities, politicians, and perhaps even just regular people.Relevant links:http://lineardigressions.com/episodes/2016/5/28/neural-nets-play-cops-and-robbers-aka-generative-adversarial-networkshttp://fortune.com/2019/06/12/deepfake-mark-zuckerberg/https://www.youtube.com/watch?v=EfREntgxmDshttps://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence/will-deepfakes-detection-be-ready-for-2020https://giorgiop.github.io/posts/2018/03/17/AI-and-digital-forgery/
The topic of bias in word embeddings gets yet another pass this week. It all started a few years ago, when an analogy task performed on Word2Vec embeddings showed some indications of gender bias around professions (as well as other forms of social bias getting reproduced in the algorithm’s embeddings). We covered the topic again a while later, covering methods for de-biasing embeddings to counteract this effect. And now we’re back, with a second pass on the original Word2Vec analogy task, but where the researchers deconstructed the “rules” of the analogies themselves and came to an interesting discovery: the bias seems to be, at least in part, an artifact of the analogy construction method. Intrigued? So were we…Relevant link:https://arxiv.org/abs/1905.09866
Attention in Neural Nets

Attention in Neural Nets

2019-06-1700:26:323

There’s been a lot of interest lately in the attention mechanism in neural nets—it’s got a colloquial name (who’s not familiar with the idea of “attention”?) but it’s more like a technical trick that’s been pivotal to some recent advances in computer vision and especially word embeddings. It’s an interesting example of trying out human-cognitive-ish ideas (like focusing consideration more on some inputs than others) in neural nets, and one of the more high-profile recent successes in playing around with neural net architectures for fun and profit.
loading
Comments (4)

Li Lae

Thx so much, both. Please keep up the good work!

Mar 28th
Reply

Nisarg Shah

Katie and Ben, you both have transformed my journey to learn about machine learning, which seemed impossible before. thanks for taking the time to share your knowledge and providing a fun path to beginners (can only speak for myself :))! I hope you continue this endeavor! we truly appreciate it!

Jul 2nd
Reply

Vikram Kulkarni

Katie should do it by herself, the stupid co host is annoying.

Mar 27th
Reply

m a

Vikram Kulkarni the idea is them having a dialogue and asking all questions a novice listener might have.

Apr 18th
Reply
loading
Download from Google Play
Download from App Store