O'Reilly Data Show - O'Reilly Media Podcast

Author: O'Reilly Media



The O'Reilly Data Show explores the opportunities and techniques driving big data, data science, and AI. Through interviews and analysis, we highlight the people putting data to work.
111 Episodes
The O’Reilly Data Show Podcast: Roger Chen on the fair value and decentralized governance of data.

In this episode of the Data Show, I spoke with Roger Chen, co-founder and CEO of Computable Labs, a startup focused on building tools for the creation of data networks and data exchanges. Chen has also served as co-chair of O'Reilly's Artificial Intelligence Conference since its inception in 2016. This conversation took place the day after Chen and his collaborators released an interesting new white paper, Fair value and decentralized governance of data.

Current-generation AI and machine learning technologies rely on large amounts of data, and to the extent they can use their large user bases to create “data silos,” large companies in large countries (like the U.S. and China) enjoy a competitive advantage. With that said, we are awash in articles about the dangers posed by these data silos. Privacy and security, disinformation, bias, and a lack of transparency and control are just some of the issues that have plagued the perceived owners of “data monopolies.”

In recent years, researchers and practitioners have begun building tools focused on helping organizations acquire, build, and share high-quality data. Chen and his collaborators are doing some of the most interesting work in this space, and I recommend their new white paper and accompanying open source projects.

Sequence of basic market transactions in the Computable Labs protocol. Source: Roger Chen, used with permission.

We had a great conversation spanning many topics, including:
- Why he chose to focus on data governance and data markets.
- The unique and fundamental challenges in accurately pricing data.
- The importance of data lineage and provenance, and the approach they took in their proposed protocol.
- What cooperative governance is and why it's necessary.
- How their protocol discourages an unscrupulous user from just scraping all data available in a data market.

Related resources:
- Roger Chen: “Data liquidity in the age of inference”
- Ihab Ilyas and Ben Lorica on “The quest for high-quality data”
- Chris Ré: “Software 2.0 and Snorkel”
- Alex Ratner on “Creating large training data sets quickly”
- Jeff Jonas on “Real-time entity resolution made accessible”
- “Data collection and data markets in the age of privacy and machine learning”
- Guillaume Chaslot on “The importance of transparency and user control in machine learning”
The O'Reilly Data Show: Ben Lorica chats with Jeff Meyerson of Software Engineering Daily about data engineering, data architecture and infrastructure, and machine learning.

In this week's episode of the Data Show, we're featuring an interview Data Show host Ben Lorica participated in for the Software Engineering Daily Podcast, where he was interviewed by Jeff Meyerson. Their conversation mainly centered around data engineering, data architecture and infrastructure, and machine learning (ML).

Here are a few highlights:

Tools for productive collaboration

A data catalog, at a high level, basically answers questions around the data that's available and who is using it so an enterprise can understand access patterns. ... The term "data science platform" is generally used when you've gotten to the point where you have a team of data scientists and you need a place where they can use libraries in a setting where they can collaborate, and where they can share not only models but maybe even data pipelines and features. The more advanced data science platforms will have automation tools built in. ... The ideal scenario is the data science platform is not just for prototyping, but also for pushing things to production.

Tools for ML development

We have tools for software development, and now we're beginning to hear about tools for machine learning development: there's a company here at Strata called, and there's another startup called But what has really caught my attention is an open source project from Databricks called MLflow. When it first came out, I thought, 'Oh, yeah, so we don't have anything like this. Might have a decent chance of success.' But I didn't pay close attention until recently; fast forward to today, there are 80 contributors from 40 companies and 200+ companies using it.

What's good about MLflow is that it has three components and you're free to pick and choose: you can use one, two, or three. Based on their surveys, the most popular component is the one for tracking and managing machine learning experiments. It's designed to be useful for individual data scientists, but it's also designed to be used by teams of data scientists, so they have documented use cases of MLflow where you have a company managing thousands of models in production.
The O’Reilly Data Show Podcast: Nick Pentreath on overcoming challenges in productionizing machine learning models.

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

We had a great conversation spanning many topics, including:
- AI Fairness 360 (AIF360), a set of fairness metrics for data sets and machine learning models.
- Adversarial Robustness Toolbox (ART), a Python library for adversarial attacks and defenses.
- Model Asset eXchange (MAX), a curated and standardized collection of free and open source deep learning models.
- Tools for model development, governance, and operations, including MLflow, Seldon Core, and Fabric for Deep Learning.
- Reinforcement learning in the enterprise, and the emergence of relevant open source tools like Ray.

Related resources:
- “Modern Deep Learning: Tools and Techniques”, a new tutorial at the Artificial Intelligence conference in San Jose
- Harish Doddi on “Simplifying machine learning lifecycle management”
- Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
- “Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical
- “The evolution and expanding utility of Ray”
- “Local Interpretable Model-Agnostic Explanations (LIME): An Introduction”
- Forough Poursabzi-Sangdeh on why “It’s time for data scientists to collaborate with researchers in other disciplines”
The O’Reilly Data Show Podcast: Dhruba Borthakur and Shruti Bhat on enabling interactive analytics and data applications against live data.

In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing executive focused on enterprise software and data products. Their new startup is focused on a few trends I’ve recently been thinking about, including the re-emergence of real-time analytics, and the hunger for simpler data architectures and tools. Borthakur exemplifies the need for companies to continually evaluate new technologies: while he was the founding engineer for HDFS, these days he mostly works with object stores like S3.

We had a great conversation spanning many topics, including:
- RocksDB, an open source, embeddable key-value store that originated at Facebook and is used in several other open source projects.
- Time-series databases.
- The importance of having solutions for real-time analytics, particularly now with the renewed interest in IoT applications and rollout of 5G technologies.
- Use cases for Rockset’s technologies and, more generally, applications of real-time analytics.
- The Aggregator Leaf Tailer architecture as an alternative to the Lambda architecture.
- Building data infrastructure in the cloud.

The Aggregator Leaf Tailer (“CQRS for the data world”): A data architecture favored by web-scale companies. Source: Dhruba Borthakur, used with permission.

Related resources:
- Serverless Streaming Architectures & Algorithms for the Enterprise - a new tutorial on September 24th at Strata Data NYC.
- “Becoming a machine learning company means investing in foundational technologies”
- Haoyuan Li: “In the age of AI, fundamental value resides in data”
- Harish Doddi: “Simplifying machine learning lifecycle management”
- Eric Jonas: “A Berkeley view on serverless computing”
- “Specialized tools for machine learning development and model governance are becoming essential”
- Avner Braverman: “What data scientists and data engineers can do with current generation serverless technologies”
The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:
- Potential applications of data science in financial services.
- The current state of data science in financial services in both the U.S. and China.
- His experience recruiting, training, and managing data science teams in both the U.S. and China.

Here are some highlights from our conversation:

Opportunities in financial services

There's a customer acquisition piece and then there's a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it's a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. ... Once you have a specific cohort of users who you want to target, there's a need to be able to precisely convert them, which means understanding the stage of the customer's thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need.

... On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? They have to be profitable businesses, and for financial service to be profitable, there are operational considerations: quantifying risk requires a lot of data science; preventing fraud is really important; and there is garnering the long-term trust with the customer so they stay with you, which means having the work ethic to be able to take care of customer's data and able to serve the customer better with automated services whenever and wherever the customer is. It's all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term.

Opportunities in China

A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance; basically, the major areas for the financial industry.

For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies for FinTech, and I think the wave started way back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest data from Alipay is about 608 million users, and these are monthly active users we're talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that can generate and all the things you can see people buying to be able to understand how to serve the users better.

If you look at WeChat, they're boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.

Related resources:
- Kai-Fu Lee outlines the factors that enabled China's rapid ascension in AI
- Gary Kazantsev on how “Data science makes an impact on Wall Street”
- Juan Huerta on “Upcoming challenges and opportunities for data technologies in consumer finance”
- Geoffrey Bradway on “Programming collective intelligence for financial trading”
- Jason Dai on why “Companies in China are moving quickly to embrace AI technologies”
- Haoyuan Li on why “In the age of AI, fundamental value resides in data”
The O’Reilly Data Show Podcast: Jeff Jonas on the evolution of entity resolution technologies.

In this episode of the Data Show, I spoke with Jeff Jonas, CEO, founder and chief scientist of Senzing, a startup focused on making real-time entity resolution technologies broadly accessible. He was previously a fellow and chief scientist of context computing at IBM. Entity resolution (ER) refers to techniques and tools for identifying and linking manifestations of the same entity/object/individual. Ironically, ER itself has many different names (e.g., record linkage, duplicate detection, object consolidation/reconciliation, etc.).

ER is an essential first step in many domains, including marketing (cleaning up databases), law enforcement (background checks and counterterrorism), and financial services and investing. Knowing exactly who your customers are is an important task for security, fraud detection, marketing, and personalization. The proliferation of data sources and services has made ER very challenging in the internet age. In addition, many applications now increasingly require near real-time entity resolution.

We had a great conversation spanning many topics, including:
- Why ER is interesting and challenging
- How ER technologies have evolved over the years
- How Senzing is working to democratize ER by making real-time AI technologies accessible to developers
- Some early use cases for Senzing’s technologies
- Some items on their research agenda

Here are a few highlights from our conversation:

Entity resolution through the years

In the early '90s, I worked on a much more advanced version of entity resolution for the casinos in Las Vegas and created software called NORA, non-obvious relationship awareness. Its purpose was to help casinos better understand who they were doing business with. We would ingest data from the loyalty club, everybody making hotel reservations, people showing up without reservations, everybody applying for jobs, people terminated, vendors, and 18 different lists of different kinds of bad people, some of them card counters (which aren't that bad), some cheaters. And they wanted to figure out across all these identities when somebody was the same, and then when people were related. Some people were using 32 different names and a bunch of different social security numbers.

... Ultimately, IBM bought my company and this technology became what is known now at IBM as “identity insight.” Identity insight is a real-time entity resolution engine that gets used to solve many kinds of problems. MoneyGram implemented it and their fraud complaints dropped 72%. They saved a few hundred million just in their first few years.

... But while at IBM, I had a grand vision about a new type of entity resolution engine that would have been unlike anything that's ever existed. It's almost like a Swiss Army knife for ER.

Recent developments

The Senzing entity resolution engine works really well on two records from a domain that you've never even seen before. Say you've never done entity resolution on restaurants from Singapore. The first two records you feed it, it's really, really already smart. And then as you feed it more data, it gets smarter and smarter.

... So, there are two things that we've intertwined. One is common sense. One type of common sense is the names: Dick, Dickie, Richie, Rick, Ricardo are all part of the same name family. Why should it have to study millions and millions of records to learn that again?

... Next to common sense, there's real-time learning. In real-time learning, we do a few things. You might have somebody named Bob, but who now goes by a nickname or an alias of Andy. Eventually, you might come to learn that. So, now you know you have to learn over time that Bob also has this nickname, and Bob lived at three addresses, and this is his credit card number, and now he's got four phone numbers. So you want to learn those over time.

... These systems we're creating, our entity resolution systems, which really resolve entities and graph them (call it an index of identities and how they're related), never have to be reloaded. It literally cleans itself up in the past. You can do maintenance on it while you're querying it, while you're loading new transactional data, while you're loading historical data. There's nothing else like it that can work at this scale. It's really hard to do.

Related resources:
- Jeff Jonas on “Context Computing”
- David Ferrucci on why “Language understanding remains one of AI’s grand challenges”
- David Blei on “Topic models: Past, present, and future”
- “Lessons learned building natural language processing systems in health care”
- “Building a contacts graph from activity data”
- “Customer record deduplication using Spark and Reifier”
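The two ideas Jonas describes in this episode, built-in "common sense" about name families and learning new attributes of an entity as records arrive, can be illustrated with a toy sketch. This is not Senzing's engine or API: `NAME_FAMILIES`, `canonical_name`, `Entity`, and `Resolver` are hypothetical names, and the matching rule (a shared phone number, or the same name family plus a shared address) is an assumption chosen only to keep the example small.

```python
from dataclasses import dataclass, field

# Hypothetical "common sense": name families supplied up front, not learned.
NAME_FAMILIES = [
    {"richard", "dick", "dickie", "richie", "rick", "ricardo"},
    {"robert", "bob", "bobby", "rob"},
]


def canonical_name(first_name):
    """Map a first name to a stable representative of its name family."""
    name = first_name.strip().lower()
    for family in NAME_FAMILIES:
        if name in family:
            return min(family)
    return name


@dataclass
class Entity:
    names: set = field(default_factory=set)
    phones: set = field(default_factory=set)
    addresses: set = field(default_factory=set)


class Resolver:
    """Toy incremental resolver: each new record is matched on arrival."""

    def __init__(self):
        self.entities = []

    def add(self, first_name, phone=None, address=None):
        name = canonical_name(first_name)
        match = None
        for entity in self.entities:
            shared_phone = phone is not None and phone in entity.phones
            shared_address = address is not None and address in entity.addresses
            # Assumed rule: a shared phone is enough on its own; a shared
            # address only counts when the name family also matches.
            if shared_phone or (name in entity.names and shared_address):
                match = entity
                break
        if match is None:
            match = Entity()
            self.entities.append(match)
        # "Real-time learning": the entity accumulates aliases and identifiers.
        match.names.add(name)
        if phone is not None:
            match.phones.add(phone)
        if address is not None:
            match.addresses.add(address)
        return match


resolver = Resolver()
first = resolver.add("Dick", phone="555-0100", address="1 Main St")
second = resolver.add("Ricardo", address="1 Main St")   # same family, same address
third = resolver.add("Andy", phone="555-0100")          # new alias, known phone
other = resolver.add("Andy", phone="555-0199")          # no shared identifier
```

In this sketch the first three records resolve to one entity, which picks up "andy" as an alias, while the last record starts a new entity. A real engine would also need to merge existing entities when a later record links them, which is the "cleans itself up in the past" behavior mentioned in the interview and is deliberately left out here.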
The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms.

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data, specifically labeled data for training models. There are also many other considerations (including security, privacy, and reliability/safety) that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems, including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It's something that is born out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I'm describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey and exactly what constituted that data to come into being into your data warehouse or any other storage appliance you use, that would be really useful.

... Think about data lineage as helping with issues about quality of data, understanding if something is corrupted. On the security side, think of GDPR ... which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It makes maintenance easier. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you're working with. If you're working with 10 databases, you need to know what's going on in them. If I have to give you a vision of a data lineage system, think of it as a final graph or view of some data set, and it shows you a graph of what it's linked to. Then it gives you some metadata information so you can drill down. Let's say you have corrupted data, let's say you want to debug something. All these cases tie into the actual use cases for which we want to build it.

Related resources:
- “Deep automation in machine learning”
- Vitaly Gordon on “Building tools for enterprise data science”
- “Managing risk in machine learning”
- Haoyuan Li explains why “In the age of AI, fundamental value resides in data”
- “What machine learning means for software development”
- Joe Hellerstein on how “Metadata services can lead to performance and organizational improvements”
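Salian's picture of a lineage system, a graph showing what a data set is linked to plus metadata to drill into, can be sketched in a few lines. This is only an illustration under stated assumptions, not the system built at Stitch Fix: `LineageGraph`, its methods, and the data set names (`raw_orders`, `clean_orders`, and so on) are all hypothetical.

```python
from collections import deque


class LineageGraph:
    """Toy lineage store: which data sets each data set was derived from."""

    def __init__(self):
        self.parents = {}    # data set name -> set of direct upstream names
        self.metadata = {}   # data set name -> dict of descriptive metadata

    def record(self, output, inputs, **metadata):
        """Register that `output` was produced from `inputs`."""
        self.parents.setdefault(output, set()).update(inputs)
        for name in inputs:
            self.parents.setdefault(name, set())
        self.metadata.setdefault(output, {}).update(metadata)

    def upstream(self, dataset):
        """Everything `dataset` depends on, directly or transitively."""
        seen = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen

    def downstream(self, dataset):
        """Everything that could be affected if `dataset` is corrupted."""
        return {name for name in self.parents if dataset in self.upstream(name)}


graph = LineageGraph()
graph.record("clean_orders", ["raw_orders"], job="dedupe", owner="data-eng")
graph.record("recommendations", ["clean_orders", "raw_clicks"], job="train_recs")

sources = graph.upstream("recommendations")
affected = graph.downstream("raw_orders")
```

Here `sources` answers the "how did this data come about" question for `recommendations`, and `affected` answers the debugging question of what to recheck if `raw_orders` turns out to be corrupted. The metadata dictionary stands in for the drill-down information mentioned in the interview.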
The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out (e.g., things like PyWren that target scientists), I expect these numbers to rise.

We had a great conversation spanning many topics, including:
- A short history of cloud computing.
- The fundamental differences between serverless and conventional cloud computing.
- The reasons serverless, specifically AWS Lambda, took off so quickly.
- What data scientists and data engineers can do with the current generation of serverless offerings.
- What is missing from serverless today and what users should expect in the near future.

Related resources:
- “The evolution and expanding utility of Ray”
- Results of a new survey: “Evolving Data Infrastructure: Tools and Best Practices for Advanced Analytics and AI”
- Eric Jonas on “Building accessible tools for large-scale computation and machine learning”
- “7 data trends on our radar”
- “Handling real-time data operations in the enterprise”
- “Progress for big data in Kubernetes”
The O’Reilly Data Show Podcast: Forough Poursabzi-Sangdeh on the interdisciplinary nature of interpretable and interactive machine learning.

In this episode of the Data Show, I spoke with Forough Poursabzi-Sangdeh, a postdoctoral researcher at Microsoft Research New York City. Poursabzi works in the interdisciplinary area of interpretable and interactive machine learning. As models and algorithms become more widespread, many important considerations are becoming active research areas: fairness and bias, safety and reliability, security and privacy, and Poursabzi’s area of focus, explainability and interpretability.

We had a great conversation spanning many topics, including:
- Current best practices and state-of-the-art methods used to explain or interpret deep learning or, more generally, machine learning models.
- The limitations of current model interpretability methods.
- The lack of clear/standard metrics for comparing different approaches used for model interpretability.
- Many current AI and machine learning applications augment humans, and, thus, Poursabzi believes it’s important for data scientists to work closely with researchers in other disciplines.
- The importance of using human subjects in model interpretability studies.

Related resources:
- “Local Interpretable Model-Agnostic Explanations (LIME): An Introduction”
- “Interpreting predictive models with Skater: Unboxing model opacity”
- Jacob Ward on “How social science research can inform the design of AI systems”
- Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
- “Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical
- Francesca Lazzeri and Jaya Mathew on “Lessons learned while helping enterprises adopt machine learning”
- Jerry Overton on “Teaching and implementing data science and AI in the enterprise”
The O’Reilly Data Show Podcast: Kartik Hosanagar on the growing power and sophistication of algorithms.

In this episode of the Data Show, I spoke with Kartik Hosanagar, professor of technology and digital business, and professor of marketing at The Wharton School of the University of Pennsylvania. Hosanagar is also the author of a newly released book, A Human’s Guide to Machine Intelligence, an interesting tour through the recent evolution of AI applications that draws from his extensive experience at the intersection of business and technology.

We had a great conversation spanning many topics, including:
- The types of unanticipated consequences of which algorithm designers should be aware.
- The predictability-resilience paradox: as systems become more intelligent and dynamic, they also become more unpredictable, so there are trade-offs algorithm designers must face.
- Managing risk in machine learning: AI application designers need to weigh considerations such as fairness, security, privacy, explainability, safety, and reliability.
- A bill of rights for humans impacted by the growing power and sophistication of algorithms.
- Some best practices for bringing AI into the enterprise.

Related resources:
- “Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical
- Francesca Lazzeri and Jaya Mathew on “Lessons learned while helping enterprises adopt machine learning”
- Jerry Overton on “Teaching and implementing data science and AI in the enterprise”
- Kris Hammond on “Bringing AI into the enterprise”
- Jacob Ward on “How social science research can inform the design of AI systems”
- “Overcoming barriers to AI adoption”
- Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”