The Python Data & Science Podcast.__init__

Author: Tobias Macey

Subscribed: 4,783 · Played: 80,000

Description

The weekly podcast about Python and its use in science, research, and data. Tune in for engaging, educational, and technical discussions about machine learning, artificial intelligence, data science, and scientific research that is powered by Python.
323 Episodes
The foundation of every ML model is the data that it is trained on. In many cases you will be working with tabular or unstructured information, but there is a growing trend toward networked, or graph, data sets. Benedek Rozemberczki has focused his research and career on graph machine learning applications. In this episode he discusses the common sources of networked data and the challenges of working with graph data in machine learning projects, and describes the libraries that he has created to help him in his work. If you are dealing with connected data, this interview will provide a wealth of context and resources to improve your projects.
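For a taste of the kind of tooling discussed, here is a minimal sketch of node embedding with Karate Club, one of the graph ML libraries Benedek has published; the synthetic graph and default parameters are illustrative assumptions, not code from the episode.

```python
# A minimal sketch of graph machine learning with Karate Club
# (pip install karateclub networkx); the toy graph is a stand-in
# for real networked data.
import networkx as nx
from karateclub import DeepWalk

# A small synthetic social network with nodes labeled 0..99.
graph = nx.newman_watts_strogatz_graph(100, 10, 0.05)

model = DeepWalk()                  # random-walk based node embedding
model.fit(graph)                    # expects an undirected NetworkX graph
embedding = model.get_embedding()   # one dense vector per node

print(embedding.shape)              # (100, 128) with default settings
```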
The growth of analytics has accelerated the use of SQL as a first-class language. It has also increased the amount of collaboration involved in writing and maintaining SQL queries. With collaboration comes inevitable variation in how queries are written, both structurally and stylistically, which can lead to a significant amount of wasted time and energy during code review and employee onboarding. Alan Cruickshank was feeling the pain of this wasted effort first-hand, which led him down the path of creating SQLFluff as a linter and formatter to enforce consistency and find bugs in the SQL code that he and his team were working with. In this episode he shares the story of how SQLFluff evolved from a simple hackathon project to an open source linter that is used across a range of companies and fosters a growing community of users and contributors. He explains how it has grown to support multiple dialects of SQL, as well as integrating with projects like dbt to handle templated queries. This is a great conversation about the long detours that are sometimes necessary to reach your original destination and the powerful impact that good tooling can have on team productivity.
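For a sense of how SQLFluff is used, here is a minimal sketch of its Python API; the command line (`sqlfluff lint my_query.sql --dialect ansi`) is the more common entry point, and the messy query is just an example.

```python
# A minimal sketch of SQLFluff's simple Python API (pip install sqlfluff).
import sqlfluff

query = "SeLEct  a,b from my_table"

# lint() returns a list of violation dicts (rule code, line, description, ...).
for violation in sqlfluff.lint(query, dialect="ansi"):
    print(violation)

# fix() returns the query rewritten to satisfy the fixable rules.
print(sqlfluff.fix(query, dialect="ansi"))
```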
Deep learning is gaining an immense amount of popularity due to the incredible results that it is able to offer with comparatively little effort. Because of this, there are a number of engineers who are trying their hand at building machine learning models with the wealth of frameworks that are available. Andrew Ferlitsch wrote a book to capture the useful patterns and best practices for building models with deep learning to make it more approachable for newcomers to the field. In this episode he shares his deep expertise and extensive experience in building and teaching machine learning across many companies and industries. This is an entertaining and educational conversation about how to build maintainable models across a variety of applications.
Unit tests are an important tool to ensure the proper functioning of your application, but writing them can be a chore. Stephan Lukasczyk wants to reduce the monotony of the process for Python developers. As part of his PhD research he created the Pynguin project to automate the creation of unit tests. In this episode he explains the complexity involved in generating useful tests for a dynamic language, how he has designed Pynguin to address the challenges, and how you can start using it today for your own work.
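Pynguin is driven from the command line and pointed at a module; as a rough sketch (the flag names follow its documented interface but may differ between releases), given a small, type-annotated module like this, Pynguin searches for inputs and assertions to build a test suite:

```python
# example.py -- a small module for Pynguin to target; type hints help
# guide the test generation. Then, roughly:
#
#   pynguin --project-path . --module-name example --output-path ./tests
#
def clamp(value: int, low: int, high: int) -> int:
    """Constrain value to the inclusive range [low, high]."""
    if value < low:
        return low
    if value > high:
        return high
    return value
```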
Natural language processing is a powerful tool for extracting insights from large volumes of text. With the growth of the internet and social platforms, and the increasing number of people and communities conducting their professional and personal activities online, the opportunities for NLP to create amazing insights and experiences are endless. In order to work with such a large and growing corpus it has become necessary to move beyond purely statistical methods and embrace the capabilities of deep learning, and transfer learning in particular. In this episode Paul Azunre shares his journey into the application and implementation of transfer learning for natural language processing. This is a fascinating look at the possibilities of emerging machine learning techniques for transforming the ways that we interact with technology.
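This is not code from the book, but a few lines with Hugging Face Transformers illustrate the core idea of transfer learning for NLP: reuse a model pre-trained on a huge corpus for your downstream task.

```python
# An illustrative sketch of transfer learning for NLP
# (pip install transformers); not an example from the book.
from transformers import pipeline

# Downloads a pre-trained sentiment model on first run and reuses
# its learned language representations for classification.
classifier = pipeline("sentiment-analysis")

print(classifier("Transfer learning makes NLP far more accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```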
Machine learning has typically been performed on large volumes of data in one place. As more computing happens at the edge on mobile and low-power devices, the learning is being federated, which brings a new set of challenges. Daniel Beutel co-created the Flower framework to make federated learning more manageable. In this episode he shares his motivations for starting the project, how you can use it for your own work, and the unique challenges and benefits that this emerging model offers. This is a great exploration of the federated learning space and a framework that makes it more approachable.
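As a rough sketch of the programming model, a Flower client wraps local training behind a small interface; note that Flower's API has evolved across releases, so names like start_numpy_client should be checked against the version you install.

```python
# A minimal sketch of a Flower federated learning client (pip install flwr);
# API names are version-dependent.
import flwr as fl
import numpy as np

class TinyClient(fl.client.NumPyClient):
    def __init__(self):
        self.weights = [np.zeros(3)]  # stand-in for real model parameters

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        # Real clients train locally on private data here.
        self.weights = parameters
        return self.weights, 1, {}

    def evaluate(self, parameters, config):
        return 0.0, 1, {}  # loss, number of examples, metrics

# Connects to a Flower server started elsewhere (fl.server.start_server()).
fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=TinyClient())
```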
Data exploration is an important step in any analysis or machine learning project. Visualizing the data that you are working with makes that exploration faster and more effective, but having to remember and write all of the code to build a scatter plot or histogram is tedious and time consuming. In order to eliminate that friction Doris Lee helped create the Lux project, which wraps your Pandas data frame and automatically generates a set of visualizations without you having to lift a finger. In this episode she explains how Lux works under the hood, what inspired her to create it in the first place, and how it can help you create a better end result. The Lux project is a valuable addition to the toolbox of anyone who is doing data wrangling with Pandas.
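The workflow is close to a one-liner; here is a minimal sketch, where the CSV file and column names are hypothetical.

```python
# A minimal sketch of Lux in a Jupyter notebook (pip install lux-api).
# Importing lux augments pandas DataFrames with recommendations.
import lux
import pandas as pd

df = pd.read_csv("my_data.csv")  # any tabular dataset

# Displaying the frame in a notebook cell now offers a toggle between the
# plain table and a gallery of automatically recommended visualizations.
df

# Steer the recommendations by declaring the columns you care about.
df.intent = ["column_a", "column_b"]
df
```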
Any project that is used by more than one person will eventually need to handle permissions for each of those users. It is certainly possible to write that logic yourself, but you'll almost certainly do it wrong at least once. Rather than waste your time fighting with bugs in your authorization code it makes sense to use a well-maintained library that has already made and fixed all of the mistakes so that you don't have to. In this episode Sam Scott shares the Oso framework to give you a clean separation between your authorization policies and your application code. He explains how you can call a simple function to ask if something is allowed, and then manage the complex rules that match your particular needs as a separate concern. He describes the motivation for building a domain specific language based on logic programming for policy definitions, how it integrates with the host language (such as Python), and how you can start using it in your own applications today. This is a must-listen even if you never use the project because it is a great exploration of all of the incidental complexity that is involved in permissions management.
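A minimal sketch of what that separation looks like with the oso library: the policy lives in Polar, Oso's logic-programming DSL, while the application just asks a yes/no question (the User class and the policy here are illustrative).

```python
# A minimal sketch of authorization with Oso (pip install oso).
from oso import Oso

class User:
    def __init__(self, role):
        self.role = role

oso = Oso()
oso.register_class(User)

# Polar policy: any admin may read any resource.
oso.load_str('allow(user: User, "read", _resource) if user.role = "admin";')

print(oso.is_allowed(User("admin"), "read", "report"))   # True
print(oso.is_allowed(User("viewer"), "read", "report"))  # False
```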
Being able to present your ideas is one of the most valuable and powerful skills to have as a professional, regardless of your industry. For software engineers it is especially important to be able to communicate clearly and effectively because of the detail-oriented nature of the work. Unfortunately, many people who work in software are more comfortable in front of the keyboard than a crowd. In this episode Neil Thompson shares his story of being an accidental public speaker and how he is helping other engineers start down the road of being effective presenters. He discusses the benefits for your career, how to build the skills, and how to find opportunities to practice them. Even if you never want to speak at a conference, it's still worth your while to listen to Neil's advice and find ways to level up your presentation and speaking skills.
One of the great promises of computers is that they will make our work faster and easier, so why do we all spend so much time manually copying data from websites, or entering information into web forms, or any of the other tedious tasks that take up our time? As developers our first inclination is to "just write a script" to automate things, but how do you share that with your non-technical co-workers? In this episode Antti Karjalainen, CEO and co-founder of Robocorp, explains how Robotic Process Automation (RPA) can help us all cut down on time-wasting tasks and let the computers do what they're supposed to. He shares how he got involved in the RPA industry, his work with Robot Framework and RPA framework, how to build and distribute bots, and how to decide if a task is worth automating. If you're sick of spending your time on mind-numbing copy and paste then give this episode a listen and then let the robots do the work for you.
When you are writing code it is all too easy to introduce subtle bugs or leave behind unused code: unused variables, unused imports, overly complex logic, and so on. If you are careful and diligent you can find these problems yourself, but isn't that what computers are supposed to help you with? Thankfully Python has a wealth of tools that will work with you to keep your code clean and maintainable. In this episode Anthony Sottile explores Flake8, one of the most popular options for identifying those problematic lines of code. He shares how he became involved in the project and took over as maintainer, and explains the different categories of code quality tooling and how Flake8 compares to other static analyzers. He also discusses the ecosystem of plugins that has grown up around it, including some detailed examples of how you can write your own (and why you might want to).
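To give a flavor of that plugin interface, here is a minimal sketch of an AST-based Flake8 checker; the plugin name and error code are made up, and a real plugin is registered through a flake8.extension entry point in its package metadata.

```python
# A minimal sketch of a Flake8 plugin that flags print() calls.
import ast

class PrintChecker:
    name = "flake8-no-print"   # hypothetical plugin name
    version = "0.1.0"

    def __init__(self, tree):
        self.tree = tree  # Flake8 hands AST plugins the parsed module

    def run(self):
        for node in ast.walk(self.tree):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == "print"):
                # Yield (line, column, message, checker class); the message
                # must start with the error code ("X100" is made up).
                yield node.lineno, node.col_offset, "X100 print() call found", type(self)
```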
Writing code that is easy to read and understand will have a lasting impact on you and your teammates over the life of a project. Sometimes it can be difficult to identify opportunities for simplifying a block of code, especially if you are early in your journey as a developer. If you work with senior engineers they can help by pointing out ways to refactor your code to be more readable, but they aren't always available. Brendan Maginnis and Nick Thapen created Sourcery to act as a full time pair programmer sitting in your editor of choice, offering suggestions and automatically refactoring your Python code. In this episode they share their journey of building a tool to automatically find opportunities for refactoring in your code, including how it works under the hood, the types of refactoring that it supports currently, and how you can start using it in your own work today. It always pays to keep your tool box organized and your tools sharp and Sourcery is definitely worth adding to your repertoire.
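This is not Sourcery's internals, just an illustration of the kind of suggestion it surfaces inline in your editor: collapsing an accumulator loop into a comprehension.

```python
# Before: a common accumulator pattern.
def active_names(users):
    result = []
    for user in users:
        if user.active:
            result.append(user.name)
    return result

# After accepting a "convert for loop into list comprehension" suggestion:
def active_names(users):
    return [user.name for user in users if user.active]
```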
Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don't all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.
Python is a language that is used in almost every imaginable context and by people from an amazing range of backgrounds. A lot of the people who use it wouldn't even call themselves programmers, because that is not the primary focus of their job. In this episode Chris Moffitt shares his experience writing Python as a business user. In order to share his insights and help others who have run up against the limits of Excel he maintains the site Practical Business Python where he publishes articles that help introduce newcomers to Python and explain how to perform tasks such as building reports, automating Excel files, and doing data analysis. This is a great conversation that illustrates how useful it is to learn Python even if you never intend to write software professionally.
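As a flavor of the kind of task the site covers, here is a minimal sketch of automating an Excel report with pandas; the file and column names are hypothetical.

```python
# A minimal sketch of an Excel report with pandas
# (pip install pandas openpyxl); sales.xlsx is a hypothetical file.
import pandas as pd

df = pd.read_excel("sales.xlsx")

# Summarize revenue by region and product, like a pivot table in Excel.
report = df.pivot_table(index="region", columns="product",
                        values="revenue", aggfunc="sum", fill_value=0)

report.to_excel("sales_summary.xlsx")
```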
There are a large and growing number of businesses built by and for data science and machine learning teams that rely on Python. Tony Liu is a venture investor who is following that market closely and betting on its continued success. In this episode he shares his own journey into the role of an investor and discusses what he is most excited about in the industry. He also explains what he looks at when investing in a business and gives advice on what potential founders and early employees of startups should be thinking about when starting on that journey.
Jupyter notebooks are a dominant tool for data scientists, but they lack a number of conveniences for building reusable and maintainable systems. For machine learning projects in particular there is a need for being able to pivot from exploring a particular dataset or problem to integrating that solution into a larger workflow. Rick Lamers and Yannick Perrenet were tired of struggling with one-off solutions when they created the Orchest platform. In this episode they explain how Orchest allows you to turn your notebooks into executable components that are integrated into a graph of execution for running end-to-end machine learning workflows.
When you are writing a script it can become unwieldy to understand how the logic and data are flowing through the program. To make this easier to follow you can use a flow-based approach to building your programs. Leon Thomm created the Ryven project as an environment for visually constructing a flow-based program. In this episode he shares his inspiration for creating the Ryven project, how it changes the way you think about program design, how Ryven is implemented, and how to get started with it for your own programs.
One of the perennial challenges in software engineering is to reduce the opportunity for bugs to creep into the system. Some of the tools in our arsenal that help in this endeavor include rich type systems, static analysis, writing tests, well defined interfaces, and linting. Phillip Schanely created the CrossHair project in order to add another ally in the fight against broken code. It sits somewhere between type systems, automated test generation, and static analysis. In this episode he explains his motivation for creating it, how he uses it for his own projects, and how to start incorporating it into yours. He also discusses the utility of writing contracts for your functions, and the differences between property based testing and SMT solvers. This is an interesting and informative conversation about some of the more nuanced aspects of how to write well-behaved programs.
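To make the idea of contracts concrete, here is a minimal sketch using CrossHair's docstring contract syntax (one of several styles it supports); you would then run `crosshair check` against the file, and CrossHair uses an SMT solver to hunt for inputs that violate the postcondition.

```python
# A minimal sketch of CrossHair contracts; check with:
#   crosshair check example.py
from typing import List

def average(numbers: List[float]) -> float:
    """
    pre: len(numbers) > 0
    post: min(numbers) <= __return__ <= max(numbers)
    """
    return sum(numbers) / len(numbers)
```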
Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but they are all solving a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments, and serve as a way to promote collaboration on open source data science projects.
Creating well designed software is largely a problem of context and understanding. The majority of programming environments rely on documentation, tests, and code being logically separated despite being contextually linked. In order to weave all of these concerns together there have been many efforts to create a literate programming environment. In this episode Jeremy Howard of fast.ai fame and Hamel Husain of GitHub share the work they have done on nbdev. They explain how it allows you to weave together documentation, code, and tests in the same context so that it is more natural to explore and build understanding when working on a project. It is built on top of the Jupyter environment, allowing you to take advantage of the other great elements of that ecosystem, and it provides a number of excellent out-of-the-box features to reduce the friction in adopting good project hygiene, including continuous integration and well designed documentation sites. Regardless of whether you have been programming for 5 days, 5 years, or 5 decades you should take a look at nbdev to experience a different way of looking at your code.
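As a taste of the workflow, notebook cells in nbdev are tagged with directives that route them into the library, the docs, or the test suite; the sketch below follows nbdev's newer `#|` directive style, which has changed between major versions.

```python
# A sketch of nbdev directives inside notebook cells.

#| default_exp core        # first cell: the module these cells export to

#| export
def say_hello(name: str) -> str:
    "Greet `name`; the docstring flows into the generated documentation."
    return f"Hello, {name}!"

# An untagged cell serves as both documentation and a test when the
# notebooks are executed (e.g. by nbdev's CI helpers).
assert say_hello("world") == "Hello, world!"
```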
Comments (3)

Antonio Andrade (Jan 14th): terrible audio this time

Nihan Dip (Sep 21st): this Masonite dude is so full of himself 😂

Antonio Andrade (May 27th): Tobias, are you a robot? nice podcast