Cathryn Carson & Fernando Perez, Part 1 of 2
Description
Cathryn Carson is an Associate Professor of History and the Operational Lead of the Social Sciences D-Lab at UC Berkeley. Fernando Perez is a research scientist at the Henry H. Wheeler Jr. Brain Imaging Center at UC Berkeley. Both are involved with the Berkeley Institute for Data Science.
Transcript
Speaker 1: Spectrum's next.
Speaker 2: Okay. [inaudible] [inaudible].
Speaker 1: Welcome to Spectrum, the science [00:00:30 ] and technology show on KALX Berkeley, a biweekly 30-minute program bringing you interviews featuring Bay Area scientists and technologists as well as a calendar of local events and news.
Speaker 3: Hi, good afternoon. My name is Brad Swift. I'm the host of today's show. This week on Spectrum, we present part one of our two-part series on big data at Cal. The Berkeley Institute for Data Science, or BIDS, is only [00:01:00 ] four months old. Two people involved with shaping the institute are Cathryn Carson and Fernando Perez, and they are our guests. Cathryn Carson is an associate professor of history, associate dean of social sciences, and the operational lead of the Social Sciences Data Lab at UC Berkeley. Fernando Perez is a research scientist at the Henry H. Wheeler Jr. Brain Imaging Center at UC Berkeley. He created the IPython project while a graduate student in 2001 [00:01:30 ] and continues to lead the project. Here is part one. Cathryn Carson and Fernando Perez, welcome to Spectrum. Thanks for having us. And I wanted to get from both of you a short summary of the work you're doing now, just sort of your activity that predates your interest in data science.
Speaker 4: Data science is kind of an ill-defined term, I think, and it's still an open question precisely what it is, but in a certain sense all of my research has probably been under the umbrella [00:02:00 ] of what we call today data science since the start. I did my PhD in particle physics, but it was computational particle physics, and I was doing data analysis, in that case of models that were computationally created. So I've sort of been doing this really since I was a graduate student. What has changed over time is the breadth of disciplines that are interested in these kinds of problems and these kinds of tools, and that have these kinds of questions. In physics, this has been kind of a common way of working for a long time. Sort of the deep intersection [00:02:30 ] between computational tools and large data sets, whether they were created by models or collected experimentally, is something that has a long history in physics.
Speaker 4: How long? The first computers were created to solve differential equations, to plot the trajectories of ballistic missiles. It was one of the very first tasks that computers were created for, so almost since the dawn of computing. And so it's really only recently, though, that the size of the data sets has really jumped? Yes, the size has grown very, [00:03:00 ] very large in the last couple of decades, especially in the last decade, but I think it's important not to get too hung up on the issue of size. When we talk about data science, I like to define it in the context of data that is large for the traditional framework, tools, and conceptual structure of a given discipline, rather than its raw absolute size. Yes, in physics, for example, we have some of the largest data sets in existence. Things like what the LHC creates [00:03:30 ] for the Higgs boson, those data sets are just absurdly large. But in a given discipline, five megabytes of data might be a lot, depending on what it is that you're trying to ask. And so I think it's much, much more important to think of data that has grown larger than a given discipline was used to manipulating, and that therefore poses interesting challenges for that given domain, rather than being completely focused on the raw size of the data.
Speaker 1: I approach this from an angle that's actually complementary to Fernando's, in part because [00:04:00 ] my job as the interim director of the Social Sciences Data Laboratory is not to do data science but to provide the infrastructure, the setting, for researchers across the social sciences here who are doing that for themselves. And exactly in the social sciences you see a nice exemplification of the challenge of larger sizes of data than were previously used, and new kinds of data as well. So the social sciences are starting to pick up, say, on [00:04:30 ] data from sensors that have been placed in environmental settings in order to monitor human behavior. Social scientists can then use that to design tests around it, or to develop ways of interpreting it to answer research questions that were not necessarily anticipated by the folks who put the sensors in place. Or they are accessing data that comes out of human interactions online, which is created for entirely different purposes [00:05:00 ] but makes it possible for social scientists to understand things about human social networks.
Speaker 1: So the challenges of building capacity for disciplines to move into new scales of data sets and new kinds of data sets are one of the ones that I've been seeing as I've been building up D-Lab, and that we've jointly been seeing as we've tried to help scope out what the task of the Berkeley Institute for Data Science is going to be. How about the emergence [00:05:30 ] of data science? Do you have a sense of the timeline when you started to take note of its feasibility for social sciences, irrespective of physics, which has a longer history? One of the places that's been driving the conversations in social sciences is actually the funding regime, in that the existing beautifully curated data sets that we have from the post-World War II period, survey data principally, administrative data on top of that, [00:06:00 ] are extremely expensive to produce and to curate and maintain.
Speaker 1: And as the social sciences in only the last five to 10 years have been weighing the portfolio of data sources that are supported by funding agencies, we've been forced to confront the fact that the maintenance of the post-World War II regime of surveying may not be feasible into the future, and that we're going to have to be shifting to other kinds of data that are generated [00:06:30 ] for other purposes, and repurposing and reusing it, finding new ways to cut it and slice it in order to answer new kinds of questions that also weren't accessible to the old surveys. So one way to approach it is through the infrastructure that's needed to generate the data that we're looking at. Another way is simply to look at the infrastructure on campus. One of the launching impetuses for the Social Sciences Data Laboratory was in fact the budget cuts of 2009 [00:07:00 ] here on campus, when we acknowledged that if we were going to support cutting-edge, methodologically innovative social science on this campus, we were going to need to find ways to repurpose existing assets and redirect them towards whatever this new frontier in social science is going to be.
Speaker 5: You are listening to Spectrum on KALX Berkeley. Cathryn Carson and Fernando Perez are our guests. [00:07:30 ] They are part of the Berkeley Institute for Data Science, known as BIDS.
Speaker 4: Fernando, you sort of gave us a generalized definition of data science. Do you want to give it another go, just in case you evoke something else? Sure. I want to leave that question slightly unanswered, because I feel that to some extent, one of the challenges we have as an intellectual effort that we're trying to tackle at the Berkeley [00:08:00 ] Institute for Data Science is precisely working on what this field is, right? I don't want to presuppose that we have a final answer on this question, but at least we do know that we have some elements to frame the question, and I think it's mostly about an intersection. It's about an intersection of things that were being done already on their own, but that were often being done in isolation. So it's the intersection of methodological work, where by that I mean things like statistical theory, applied mathematics, computer science, [00:08:30 ] algorithm development, all of the computational and theoretical mathematical machinery that has been done traditionally; the questions arising from domain disciplines that may have models, that may have data sets, that may have sensors, that may have a telescope, or that may have a gene sequencing array, and where they have their own theoretical models of their organisms or galaxies or whatever it is, and where that data can be inscribed; and the fact that tools need to be built.
Speaker 4: This data doesn't get analyzed by blackboards. This data gets analyzed by software, but this is software that is deeply woven [00:09:00 ] into the fabric of these other two spaces, right? It's software that has to be written with the knowledge of the questions and the discipline and the domain, and also with the knowledge of the methodology, the theory. It's that intersection of this triad of things, of concrete representation in computational machinery, abstract ideas and methodologies, and domain questions, that in many ways creates something new when the work has to be done simultaneously with enough depth and enough rigor in all [00:09:30 ] three of these directions, and precisely that intersection is where the bottleneck is now proving to be, because you can have the ideas, you can have the questions, you can have the data, you can h