AI explained: Open-source AI
Description
Reed Smith partners Howard Womersley Smith and Bryan Tan, with AI Verify community manager Harish Pillay, discuss why transparency and explainability in AI solutions are essential, especially for clients who will not accept a “black box” explanation. Subscribers to AI models claiming to be “open source” may be disappointed to learn the model had proprietary material mixed in, which might cause issues. The session describes a growing effort to learn how to track and understand the inputs used in training AI systems.
Transcript:
Intro: Hello and welcome to Tech Law Talks, a podcast brought to you by Reed Smith's Emerging Technologies Group. In each episode of this podcast, we will discuss cutting-edge issues on technology, data, and the law. We will provide practical observations on a wide variety of technology and data topics to give you quick and actionable tips to address the issues you are dealing with every day.
Bryan: Welcome to Tech Law Talks and our new series on artificial intelligence. Over the coming months, we'll explore the key challenges and opportunities within the rapidly evolving AI landscape. My name is Bryan Tan and I'm a partner at Reed Smith Singapore. Today we will focus on AI and open source software.
Howard: My name is Howard Womersley Smith. I'm a partner in the Emerging Technologies team of Reed Smith in London and New York. And I'm very pleased to be in this podcast today with Bryan and Harish.
Bryan: Great. And so today we have with us Mr. Harish Pillay. And before we start, I'm going to just ask Harish to tell us a little bit, well, not really a little bit, because he's done a lot, about himself and how he got here.
Harish: Well, thanks, Bryan. Thanks, Howard. My name is Harish Pillay. I'm based here in Singapore, and I've been in the tech space for over 30 years. And I did a lot of things primarily in the open source world, both open source software, as well as in the hardware design and so on. So I've covered the spectrum. When I was way back in the graduate school, I did things in AI and chip design. That was in the late 1980s. And there was not much from an AI point of view that I could do then. It was the second winter for AI. But in the last few years, there was the resurgence in AI and the technologies and the opportunities that can happen with the newer ways of doing things with AI make a lot more sense. So now I'm part of an organization here in Singapore known as AI Verify Foundation. It is a non-profit open-source software foundation that was set up about a year ago to provide tools, software testing tools, to test AI solutions that people may be creating to understand whether those tools are fair, are unbiased, are transparent. There's about 11 criteria it tests against. So both traditional AI types of solutions as well as generative AI solutions. So these are the two open source projects that are globally available for anyone to participate in. So that's currently what I'm doing.
Bryan: Wow, that's really fascinating. Would you say, Harish, that kind of your experience over the, I guess, the three decades with the open source movement, with the whole Linux user groups, has that kind of culminated in this place where now there's an opportunity to kind of shape the development of AI in an open-source context?
Harish: I think we need to put some parameters around it as well. The AI that we talk about today could never have happened if it's not for open-source tools. That is plain and simple. So things like TensorFlow and all the tooling that goes around in trying to do the model building and so on and so forth could not have happened without open source tools and libraries, a Python library and a whole slew of other tools. If these were all dependent on non-open source solutions, we will still be talking about one fine day something is going to happen. So it's a given that that's the baseline. Now, what we need to do is to get this to the next level of understanding as to what does it mean when you say it's open source and artificial intelligence or open source AI, for that matter. Because now we have a different problem that we are trying to grapple with. The problem we're trying to grapple with is the definition of what is open-source AI. We understand open-source from a software point of view, from a hardware point of view. We understand that I have access to the code, I have access to the chip designs, and so on and so forth. No questions there. It's very clear to understand. But when you talk about generative AI as a specific instance of open-source AI, I can have access to the models. I can have access to the weights. I can do those kinds of stuff. But what was it that made those models become the models? Where were the data from? What's the data? What's the provenance of the data? Are these data openly available? Or are they hidden away somewhere? Understandably, we have a huge problem because in order to train the kind of models we're training today, it takes a significant amount of data and computing power to train the models. The average software developer does not have the resources to do that, like what we could do with a Linux environment or Apache or Firefox or anything like that. So there is this problem. 
So the question still comes back to, what is open source AI? So the Open Source Initiative, OSI, is now in the process of formulating what it means to have open source AI. The challenge we find today is that because of the success of open source in every sector of the industry, you find a lot of organizations now bandying around and throwing around the label, “our stuff is open source, our stuff is open source,” when it is not. And they are conveniently using it as a means to gain attention and so on. No one is going to come and say, “hey, do you have a proprietary tool?” That ship has sailed. It's not going to happen anymore. But the moment you say, “oh, we have an open source fancy tool,” oh, everybody wants to come and talk to you. But the way they craft that open source message is actually, quite sadly, disingenuous because they are putting restrictions on what you can actually do. It is completely contrary to what the open-source licensing says at the Open Source Initiative. I'll pause there for a while because I threw a lot of stuff at you.
Bryan: No, no, no. That's a lot to unpack here, right? And there's a term I learned last week, and it's called AI washing. And that's where people try to bandy the terms, throw it together. It ends up representing something it's not. But that's fascinating. I think you talked a little bit about being able to see what's behind the AI. And I think that's kind of part of those 11 criteria that you talked about. I think auditability, transparency would be kind of one of those things. I think we're beginning to go into some of the challenges, kind of pitfalls that we need to look out for. But I'm going to just put a pause on that and I'm going to ask Howard to jump in with some questions on his phone. I think he's got some interesting questions for you also.
Howard: Yeah, thank you, Bryan. So, Harish, you spoke about the Open Source Initiative, which we're very familiar with, and particularly the kind of guardrails that they're putting around how open source should be applied to AI systems. You've got a separate foundation. What's your view on where open source should feature in AI systems?
Harish: It's exactly the same as what OSI says. We are making no difference because the moment you make a distinction, then you bifurcate or you completely fragment the entire industry. You need to have a single perspective and a perspective that everybody buys into. It is a hard sell currently because not everybody agrees to the various components inside there, but there is good reasoning for some of the challenges. But at the same time, if that conversation doesn't happen, we have a problem. But from the AI Verify Foundation's perspective, it is our code that we make. Our code, interestingly, is not an AI tool. It is a testing tool. It is written purely to test AI solutions. And it's on an Apache license. This is a no-brainer type of licensing perspective. It's not an AI solution in and of itself. It just takes an input, runs it through the tests, and spits out an output, and Mr. Developer, take that and do what you want with it.
Howard: Yeah, thank you for that. And what about your view on open source training data? I mean, that is really a bone of contention.
Harish: That is really where the problem comes in because I think we do have some open source training data, like the Common Crawl data and a whole slew of different components there. So as long as you stick to those that have been publicly available and you then train your models based on that, or you take models that were trained based on that, I think we don't have any contention or any issue at the end of the day. You do whatever you want with it. The challenge happens when you mix the training data, whether it was originally Common Crawl or any of the, you know, Creative Commons-licensed content, and you mix it with non-licensed stuff or stuff licensed under proprietary terms, with no permission, and you mix it up, then we have a problem. And this is actually an issue that we have to collectively come to an agreement as to how to handle. Now, should it be done on a two-tier basis? Should it be done with different nuances behind it? This is still a discussion that is ongoing, constantly ongoing. And OSI is taking the mother lode of the weight to make this happen. And it's not an easy conversation to have because there's many perspectives.