The confluence of sensing, communication and computing technologies is allowing capture of, and access to, data in diverse forms and modalities, in ways that were unimaginable even a few years ago. These include data that afford the analysis and interpretation of multimodal cues of verbal and non-verbal human behavior to facilitate human behavioral research and its translational applications. Such cues carry crucial information not only about a person’s intent, identity and traits but also about underlying attitudes and emotions. Automatically capturing these cues, although vastly challenging, offers the promise not just of efficient data processing but of tools for discovery that enable hitherto unimagined scientific insights, and of means for supporting diagnostics and interventions. Recent computational approaches that leverage judicious use of both data and knowledge have yielded significant advances in this regard, for example in deriving rich, context-aware information from multimodal signal sources including human speech, language, and videos of behavior. These sources can be further complemented and integrated with data about human brain and body physiology. This talk will focus on some of the advances and challenges in gathering such data and creating algorithms for machine processing of such cues. It will highlight some of our ongoing efforts in Behavioral Signal Processing (BSP)—technology and algorithms for quantitatively and objectively understanding typical, atypical and distressed human behavior—with a specific focus on communicative, affective and social behavior. The talk will illustrate Behavioral Informatics applications of these techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion. Examples will be drawn from mental health and well-being realms such as autism spectrum disorders, couple therapy, depression and addiction counseling.

Shrikanth (Shri) Narayanan is the Andrew J. Viterbi Professor of Engineering at the University of Southern California, where he is Professor of Electrical Engineering, with joint appointments in Computer Science, Linguistics, Psychology, Neuroscience and Pediatrics, and Director of the Ming Hsieh Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the Acoustical Society of America, the IEEE, and the American Association for the Advancement of Science (AAAS). He is Editor-in-Chief of the IEEE Journal of Selected Topics in Signal Processing, an Editor of the Computer Speech and Language journal, and an Associate Editor of the IEEE Transactions on Affective Computing, the Journal of the Acoustical Society of America, and the APSIPA Transactions on Signal and Information Processing, having previously served as an Associate Editor of the IEEE Transactions on Speech and Audio Processing (2000-2004), the IEEE Signal Processing Magazine (2005-2008) and the IEEE Transactions on Multimedia (2008-2012). He is a recipient of several honors, including the 2015 Engineers Council Distinguished Educator Award, the 2005 and 2009 Best Transactions Paper awards from the IEEE Signal Processing Society, appointment as that society's Distinguished Lecturer for 2010-11, and appointment as an ISCA Distinguished Lecturer for 2015-16.
With his students, he has received a number of best paper awards, including a 2014 Ten-Year Technical Impact Award from ACM ICMI and winning entries in the Interspeech Challenges of 2009 (Emotion classification), 2011 (Speaker state classification), 2012 (Speaker trait classification), 2013 (Paralinguistics/Social Signals), 2014 (Paralinguistics/Cognitive Load) and 2015 (Non-nativeness detection). He has published over 650 papers and has been granted 17 U.S. patents.
Human-like Singing and Talking Machines: Flexible Speech Synthesis in Karaoke, Anime, Smart Phones, Video Games, Digital Signage, TV and Radio Programs

This talk will give an overview of the statistical approach to flexible speech synthesis. To construct human-like talking machines, speech synthesis systems are required to be able to generate speech in an arbitrary speaker's voice, with various speaking styles in different languages, with varying emphasis and focus, and/or with emotional expressions. The main advantage of the statistical approach is that such flexibility can easily be realized using mathematically well-defined algorithms. In this talk, the system architecture will be outlined, and then recent results and demos will be presented.

Keiichi Tokuda is a Professor in the Department of Computer Science at Nagoya Institute of Technology and is currently visiting Google on sabbatical. He is also an Honorary Professor at the University of Edinburgh. He was an Invited Researcher at the National Institute of Information and Communications Technology (NICT), formerly the ATR Spoken Language Communication Research Laboratories, Kyoto, Japan, from 2000 to 2013, and was a Visiting Researcher at Carnegie Mellon University from 2001 to 2002. He has been working on statistical parametric speech synthesis since he proposed an algorithm for speech parameter generation from HMMs in 1995. He has received six paper awards and two achievement awards. He is an IEEE Fellow and an ISCA Fellow.
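To make the parameter-generation idea mentioned in the abstract concrete, the following is a minimal sketch, not the author's implementation, of maximum-likelihood generation of a static feature trajectory from per-frame HMM statistics, assuming stacked static and delta features and diagonal covariances:

```python
# A minimal sketch (assumed setup, not the original implementation) of
# ML speech parameter generation from HMM statistics, where each frame's
# observation stacks a static feature and its delta, o_t = [c_t, delta_c_t].
import numpy as np

def mlpg(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static trajectory c.

    means, variances: arrays of shape (T, 2) with per-frame statistics for
    the static and delta streams (diagonal covariances assumed).
    """
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row: o_t[0] = c_t
        for k, w in enumerate(delta_win):      # delta row: weighted neighbours
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)         # Sigma^{-1} as a diagonal
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)               # smooth static trajectory

# Example: a step in the static means is smoothed by the delta constraints
means = np.column_stack([np.r_[np.zeros(5), np.ones(5)], np.zeros(10)])
variances = np.ones((10, 2))
print(mlpg(means, variances))
```

The key design point, as in the statistical approach described above, is that smoothness arises from the mathematics (the dynamic-feature constraints in the likelihood) rather than from hand-written rules.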
Clarification in spoken dialogue systems, such as those in mobile applications, often consists of simple requests to “Please repeat” or “Please rephrase” when the system fails to understand a word or phrase. However, human-human dialogues rarely include such questions. When humans ask for clarification of input such as “I want to travel on XXX”, they typically use targeted clarification questions, such as “When do you want to travel?” However, systems frequently make mistakes when they try to behave more like humans, sometimes asking inappropriate clarification questions. We present research on more human-like clarification behavior based on a series of crowd-sourcing experiments whose results have been implemented in a speech-to-speech translation system. We also describe strategies for detecting when our system has asked the ‘wrong’ question of a user, based upon features of the user’s response.

Julia Hirschberg is the Percy K. and Vida L. W. Hudson Professor of Computer Science and Chair of the Computer Science Department at Columbia University. She worked at Bell Laboratories and AT&T Laboratories -- Research from 1985 to 2003 as a Member of Technical Staff and a Department Head, creating the Human-Computer Interface Research Department in 1994. She served as editor-in-chief of Computational Linguistics from 1993 to 2003 and co-editor-in-chief of Speech Communication from 2003 to 2006. She served on the Executive Board of the Association for Computational Linguistics (ACL) from 1993 to 2003, has served on the Permanent Council of the International Conference on Spoken Language Processing (ICSLP) since 1996, and served on the board of the International Speech Communication Association (ISCA) from 1999 to 2007 (as President 2005-2007); she also served on the CRA Executive Board (2013-14). She now serves on the IEEE Speech and Language Processing Technical Committee, the Association for the Advancement of Artificial Intelligence (AAAI) Council, the Executive Board of the North American ACL, and the board of the CRA-W. She has been an AAAI Fellow since 1994, an ISCA Fellow since 2008, and a (founding) ACL Fellow since 2011, and was elected to the American Philosophical Society in 2014. She is a winner of the IEEE James L. Flanagan Speech and Audio Processing Award (2011) and the ISCA Medal for Scientific Achievement (2011).
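The idea of detecting a ‘wrong’ clarification question from the user's next turn can be illustrated with a toy classifier. The features below (corrective cue words, overlap with the question, reply length) are hypothetical placeholders, not the features of the actual system described above:

```python
# A hypothetical sketch of flagging a mis-targeted clarification question
# from properties of the user's reply. Feature set and data are illustrative.
from sklearn.linear_model import LogisticRegression

CUE_WORDS = {"no", "i", "said", "meant", "not"}   # assumed corrective cues

def response_features(system_question, user_reply):
    q, r = system_question.lower().split(), user_reply.lower().split()
    return [
        sum(w in CUE_WORDS for w in r),           # corrective cue count
        len(set(q) & set(r)) / max(len(r), 1),    # overlap with the question
        len(r),                                   # reply length in words
    ]

# Toy training data: 1 = the clarification question was "wrong"
pairs = [
    ("when do you want to travel?", "no i said to dallas", 1),
    ("when do you want to travel?", "next tuesday", 0),
    ("where do you want to go?", "i meant next tuesday not the city", 1),
    ("where do you want to go?", "to dallas", 0),
]
X = [response_features(q, r) for q, r, _ in pairs]
y = [label for _, _, label in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict([response_features("when do you want to travel?",
                                     "no i meant from boston")]))
```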
Automatically extracting social meaning from language is one of the most exciting challenges in natural language understanding. In this talk I’ll summarize a number of recent results using the tools of natural language processing to help extract and understand social meaning from texts of different sorts. We’ll explore the relationship between language, economics and social psychology in the automatic processing of the language of restaurant menus and reviews. I’ll also show how natural language processing can help model different aspects of the spread of innovation through communities: how interdisciplinarity plays a crucial role in the spread of scientific innovation, and how the spread of linguistic innovation is intricately tied up with people’s lifecycles in online communities.

Dan Jurafsky is Professor and Chair of Linguistics, and Professor of Computer Science, at Stanford University. He is the co-author of the widely used textbook “Speech and Language Processing”, co-created one of the first massive open online courses, Stanford’s course in Natural Language Processing, and is the recipient of a 2002 MacArthur Fellowship. His trade book “The Language of Food” comes out in September 2014. His research focuses on computational linguistics and its application to the social and behavioral sciences.
What effect does language have on people, and what effect do people have on language? You might say in response, "Who are you to discuss these problems?" and you would be right to do so; these are Major Questions that science has been tackling for many years. But as a field, I think natural language processing and computational linguistics have much to contribute to the conversation, and I hope to encourage the community to further address these issues. To this end, I'll describe two efforts I've been involved in.

The first project provides evidence that in group discussions, power differentials between participants are subtly revealed by how much one individual immediately echoes the linguistic style of the person they are responding to. We consider multiple types of power: status differences (which are relatively static) and dependence (a more "situational" relationship). Using a precise probabilistic formulation of the notion of linguistic coordination, we study how conversational behavior can reveal power relationships in two very different settings: discussions among Wikipedians and arguments before the U.S. Supreme Court.

Our second project is motivated by the question of what information achieves widespread public awareness. We consider whether, and how, the way in which the information is phrased (the choice of words and sentence structure) can affect this process. We introduce an experimental paradigm that seeks to separate contextual from language effects, using movie quotes as our test case. We find that there are significant differences between memorable and non-memorable quotes in several key dimensions, even after controlling for situational and contextual factors. One example is lexical distinctiveness: in aggregate, memorable quotes use less common word choices (as measured by statistical language models), but at the same time are built upon a scaffolding of common syntactic patterns. Joint work with Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jon Kleinberg, and Bo Pang.

Lillian Lee is a professor of computer science at Cornell University. Her research interests include natural language processing, information retrieval, and machine learning. She is the recipient of the inaugural Best Paper Award at HLT-NAACL 2004 (joint with Regina Barzilay), a citation in "Top Picks: Technology Research Advances of 2004" by Technology Research News (also joint with Regina Barzilay), and an Alfred P. Sloan Research Fellowship; and in 2013, she was named a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI). Her group's work has received several mentions in the popular press, including The New York Times, NPR's All Things Considered, and NBC's The Today Show.
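The probabilistic formulation of coordination mentioned above can be sketched in miniature: for a class of linguistic markers, compare how often speaker b uses the marker when replying to an utterance of a that used it against b's baseline rate in replies. The marker word lists below are illustrative placeholders, not the categories used in the actual study:

```python
# A minimal sketch of a linguistic coordination measure: how much more
# likely b is to use marker class m when a's preceding utterance used m,
# relative to b's overall rate of using m in replies. Marker lists are
# illustrative only.
MARKERS = {
    "article": {"a", "an", "the"},
    "pronoun": {"i", "you", "we", "they", "he", "she", "it"},
}

def uses(utterance, marker_words):
    return any(w in marker_words for w in utterance.lower().split())

def coordination(exchanges, marker):
    """exchanges: list of (utterance_by_a, reply_by_b) pairs."""
    words = MARKERS[marker]
    replies_to_marked = [b for a, b in exchanges if uses(a, words)]
    all_replies = [b for _, b in exchanges]
    if not replies_to_marked:
        return 0.0
    p_given = sum(uses(b, words) for b in replies_to_marked) / len(replies_to_marked)
    p_base = sum(uses(b, words) for b in all_replies) / len(all_replies)
    return p_given - p_base        # positive value: b coordinates toward a

exchanges = [
    ("the court has a question", "the answer is yes"),
    ("we disagree", "it depends"),
    ("a narrow ruling", "the record supports it"),
]
print(coordination(exchanges, "article"))
```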
As speech recognition continues to improve, new applications of the technology have been enabled. It is now common to search for information and send accurate short messages by speaking into a cellphone, something completely impractical just a few years ago. Another application that has recently been gaining attention is "Spoken Term Detection": using speech recognition technology to locate key words or phrases of interest in running speech of variable quality. Spoken Term Detection can be used to issue real-time alerts, rapidly identify multimedia clips of interesting content, and, when combined with search technology, even provide real-time commentary during broadcasts and meetings. This talk will describe the basics of Spoken Term Detection systems, including recent advances in core speech recognition technology, performance metrics, how out-of-vocabulary queries are handled, and ways of using score normalization and system combination to dramatically improve system performance.

Michael Picheny is the Senior Manager of the Speech and Language Algorithms Group at the IBM TJ Watson Research Center. Michael has worked in the speech recognition area since 1981, joining IBM after finishing his doctorate at MIT. He has been heavily involved in the development of almost all of IBM's recognition systems, ranging from the world's first real-time large vocabulary discrete system through IBM's product lines for telephony and embedded systems. He has published numerous papers in both journals and conferences on almost all aspects of speech recognition. He has received several awards from IBM for his work, including a corporate award, three Outstanding Technical Achievement Awards and two Research Division Awards. He is the co-holder of over 30 patents and was named a Master Inventor by IBM in 1995 and again in 2000. Michael served as an Associate Editor of the IEEE Transactions on Acoustics, Speech, and Signal Processing from 1986 to 1989, was the chairman of the Speech Technical Committee of the IEEE Signal Processing Society from 2002 to 2004, and is a Fellow of the IEEE. He served as an Adjunct Professor in the Electrical Engineering Department of Columbia University in 2009 and co-taught a course in speech recognition. He recently completed an eight-year term of service on the board of ISCA (International Speech Communication Association). Most recently, he was co-general chair of the IEEE ASRU 2011 Workshop in Hawaii.
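As a back-of-the-envelope illustration of the kind of performance metric mentioned above, the sketch below computes a term-weighted value, assuming the standard formulation (1 minus the keyword-averaged sum of miss rate and beta-weighted false-alarm rate, with non-target trials counted roughly per second of audio); the numbers are made up:

```python
# A minimal sketch of a term-weighted value (TWV) style metric for spoken
# term detection, under the assumptions stated in the lead-in. Data are toy.
def term_weighted_value(results, audio_seconds, beta=999.9):
    """results: dict keyword -> (n_correct, n_false_alarms, n_true_occurrences)."""
    per_keyword = []
    for n_correct, n_fa, n_true in results.values():
        if n_true == 0:
            continue                            # keywords absent from the reference are skipped
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (audio_seconds - n_true)  # non-target trials: roughly one per second
        per_keyword.append(p_miss + beta * p_fa)
    return 1.0 - sum(per_keyword) / len(per_keyword)

results = {
    "alert": (8, 1, 10),      # 8 of 10 occurrences found, 1 false alarm
    "meeting": (4, 0, 5),
}
print(round(term_weighted_value(results, audio_seconds=36000), 3))
```

The heavy false-alarm weight is what makes keyword-specific score normalization and system combination, discussed in the talk, so consequential in practice.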
In contrast to traditional rule-based approaches to building spoken dialogue systems, recent research has shown that it is possible to implement all of the required functionality using statistical models trained with a combination of supervised learning and reinforcement learning. This new approach to spoken dialogue is based on the mathematics of partially observable Markov decision processes (POMDPs), in which user inputs are treated as noisy observations of an underlying hidden dialogue state, and system responses are determined by a policy that maps the system's belief over that state into actions. Virtually all current spoken dialogue systems are designed to operate in a specific, carefully defined domain such as restaurant information, appointment booking or product installation support. However, if voice is to become a significant input modality for accessing web-based information and services, then techniques will be needed to enable spoken dialogue systems to operate within open domains. The first part of the talk will briefly review the basic ideas of POMDP dialogue systems as currently applied to closed domains. Unlike many other areas of machine learning, spoken dialogue systems always have a user on hand to provide supervision. Based on this idea, the second part of the talk describes a number of techniques by which implicit user supervision can allow a spoken dialogue system to adapt on-line to extended domains.

Steve Young is Professor of Information Engineering and Senior Pro-Vice-Chancellor at Cambridge University. His main research interests lie in the area of spoken language systems, including speech recognition, speech synthesis and dialogue management. He is the inventor and original author of the HTK Toolkit for building hidden Markov model-based recognition systems, and he co-developed the HTK large vocabulary speech recognition system. More recently he has worked on statistical dialogue systems and pioneered the use of partially observable Markov decision processes for modelling them. He is a Fellow of the Royal Academy of Engineering, the International Speech Communication Association, the Institution of Engineering and Technology, and the Institute of Electrical and Electronics Engineers. In 2004 he received an IEEE Signal Processing Society Technical Achievement Award; in 2010 he received the ISCA Medal for Scientific Achievement; and in 2013 he received the European Signal Processing Society Individual Technical Achievement Award.
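The core belief-tracking step that underlies the POMDP framework described above can be sketched as follows; the tiny two-state restaurant-domain example and the probabilities are purely illustrative, not taken from any actual system:

```python
# A minimal sketch of a POMDP belief update for dialogue state tracking:
# new belief ~ observation likelihood * predicted belief, renormalized.
# State space and numbers are illustrative assumptions.
import numpy as np

states = ["wants_cheap", "wants_expensive"]
belief = np.array([0.5, 0.5])                        # initial uncertainty

# transition[s, s'] = P(s' | s, a): the user's goal rarely changes between turns
transition = np.array([[0.95, 0.05],
                       [0.05, 0.95]])

def update_belief(belief, obs_likelihood, transition):
    """obs_likelihood[s'] = P(o | s', a), e.g. derived from ASR/SLU confidences."""
    predicted = transition.T @ belief                # sum_s P(s'|s,a) b(s)
    unnormalized = obs_likelihood * predicted
    return unnormalized / unnormalized.sum()

# Noisy observation: the recognizer weakly favours "cheap"
belief = update_belief(belief, np.array([0.7, 0.3]), transition)
print(dict(zip(states, belief.round(2))))
```

In a full system the policy then maps this belief (rather than any single hypothesized state) to the next system action, which is what makes the approach robust to recognition errors.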