This proposal brings together research communities from cognitive neuroscience and from automatic speech recognition to explore common interests in spoken language processing, potentially leading to a deeper scientific understanding of human speech and language function and to improved techniques for speech recognition by machine.  

The scientific study of speech and language is being transformed by new techniques for imaging the activities of the intact human brain, and by inputs from neurobiological sources – in particular neurophysiological and neuroanatomical research into auditory processing structures and pathways in non-human primates.

These developments are likely to lead to major breakthroughs over the next two decades in our scientific understanding of these crucial human capacities, and offer the potential for a detailed neuroscientific analysis of a major natural cognitive system. A key enabling component of these potential breakthroughs will be the ability to build detailed computational models of speech and language processing systems.

At the same time, over two decades of quite independent development, research into Automatic Speech Recognition (ASR), primarily within the Hidden Markov Model (HMM) framework, has led to the availability of commercially successful speech recognition systems able to work effectively in limited domains. These advances are built on powerful techniques for statistical modelling of the language system and of the complex acoustic information that distinguishes among the core set of speech sounds (phones) necessary for word identification. But despite these achievements, current ASR technology is still fragile in noisy environments, cannot easily adapt to new dialects and still struggles to handle continuous spontaneous speech, where language is used in its natural communicative context. These continuing difficulties suggest it may be timely to revisit neuro-biologically based solutions to robust real-time speech comprehension.

 
 

We propose to re-examine the relationship between research into human speech and language and research in ASR – in particular in the domain of Continuous Speech Recognition (CSR). Our overall goal (open question, grand challenge) is to construct a neuro-biologically realistic, computationally specific account of the human speech and language system. This will not only address fundamental scientific issues in the science of cognitive systems, but will also help to inform the development of future ASR systems.

Several important areas will need to be explored to achieve this goal, and fruitful two-way traffic between the two communities can be expected in each of them. One set of critical issues concerns the comparative functional architecture of the natural and artificial systems. How does the human system organise itself, as a neuro-biological system, to integrate top-down and bottom-up information as it synthesises a successful analysis of the speech stream? Is higher-level feedback used to optimise feature coding of the acoustic waveform, and does this have implications for machine recognition? A related issue is whether the strictly phone-based organisation of current ASR systems, where the primary goal is to map from acoustic information to a small set of phones, has any direct correspondence in the human system.

Another set of issues concerns the relationship between HMM-based processes and neuro-biological processes. To what extent, for example, are the characteristics of human processors determined by the statistical properties of the speech input? If, as seems likely, learning from statistical regularity is important to the human system, then are these regularities being abstracted and represented in ways that are comparable to current ASR/CSR systems? If they are not comparable, then what sort of statistical models would be needed to model the human system and how can they be implemented?

These are just a few examples of possible points of contact between the two fields. Our goal over the next three months, through a series of joint workshops and discussions, is to refine and expand this list, and to develop a more articulated definition of a Foresight Grand Challenge in the domain of speech and language. An initial workshop is already being scheduled.

 

From the life sciences perspective, a neuro-cognitive account of human language function is one of the most important and exciting scientific challenges for the next two decades. Advances in this domain will have implications not only for our understanding of the neural bases of cognition in general, but also for a host of language-related applications. The most important of these is likely to be the treatment and remediation of disorders of language function, both developmental and acquired.

From the engineering viewpoint, it is clear that despite substantial progress over the last few years, human speech recognition performance still greatly exceeds that achievable by machine. A deeper understanding of how human speech understanding works should make a significant contribution to closing that gap. In the long term this is vital if we are ever to build truly natural human-computer interfaces.

 
The following people have agreed to develop this proposal further:
 
Professor William Marslen-Wilson
MRC Cognition and Brain Sciences Unit, Cambridge
Professor Steve Young
Cambridge University Engineering Department
Dr Roy Patterson
Centre for the Neural Basis of Hearing, Department of Physiology, University of Cambridge
Professor Lorraine Tyler
Centre for Speech and Language, Department of Experimental Psychology, University of Cambridge

 
