Thursday, September 15: KN3, OS3, Panel, Overview

Keynote Session 3 (KN3)

Thursday, September 15, 9:30 - 10:30

Chair: Keiichi Tokuda

In this talk, I will discuss our recent work on using neural networks for NLP and speech recognition tasks. Our work started with the sequence-to-sequence learning framework, which can read a variable-length input sequence and produce a variable-length output sequence. The framework allows neural networks to be applied to new tasks in the text and speech domains. I will describe implementation details and results on machine translation, dialogue modeling, and speech recognition. We also find that unsupervised learning is simple to incorporate in this framework and significantly improves the performance of our networks.
End-to-end Learning for Text and Speech
09:30-10:30 Quoc V. Le
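
The encoder-decoder control flow the abstract describes can be sketched in a few lines. This is a hypothetical toy with untrained random parameters, not the model from the talk; it only illustrates how a variable-length input is folded into a fixed state and a variable-length output is unrolled until an end-of-sequence token appears.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, EOS, MAX_LEN = 8, 16, 0, 12

# Randomly initialised toy parameters (a trained model would learn these).
E   = rng.normal(0, 0.1, (VOCAB, HIDDEN))   # token embeddings
W_e = rng.normal(0, 0.1, (HIDDEN, HIDDEN))  # encoder recurrence
W_d = rng.normal(0, 0.1, (HIDDEN, HIDDEN))  # decoder recurrence
W_o = rng.normal(0, 0.1, (HIDDEN, VOCAB))   # output projection

def encode(tokens):
    """Fold a variable-length token list into one fixed-size state."""
    h = np.zeros(HIDDEN)
    for t in tokens:
        h = np.tanh(E[t] + W_e @ h)
    return h

def decode(h):
    """Greedily emit tokens until EOS or a length cap."""
    out, t = [], EOS
    for _ in range(MAX_LEN):
        h = np.tanh(E[t] + W_d @ h)
        t = int(np.argmax(h @ W_o))
        if t == EOS:
            break
        out.append(t)
    return out

translation = decode(encode([3, 1, 4, 1, 5]))
```

Because encoding and decoding are decoupled through the fixed state, the same skeleton applies to translation, dialogue, and speech recognition, which is the generality the talk emphasises.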

Coffee Break

Oral Session 3: Analysis and Modeling for Speech Synthesis (OS3)

Thursday, September 15, 11:00 - 13:00

Chair: Oliver Watts

Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders are great for voiced speech because they offer independent control over the voice source (e.g. pitch) and the vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives, on the other hand, exhibit fundamentally different spectro-temporal behaviour, and here the benefits of the vocoder are less clear. In this paper, we investigate a hybrid approach to modelling the speech signal, where speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
A hybrid harmonics-and-bursts modelling approach to speech synthesis [bib]
11:00-11:30 Jonas Beskow, Harald Berthelsen
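
The paper's spectrogram kernel filtering is not specified in the abstract. As an illustrative stand-in, the sketch below separates a steady harmonic tone from a broadband burst using median filtering of the magnitude spectrogram (the classic harmonic/percussive heuristic, assumed here for illustration): harmonics persist along time, bursts spread along frequency.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def stft_mag(x, n_fft=256, hop=64):
    """Magnitude spectrogram, shape (time_frames, freq_bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.array([np.fft.rfft(f) for f in frames]))

def median_filt(a, size, axis):
    """Running median along one axis, edge-padded to keep the shape."""
    pad = [(0, 0)] * a.ndim
    pad[axis] = (size // 2, size - 1 - size // 2)
    padded = np.pad(a, pad, mode='edge')
    return np.median(sliding_window_view(padded, size, axis=axis), axis=-1)

def harmonic_burst_masks(mag, size=17):
    H = median_filt(mag, size, axis=0)  # smooth along time: harmonics persist
    B = median_filt(mag, size, axis=1)  # smooth along freq: bursts are broadband
    return H >= B, H < B                # complementary hard masks

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)   # steady harmonic tone
x[4000:4010] += 5.0               # a short click standing in for a burst
mag = stft_mag(x)
mask_h, mask_b = harmonic_burst_masks(mag)
```

In the paper's pipeline, the harmonic channel would then go through vocoding and statistical parameter generation while the burst channel is concatenated, and the two are mixed back together.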
The quality of the vocoder plays a crucial role in the performance of parametric speech synthesis systems. To improve vocoder quality, it is necessary to reconstruct as much of the perceived components of the speech signal as possible. In this paper, we first show that the noise component is currently not accurately modelled in the widely used STRAIGHT vocoder, limiting both the voice range that can be covered and the overall quality. To motivate a new, alternative approach to this issue, we present a new synthesizer that uses a uniform representation for voiced and unvoiced segments. This synthesizer also has the advantage of using a simpler signal model than other approaches, thus offering a convenient and controlled alternative for future developments. Experiments analysing the synthesis quality of the noise component show improved speech reconstruction using the suggested synthesizer compared to STRAIGHT. Additionally, an analysis/resynthesis experiment shows that the suggested synthesizer solves some of the issues of another uniform vocoder, Harmonic Model plus Phase Distortion (HMPD). In text-to-speech synthesis, it outperforms HMPD and exhibits quality similar, or only slightly inferior, to STRAIGHT's, which is encouraging for a new vocoding approach.
A Pulse Model in Log-domain for a Uniform Synthesizer [bib]
11:30-12:00 Gilles Degottex, Pierre Lanchantin, Mark Gales
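
The core idea of a uniform voiced/unvoiced representation can be illustrated with a toy excitation generator (a simplified sketch, not the pulse-model vocoder of the paper): one signal path carries a noise floor everywhere, and pitch pulses are added only where F0 is defined, so there is no hard switch between two separate signal models.

```python
import numpy as np

def uniform_excitation(f0_frames, fs=16000, hop=80, noise_level=0.1, seed=0):
    """Pulse-plus-noise excitation with a single representation for
    voiced (f0 > 0) and unvoiced (f0 == 0) frames."""
    rng = np.random.default_rng(seed)
    f0 = np.repeat(np.asarray(f0_frames, dtype=float), hop)
    exc = noise_level * rng.normal(size=f0.size)   # noise floor everywhere
    phase = 0.0
    for i, f in enumerate(f0):
        if f > 0:                                  # voiced: accumulate phase
            phase += f / fs
            if phase >= 1.0:                       # phase wrap -> emit a pulse
                phase -= 1.0
                exc[i] += 1.0
    return exc

voiced = uniform_excitation([100.0] * 20)    # 100 Hz over 0.1 s: ~10 pulses
unvoiced = uniform_excitation([0.0] * 20)    # noise only
```

A real vocoder would additionally shape this excitation with a spectral envelope (in the paper, represented in the log domain), but the uniformity of the excitation path is the point of the sketch.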
This paper introduces a general and flexible framework for F0 and aperiodicity (additive non-periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: an instantaneous frequency estimator with an initial aperiodicity detector, an F0 trajectory tracker, and an F0 refinement and aperiodicity extractor. A preliminary implementation of the proposed framework substantially outperformed existing F0 extractors (by a factor of 10 in terms of RMS F0 estimation error) in its ability to track temporally varying F0 trajectories. The front-end aperiodicity detector consists of a complex-valued wavelet analysis filter with a highly selective temporal and spectral envelope, and uses a new measure that quantifies the deviation from periodicity. The measure is less sensitive to slow FM and AM and closely correlates with the signal-to-noise ratio. The front end combines instantaneous frequency information over a set of filter outputs using this measure to yield an observation probability map. The second stage generates the initial F0 trajectory from this map and signal power information. The final stage uses the deviation measure of each harmonic component and F0-adaptive time warping to refine the F0 and aperiodicity estimates. The proposed framework is flexible enough to integrate other sources of instantaneous frequency when they provide relevant information.
Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis [bib]
12:00-12:30 Hideki Kawahara, Yannis Agiomyrgiannakis, Heiga Zen
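
For context on what such a framework improves upon, here is a crude autocorrelation F0 estimator of the kind often used as a baseline (a generic textbook method, not the instantaneous-frequency approach of the paper): it picks the lag in a plausible pitch range where the signal best matches a shifted copy of itself.

```python
import numpy as np

def f0_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame by peak-picking the autocorrelation
    within the lag range corresponding to [fmin, fmax]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 8000
t = np.arange(2048) / fs
frame = np.sin(2 * np.pi * 220.0 * t)   # a 220 Hz test tone
f0 = f0_autocorr(frame, fs)
```

Note the lag quantisation: the estimate is snapped to an integer number of samples, which is one reason frame-based extractors struggle with rapidly varying F0, the regime where the paper reports its factor-of-10 improvement.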
Speech sinusoidal modeling has been successfully applied to a broad range of speech analysis, synthesis and modification tasks. However, developing a high-fidelity full-band sinusoidal model that preserves its high quality under speech transformation remains an open research problem. Such a system can be extremely useful for high-quality speech synthesis. In this paper we present an enhanced harmonic model representation for voiced/mixed wideband speech that is capable of high-quality speech reconstruction and transformation in the parametric domain. Two key elements of the proposed model are a proper phase alignment and a decomposition of a speech frame into "deterministic" and dense "stochastic" harmonic model representations that can be manipulated separately. The coupling of the stochastic harmonic representation with the deterministic one is performed by means of an intra-frame periodic energy envelope, estimated at analysis time and preserved during original/transformed speech reconstruction. In addition, we present a compact representation of the stochastic harmonic component, so that the proposed model has fewer parameters than the regular full-band harmonic model while achieving better signal-to-reconstruction-error performance. Moreover, the improved phase alignment of the proposed model provides better phase coherency in transformed speech, resulting in better quality of speech transformations. We demonstrate the subjective and objective performance of the new model on speech reconstruction and pitch modification tasks. Performance of the proposed model within unit selection TTS is also presented.
Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis [bib]
12:30-13:00 Slava Shechtman, Alex Sorin
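
The deterministic part of a harmonic model, and why pitch modification is easy in the parametric domain, can be shown in a minimal sketch (assumed parameter values for illustration; the paper's stochastic component and energy-envelope coupling are omitted): a frame is a sum of sinusoids at multiples of F0, so changing F0 while keeping the harmonic amplitudes and aligned phases resynthesises the same timbre at a new pitch.

```python
import numpy as np

def harmonic_synth(f0, amps, phases, dur, fs=16000):
    """Reconstruct a frame as a sum of harmonics k*f0 with given
    per-harmonic amplitudes and phases."""
    t = np.arange(int(dur * fs)) / fs
    x = np.zeros_like(t)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        x += a * np.cos(2 * np.pi * k * f0 * t + p)
    return x

# Zero phases align all harmonics at t = 0 (a crude phase alignment).
amps, phases = [1.0, 0.5, 0.25], [0.0, 0.0, 0.0]
x_orig = harmonic_synth(200.0, amps, phases, dur=0.05)  # original pitch
x_up = harmonic_synth(300.0, amps, phases, dur=0.05)    # pitch-shifted
```

Without a consistent phase alignment across frames, such modifications smear the waveform shape, which is the phase-coherency problem the paper addresses.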

Lunch Break


Panel Session

Thursday, September 15, 15:30 - 16:30

Panelists: Peter Cahill, Ingmar Steiner, Mirjam Wester, Junichi Yamagishi, Heiga Zen

Coffee Break & Closing


International Speech Communication Association.

SynSIG: promoting the study of Speech Synthesis