ICASSP 2020 Papers on Speech

ICASSP 2020 papers grouped by category. Each table lists the paper, its topic, and a one-line idea; in the Topic column, *** (or ***SP) means the category repeats from the row above.

| Paper | Topic | Idea |
| --- | --- | --- |
| Cross-lingual transfer learning for zero-resource domain adaptation | e2e Acoustic Modeling | Shares several DNN layers across languages in multilingual acoustic modeling; transfer learning. |
| Deja-vu: Double Feature Presentation and Iterated Loss in Deep Transformer Networks | *** | Re-presents the input features at intermediate layers via attention; adds an objective function at each tapped layer. |
| Frame-Level MMI as a Sequence Discriminative Training Criterion for LVCSR | *** | Frame-level MMI for robustness; since the evaluation criterion does not care about frame-wise decisions, sequence-discriminative training is used. |
| Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning | *** | Semi-supervised teacher-student (TS) training and pre-training; TS is used to train the spatial filter and front-end feature extraction; multi-channel; knowledge distillation. |
| G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR | *** | Uses G2G and TTS to generate alternative pronunciations that improve proper-noun recognition in grapheme-level hybrid ASR. |
| Robust Multi-channel Speech Recognition using Frequency Aligned Network | *** | Trains a spatial-filtering layer jointly within the acoustic model; a frequency-aligned network keeps one frequency bin from influencing others, for robustness; multi-channel. |
| SNDCNN: Self-normalizing deep CNNs with scaled exponential linear units for speech recognition | *** | Self-normalizing CNNs with SELU; removes skip connections and batch normalization; accelerates inference. |
| SpecAugment on Large Scale Datasets | *** | Demonstrates why SpecAugment works at scale; time-mask size depends on utterance length (see the masking sketch after this table). |
| Transformer-based Acoustic Modeling for Hybrid Speech Recognition | *** | Evaluates transformer-based acoustic models (AMs) for hybrid speech recognition. |
| Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction | *** | Masked reconstruction loss; pre-trains the AM encoder on a much larger amount of unlabeled data than labeled data; domain adaptation. |
| A Comprehensive Study of Residual CNNs for Acoustic Modeling in ASR | *** | Residual CNNs for LVCSR that allow online streaming; SpecAugment to overcome overfitting. |
| CGCNN: Complex Gabor Convolutional Neural Network on raw speech | *** | Complex Gabor filters replace the usual CNN filters in a complex-valued network, learning acoustic features from raw speech rather than handcrafting them; exploits time-frequency resolution and the complex domain. |
| DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition | *** | A deep feed-forward sequential memory network with self-attention (DFSMN-SAN) outperforms a vanilla self-attention network; persistent memory mechanism. |
| Effectiveness of self-supervised pre-training for speech recognition | *** | Fine-tunes BERT-style pre-trained models with CTC; 10 h of labeled data with a vq-wav2vec vocabulary is as good as 100 h of labeled data. |
| Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation | *** | Time perturbation in the frequency domain and sub-sequence sampling as augmentation for S2S ASR training. |
| Layer-Normalized LSTM for Hybrid-HMM and End-to-End ASR | *** | Applies layer normalization to different parts of an LSTM for hybrid and e2e ASR. |
| Libri-Light: A Benchmark for ASR with Limited or No Supervision | *** | A new collection of spoken English audio for training ASR systems under limited or no supervision; at 60k hours, the largest speech corpus. |
| Small energy masking for improved neural network training for end-to-end speech recognition | *** | Masks a time-frequency bin if its filterbank energy is below a threshold drawn from a uniform distribution; 11.2%/13.5% relative WER improvement on LibriSpeech (see the masking sketch after this table). |

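The two masking rows above (SpecAugment on large-scale data, small energy masking) boil down to a few lines of array manipulation. Below is a minimal NumPy sketch of both ideas; the parameter names and ranges (mask_ratio, num_masks, the dB range of the energy threshold) are illustrative assumptions, not the papers' exact settings.

```python
import numpy as np

def adaptive_time_mask(fbank, mask_ratio=0.05, num_masks=2, rng=np.random):
    """SpecAugment-style time masking where the maximum mask width scales
    with utterance length (the adaptive variant discussed above).
    fbank: (frames, bins) log-mel features; a modified copy is returned."""
    frames = fbank.shape[0]
    max_width = max(1, int(mask_ratio * frames))   # width grows with length
    out = fbank.copy()
    for _ in range(num_masks):
        width = rng.randint(1, max_width + 1)
        start = rng.randint(0, max(1, frames - width))
        out[start:start + width, :] = 0.0
    return out

def small_energy_mask(fbank, low_db=-80.0, high_db=-40.0, rng=np.random):
    """Small energy masking: zero out time-frequency bins whose energy is
    below a per-utterance threshold sampled from a uniform distribution.
    The dB range here is an assumption for illustration."""
    threshold = rng.uniform(low_db, high_db)
    out = fbank.copy()
    out[fbank < threshold] = 0.0
    return out
```
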
| Paper | Topic | Idea |
| --- | --- | --- |
| Attention-based ASR with Lightweight and Dynamic Convolutions | e2e ASR General Topics | Lightweight and dynamic convolutions as an alternative to self-attention, making the computational cost linear in sequence length; joint CTC training (see the sketch after this table). |
| Correction of Automatic Speech Recognition with Transformer Sequence-to-sequence Model | *** | A Transformer seq2seq model makes ASR output grammatical and semantically consistent; an ASR correction model with data augmentation and pre-training; outperforms 6-gram LM rescoring and Transformer-XL LM rescoring in WER. |
| Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems | ***SP | TTS trained on ASR corpora generates audio to extend a SOTA ASR system; text-only data is useful; 33% relative improvement on the low-resource LibriSpeech 100 h setup. |
| Independent language modeling architecture for end-to-end ASR | ***SP | Separates the decoder from the encoder output so it becomes an independent LM that can be trained on external text; applicable to low-resource ASR. |
| Self-Training for End-to-End Speech Recognition | *** | Shows that training with pseudo-labels substantially improves a baseline model; model ensembles for diversity; semi-supervised; strong results. |

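For the lightweight/dynamic-convolution row above, here is a minimal PyTorch sketch of the core lightweight-convolution operation: a softmax-normalized kernel shared across the channels of each head, slid over time, which makes the cost linear in sequence length. The GLU and input/output projections of the full model are omitted, and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, weight):
    """Lightweight convolution: x is (batch, channels, time) and weight is
    (heads, kernel) with channels divisible by heads. Each head's kernel is
    softmax-normalized over the window and shared by that head's channels."""
    B, C, T = x.shape
    H, K = weight.shape
    w = F.softmax(weight, dim=-1)                   # normalize over the window
    x = F.pad(x, (K // 2, K - 1 - K // 2))          # 'same' padding in time
    patches = x.unfold(-1, K, 1)                    # (B, C, T, K) sliding windows
    patches = patches.reshape(B, H, C // H, T, K)
    return (patches * w.view(1, H, 1, 1, K)).sum(-1).reshape(B, C, T)

x = torch.randn(2, 256, 100)                        # (batch, channels, time)
y = lightweight_conv(x, torch.randn(4, 31))         # -> (2, 256, 100)
```
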
| Paper | Topic | Idea |
| --- | --- | --- |
| End-to-End Multi-speaker Speech Recognition with Transformer | New Models | Transformer for multi-speaker and multi-channel ASR; self-attention is restricted to within one segment; strong results. |
| Joint Phoneme-Grapheme Model for End-to-End Speech Recognition | ***SP | Shares encoder layers between signal-to-phoneme and signal-to-grapheme tasks; multi-task learning; a joint model based on iterative refinement. |
| Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer | *** | A low-rank transformer reduces the parameter count and speeds up training and inference (see the sketches after this table). |
| QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | *** | Acoustic modeling from blocks of 1D time-channel separable convolution + BN + ReLU; fine-tunes effectively on new datasets (see the sketches after this table). |

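The last two rows both cut model cost with factorized operations. Two minimal PyTorch sketches follow, with sizes and kernel widths chosen for illustration rather than taken from the papers: a rank-r factorized linear layer (the kind of bottleneck a low-rank transformer substitutes for full-rank projections), and a QuartzNet-style time-channel separable convolution block.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Factorizes a d_in x d_out weight through a rank-r bottleneck,
    cutting parameters from d_in*d_out to roughly r*(d_in + d_out)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out)

    def forward(self, x):
        return self.up(self.down(x))

class TCSConvBlock(nn.Module):
    """QuartzNet-style block: a depthwise conv mixes information over time
    within each channel, a pointwise (1x1) conv mixes across channels,
    followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch, kernel=33):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel,
                                   padding=kernel // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):                           # x: (batch, channels, time)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

y = TCSConvBlock(64, 128)(torch.randn(8, 64, 200))  # -> (8, 128, 200)
```
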
| Paper | Topic | Idea |
| --- | --- | --- |
| A practical two-stage training strategy for multi-stream end-to-end speech recognition | Robust Speech Recognition | A universal feature extractor (UFE) is pre-trained in a single-stream setup, then reused for multi-stream training. |
| Audio-visual Recognition of Overlapped speech for the LRS2 dataset | *** | Overlapped speech; audio-visual; LF-MMI training so that an explicit separation-then-recognition pipeline can be avoided. |
| End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection | *** | VAD integrated into attention-based online ASR via CTC; the blank label identifies non-speech regions; ESPnet. |
| End-to-end training of time domain audio separation and recognition | *** | Single-channel multi-speaker separation: a convolutional time-domain audio separation network (Conv-TasNet) trained jointly with an E2E speech recognizer. |
| Improving noise robust automatic speech recognition with single-channel time-domain enhancement network | *** | Speech enhancement (SE) with a single-channel time-domain denoising network. |
| Improving Reverberant Speech Training Using Diffuse Acoustic Simulation | *** | Simulates reverberant speech with diffuse acoustic simulation to augment ASR training (see the reverberation sketch after this table). |
| Low-frequency Compensated Synthetic Impulse Responses for Improved Far-field Speech Recognition | *** | Generates low-frequency-compensated synthetic impulse responses that improve far-field speech recognition. |
| Multi-scale Octave Convolutions for Robust Speech Recognition | *** | A multi-scale octave convolutional layer for robust speech representations; enlarges the receptive field; low-pass behavior. |
| Multi-task self-supervised learning for Robust Speech Recognition | ***SP | Self-supervised multi-task learning (PASE); a convolutional front end that integrates RNN and CNN; trained on multiple distortions. |

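The reverberation rows above (diffuse acoustic simulation, synthetic impulse responses) share one augmentation step: convolve clean speech with a measured or synthetic room impulse response. A minimal SciPy sketch of that shared step; the peak renormalization is an assumption.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Simulate far-field/reverberant audio by convolving a clean waveform
    with a room impulse response, then rescale to the original peak level."""
    wet = fftconvolve(speech, rir)[: len(speech)]   # keep the original length
    peak = np.max(np.abs(wet)) + 1e-8
    return wet * (np.max(np.abs(speech)) / peak)
```
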
| Paper | Topic | Idea |
| --- | --- | --- |
| A comparative study of estimating articulatory movements from phoneme sequences and acoustic features | Speech Production | Compares attention-based and BLSTM models for estimating articulator movements at different modeling levels, using linguistic information. |
| Speech-Based Parameter Estimation of an Asymmetric Vocal Fold Oscillation Model and Its Application in Discriminating Vocal Fold Pathologies | *** | Vocal fold oscillation; unknown. |

| Paper | Topic | Idea |
| --- | --- | --- |
| Attention-Based Gated Scaling Adaptive Acoustic Model for CTC-Based Speech Recognition | Speech Recognition Adaptation | Attention-based gated scaling (AGS): a scaling gate matrix is generated from lower layers with attention and applied to higher-layer activations; CTC (see the gating sketch after this table). |
| Unsupervised pretraining transfers well across languages | ***SP | Contrastive predictive coding (CPC) unsupervised pretraining extracts features that transfer across languages. |
| Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR | *** | Unsupervised speaker adaptation; unknown. |

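A loose sketch of the gated-scaling idea in the AGS row above: a scaling matrix predicted from a lower layer is applied elementwise to a higher layer's activations. The paper generates the gate with attention over lower layers; the single linear projection and sigmoid used here are simplifying assumptions.

```python
import torch
import torch.nn as nn

class GatedScaling(nn.Module):
    """Elementwise adaptive scaling of a hidden layer, with the scale
    predicted from a lower layer's activations (a simplified stand-in
    for the attention-based gate of the paper)."""
    def __init__(self, lower_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(lower_dim, hidden_dim)

    def forward(self, hidden, lower):               # both (batch, time, dim)
        scale = torch.sigmoid(self.proj(lower))     # gate values in (0, 1)
        return hidden * scale
```
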
| Paper | Topic | Idea |
| --- | --- | --- |
| Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks | Speech Recognition Confidence, Errors and OOVs | A BLSTM assesses the confidence of ASR output at the word and sub-word level; lattice RNNs improve the sub-word level (see the confidence sketch after this table). |
| Joint Contextual Modeling for ASR Correction and Language Understanding | ***SP | Task-specific LMs and a fine-tuned GPT LM re-rank the n-best ASR hypotheses; a joint model for ASR output error correction and language understanding; needs only a small amount of in-domain training data. |
| On Modeling ASR Word Confidence | *** | A heterogeneous word confusion network (HWCN) models word confidence, with score calibration based on comparisons across different models; BiRNN over lattices. |

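For orientation, a toy sequence-labeling version of word-confidence estimation, far simpler than the lattice-RNN and HWCN models in this table: a BiLSTM reads one feature vector per hypothesized word (embedding, acoustic/LM scores, duration, etc.; the feature makeup is an assumption) and emits a per-word probability of being correct.

```python
import torch
import torch.nn as nn

class WordConfidence(nn.Module):
    """BiLSTM tagger mapping per-word features of an ASR hypothesis to
    P(word is correct)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                        # (batch, words, feat_dim)
        h, _ = self.rnn(feats)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, words)
```
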
| Paper | Topic | Idea |
| --- | --- | --- |
| Exploring a Zero-Order Direct HMM Based on Latent Attention for Automatic Speech Recognition | Speech Recognition General Topics | Combines an HMM with a Transformer or LSTM to obtain an explicit alignment while keeping the ease of end-to-end training. |
| GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition | *** | A WFST decoding paradigm on GPU with a novel Viterbi exact-lattice algorithm, for batched online and offline recognition. |
| Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling | *** | A smoothed max-pooling loss detects keyword parts and whole keywords; semi-supervised; on-device. |
| Meta Learning for End-to-End Low-Resource Speech Recognition | *** | MAML (model-agnostic meta-learning) for low-resource-language ASR. |
| Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding | *** | Regularizes the decoder's output with word embeddings by maximizing cosine similarity; a new fused decoding mechanism exploits the transformer decoder (see the regularizer sketch after this table). |
| Synchronous Transformers for End-to-End Speech Recognition | *** | The transformer decoder predicts output conditioned on the encoder chunk by chunk rather than on the whole input, addressing online ASR; a forward-backward algorithm is used for optimization during training. |
| Training ASR models by Generation of Contextual Information | *** | Evaluates the effectiveness of weakly supervised ASR trained on loosely related contextual information. |

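A compact PyTorch sketch of the word-embedding regularization in the fused-decoding row above: each decoder state is pulled toward the pretrained embedding of its target word by maximizing cosine similarity. The shapes and the frozen embedding table are assumptions.

```python
import torch
import torch.nn.functional as F

def embedding_regularizer(decoder_states, target_ids, word_emb):
    """decoder_states: (batch, steps, dim); target_ids: (batch, steps) long;
    word_emb: frozen pretrained embeddings, shape (vocab, dim). The loss is
    minimal when every decoder state is collinear with its target embedding."""
    targets = word_emb[target_ids]                       # gather target embeddings
    cos = F.cosine_similarity(decoder_states, targets, dim=-1)
    return (1.0 - cos).mean()                            # minimize 1 - cosine
```
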
| Paper | Topic | Idea |
| --- | --- | --- |
| Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition | Speech Recognition Representations and Embeddings | Learns representations by reconstructing the present temporal slice of filterbank features from past and future context frames on unlabeled data; the representations then train a CTC end-to-end ASR with a small amount of labeled data; semi-supervised. |
| Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | *** | Pre-trains a bidirectional transformer encoder on a large amount of unlabeled data; prediction is conditioned jointly on past and future frames (see the masked-reconstruction sketch after this table). |
| Multilingual acoustic word embedding models for processing zero-resource languages | *** | Acoustic word embeddings from an autoencoder and a discriminative classifier trained on well-resourced languages, applied to zero-resource languages. |
| What does a network layer hear? Analyzing hidden representations of end-to-end ASR through speech synthesis | *** | Probing models synthesize speech from the hidden features of an end-to-end ASR to examine what each layer has heard. |

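The masked-reconstruction pretraining behind Mockingjay (and the masked-reconstruction paper in the acoustic-modeling table above) fits in a few lines. A generic sketch follows; the 15% masking rate, zero-fill corruption, and L1 loss are assumptions, not any single paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_step(model, fbank, mask_prob=0.15):
    """fbank: (batch, frames, bins). `model` is any encoder that maps the
    corrupted features back to (batch, frames, bins). Random frames are
    zeroed out, and the loss is computed only on those masked frames."""
    mask = torch.rand(fbank.shape[:2], device=fbank.device) < mask_prob
    corrupted = fbank.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = model(corrupted)
    return F.l1_loss(recon[mask], fbank[mask])
```
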
| Paper | Topic | Idea |
| --- | --- | --- |
| CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition | Streaming ASR | Determines acoustic segment boundaries with a continuous integrate-and-fire (CIF) mechanism plus self-attention alignment (SAA), processing by chunks for streaming ASR (see the toy CIF sketch after this table). |
| Streaming automatic speech recognition with the transformer model | *** | Time-restricted self-attention and triggered attention bring the transformer to streaming ASR; SOTA on LibriSpeech. |
| Transformer-Based Online CTC/Attention End-to-End Speech Recognition Architecture | *** | Uses a self-attention aligner for online CTC/attention recognition. |

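Finally, a toy version of the continuous integrate-and-fire scheduling in the CIF row above: per-frame weights are accumulated, and a boundary "fires" whenever the integral crosses a threshold. Real CIF learns the weights and splits the crossing frame's weight between adjacent segments; both are omitted here.

```python
def cif_boundaries(alphas, threshold=1.0):
    """alphas: per-frame accumulation weights (e.g. predicted by the model).
    Returns the frame indices at which an output token is emitted."""
    boundaries, acc = [], 0.0
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            boundaries.append(t)
            acc -= threshold    # carry the remainder into the next segment
    return boundaries

print(cif_boundaries([0.3, 0.5, 0.4, 0.2, 0.7, 0.1]))  # -> [2, 4]
```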