ICASSP 2020 Papers about Speech
2020-03-22
11 min read
ICASSP 2020 papers by category
Papers | Topic | Idea |
---|---|---|
Cross lingual transfer learning for zero-resource domain adaptation | e2e Acoustic Modeling | Shares several DNN layers across languages in multilingual acoustic modeling; transfer learning |
Deja-vu: Double Feature Presentation and Iterated Loss in Deep Transformer Networks | *** | Re-presents the input features at intermediate layers via attention; adds an auxiliary (iterated) loss at each of several layers |
FRAME-LEVEL MMI AS A SEQUENCE DISCRIMINATIVE TRAINING CRITERION FOR LVCSR | *** | Frame-level MMI for robustness; the evaluation criterion does not depend on frame-wise decisions; sequence-discriminative training |
Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning | *** | Semi-supervised teacher-student training (knowledge distillation) to pre-train the spatial filter and front-end feature extraction; multi-channel |
G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR | *** | Uses G2G with TTS to generate alternative grapheme-level pronunciations, improving ASR of proper nouns |
Robust Multi-channel Speech Recognition using Frequency Aligned Network | *** | Trains the spatial filtering layer jointly with the acoustic model; a frequency-aligned network prevents one frequency bin from influencing others, for robustness; multi-channel |
SNDCNN: Self-normalizing deep CNNs with scaled exponential linear units for speech recognition | *** | Self-normalizing CNNs with SELU activations, removing skip connections and batch normalization to accelerate inference (see the SELU sketch after the table) |
SpecAugment on Large Scale Datasets | *** | Studies why SpecAugment works at scale; the time-mask size is adapted to the utterance length (see the masking sketch after the table) |
Transformer-based Acoustic Modeling for Hybrid Speech Recognition | *** | Evaluates transformer-based acoustic models (AMs) for hybrid speech recognition |
Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction | *** | Masked-reconstruction loss for pre-training the AM encoder on much more unlabeled data than labeled data; domain adaptation |
A COMPREHENSIVE STUDY OF RESIDUAL CNNS FOR ACOUSTIC MODELING IN ASR | e2e Acoustic Modeling | Residual CNNs for LVCSR that allow online streaming; SpecAugment to overcome overfitting |
CGCNN: Complex Gabor Convolutional Neural Network on raw speech | *** | Replaces the usual CNN filters with complex Gabor filters in a complex-valued network, learning acoustic features from raw speech instead of handcrafting them; exploits time-frequency resolution and the complex domain |
DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition | *** | Deep feed-forward sequential memory network with self-attention (DFSMN-SAN) outperforms a vanilla self-attention network; persistent memory mechanism |
Effectiveness of self-supervised pre-training for speech recognition | *** | Fine-tunes pre-trained BERT models with CTC; 10 h of labeled data plus a vq-wav2vec vocabulary matches 100 h of labeled data |
Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation | *** | Time perturbation in the frequency domain and sub-sequence sampling as on-the-fly augmentation for S2S ASR training |
LAYER-NORMALIZED LSTM FOR HYBRID-HMM AND END-TO-END ASR | *** | Layer normalization applied to different parts of an LSTM, for both hybrid and e2e ASR |
Libri-Light: A Benchmark for ASR with Limited or No Supervision | *** | A new collection of spoken English audio for training ASR under limited or no supervision; at 60k hours, the largest speech corpus |
Small energy masking for improved neural network training for end-to-end speech recognition | *** | Masks a time-frequency bin if its filterbank energy falls below a threshold sampled from a uniform distribution; 11.2%/13.5% relative WER improvement on LibriSpeech (see the masking sketch after the table) |
Attention-based ASR with Lightweight and Dynamic Convolutions | e2e ASR General Topics | Lightweight and dynamic convolutions as an alternative to self-attention, making the computational cost linear; joint CTC training |
Correction of Automatic Speech Recognition with Transformer Sequence-to-sequence Model | *** | Transformer applied to ASR output to make it grammatical and semantically correct; an ASR correction model with data augmentation and pre-training; outperforms 6-gram LM rescoring and Transformer-XL LM rescoring in WER |
Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems | ***SP | TTS trained on ASR corpora generates audio to extend a SOTA ASR system; text-only data is useful; 33% relative improvement on low-resource LibriSpeech 100 h |
Independent language modeling architecture for end-to-end ASR | ***SP | Separates the decoder from the encoder output so it becomes an independent LM trainable on external text data; usable for low-resource ASR |
Self-Training for End-to-End Speech Recognition | *** | Demonstrates that training with pseudo-labels can substantially improve a baseline model; ensembles for label diversity; semi-supervised; strong results |
End-to-End Multi-speaker Speech Recognition with Transformer | New Models | Transformer for multi-speaker and multi-channel ASR; self-attention restricted to within one segment; strong results |
JOINT PHONEME-GRAPHEME MODEL FOR END-TO-END SPEECH RECOGNITION | ***SP | Shares encoder layers between signal-to-phoneme and signal-to-grapheme; multi-task learning; joint model based on iterative refinement |
LIGHTWEIGHT AND EFFICIENT END-TO-END SPEECH RECOGNITION USING LOW-RANK TRANSFORMER | *** | Low-rank transformer that reduces the parameter count and speeds up training and inference (see the low-rank sketch after the table) |
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | *** | Acoustic modeling from blocks of 1-D time-channel separable convolution, batch norm, and ReLU; fine-tunes effectively on new datasets |
A practical two-stage training strategy for multi-stream end-to-end speech recognition | Robust Speech Recognition | Pre-trains a universal feature extractor (UFE) in a single-stream setup, then trains the multi-stream model |
Audio-visual Recognition of Overlapped speech for the LRS2 dataset | *** | Overlapped speech; audio-visual; LF-MMI systems that can skip an explicit speech separation stage |
End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection | *** | VAD integrated into online attention-based/CTC ASR; the CTC blank label identifies non-speech regions; ESPnet (see the VAD sketch after the table) |
End-to-end training of time domain audio separation and recognition | *** | Single-channel multi-speaker separation: Conv-TasNet (Convolutional Time-domain Audio Separation Network) trained jointly with an E2E speech recognizer |
Improving noise robust automatic speech recognition with single-channel time-domain enhancement network | *** | Speech enhancement (SE) with a single-channel time-domain denoising network |
Improving Reverberant Speech Training Using Diffuse Acoustic Simulation | *** | Simulates reverberant speech (diffuse acoustic simulation) to augment training data for conventional neural-network ASR |
Low-frequency Compensated Synthetic Impulse Responses for Improved Far-field Speech Recognition | *** | Generates low-frequency-compensated synthetic impulse responses that improve far-field speech recognition |
Multi-scale Octave Convolutions for Robust Speech Recognition | *** | A multi-scale octave convolutional layer for robust speech representations; enlarges the receptive field; low-pass branches |
Multi-task self-supervised learning for Robust Speech Recognition | ***SP | Self-supervised PASE front-end; integrates CNN and RNN; trained with multiple distortions |
A comparative study of estimating articulatory movements from phoneme sequences and acoustic features | Speech Production | Compares attention-based and BLSTM models for estimating articulatory movements at different modeling levels, using linguistic information |
Speech-Based Parameter Estimation of an Asymmetric Vocal Fold Oscillation Model and Its Application in Discriminating Vocal Fold Pathologies | *** | Estimates the parameters of an asymmetric vocal fold oscillation model from speech and uses them to discriminate vocal fold pathologies |
ATTENTION-BASED GATED SCALING ADAPTIVE ACOUSTIC MODEL FOR CTC-BASED SPEECH RECOGNITION | Speech Recognition Adaptation | Attention-based gated scaling (AGS): a scaling gate matrix generated from lower layers with attention; CTC |
Unsupervised pretraining transfers well across languages | ***SP | Contrastive predictive coding (CPC) unsupervised pre-training extracts features that transfer across languages |
Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR | *** | Unsupervised speaker adaptation via an attention-based speaker memory for end-to-end ASR |
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks | Speech Recognition Confidence, Errors and OOVs | BLSTM assesses the confidence of ASR output at the word and sub-word level; lattice RNNs improve the sub-word level |
Joint Contextual Modeling for ASR Correction and Language Understanding | ***SP | Task-specific language models and a fine-tuned GPT LM re-rank the n-best ASR hypotheses; a joint ASR error correction and language understanding model trainable with a small amount of in-domain data |
On Modeling ASR Word Confidence | *** | Heterogeneous word confusion network (HWCN) for modeling word confidence, with score calibration by comparing different models; BiRNN over lattices |
EXPLORING A ZERO-ORDER DIRECT HMM BASED ON LATENT ATTENTION FOR AUTOMATIC SPEECH RECOGNITION | Speech Recognition General Topics | Combines an HMM with a Transformer or LSTM to obtain explicit alignments while keeping the ease of end-to-end training |
GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition | *** | WFST decoding on GPU with a novel exact Viterbi lattice algorithm, for batched online and offline recognition |
Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling | *** | Smoothed max-pooling loss to detect keyword parts and whole keywords; semi-supervised; suited to on-device use |
Meta Learning for End-to-End Low-Resource Speech Recognition | *** | Model-agnostic meta-learning (MAML) for low-resource-language ASR |
Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding | *** | Regularizes the decoder's output with word embeddings by maximizing cosine similarity; a new fused decoding mechanism takes advantage of the transformer decoder (see the regularization sketch after the table) |
Synchronous Transformers for End-to-End Speech Recognition | *** | The transformer decoder predicts output conditioned on the encoder chunk by chunk rather than on the whole input, enabling online ASR; a forward-backward algorithm optimizes training |
Training ASR models by Generation of Contextual Information | *** | Evaluates the effectiveness of weakly supervised ASR trained with loosely related contextual information |
Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition | Speech Recognition Representations and Embeddings | Representation learning by reconstructing the current temporal slice of filterbank features from past and future context frames on unlabeled data; used to train CTC end-to-end ASR with a small amount of labeled data; semi-supervised |
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | *** | Pre-trains a bidirectional transformer encoder on a large amount of unlabeled data, conditioning jointly on past and future frames |
Multilingual acoustic word embedding models for processing zero-resource languages | *** | Acoustic word embeddings trained with an autoencoder and a discriminative classifier on well-resourced languages, then applied to zero-resource languages |
What does a network layer hear? Analyzing hidden representations of end-to-end ASR through speech synthesis | *** | Probing models that synthesize speech from the hidden representations of end-to-end ASR, to examine what each layer has "heard" |
CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition | *** | Determines acoustic segments via a continuous integrate-and-fire (CIF) mechanism with self-attention alignment (SAA), processing chunk by chunk for streaming ASR |
Streaming automatic speech recognition with the transformer model | *** | Time-restricted self-attention and triggered attention enable streaming ASR with the transformer; SOTA streaming results on LibriSpeech |
TRANSFORMER-BASED ONLINE CTC/ATTENTION END-TO-END SPEECH RECOGNITION ARCHITECTURE | *** | Online transformer-based CTC/attention ASR using a self-attention aligner |
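
A few of the ideas above are compact enough to sketch in code. First, the SELU activation behind SNDCNN: because SELU pushes activations toward zero mean and unit variance on its own, the paper can drop batch normalization and skip connections. The NumPy version below is a minimal illustration, not the SNDCNN implementation:

```python
import numpy as np

# SELU constants from Klambauer et al., "Self-Normalizing Neural Networks".
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit: self-normalizing activation that lets
    deep CNNs train without batch norm or skip connections."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))
```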
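Next, SpecAugment-style masking with the time-mask width tied to utterance length, the adaptive-masking idea from "SpecAugment on Large Scale Datasets". The mask counts and width ratios below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2,
                 max_freq_width=15, time_width_ratio=0.05, rng=None):
    """Apply frequency and time masking to a (time, freq) log-mel spectrogram.
    The time-mask width scales with the utterance length."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(n_freq_masks):
        width = rng.integers(0, max_freq_width + 1)
        start = rng.integers(0, max(1, F - width))
        spec[:, start:start + width] = 0.0
    max_time_width = max(1, int(T * time_width_ratio))  # adapted to utterance length
    for _ in range(n_time_masks):
        width = rng.integers(0, max_time_width + 1)
        start = rng.integers(0, max(1, T - width))
        spec[start:start + width, :] = 0.0
    return spec

# Example: mask an 800-frame, 80-bin log-mel spectrogram.
augmented = spec_augment(np.random.randn(800, 80))
```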
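The small-energy-masking idea instead masks individual time-frequency bins whose filterbank energy falls under a randomly drawn threshold. The dB range and the peak-relative parameterization here are my assumptions; the paper defines its own threshold distribution:

```python
import numpy as np

def small_energy_mask(fbank, peak_db_range=(-80.0, 0.0), rng=None):
    """Zero out time-frequency bins whose filterbank energy is below a
    threshold sampled from a uniform distribution (in dB below the
    utterance's peak energy; the range is an illustrative assumption)."""
    rng = rng or np.random.default_rng()
    log_e = 10.0 * np.log10(fbank + 1e-10)
    threshold_db = rng.uniform(*peak_db_range)  # one random threshold per utterance
    threshold = log_e.max() + threshold_db      # relative to the peak energy
    return np.where(log_e < threshold, 0.0, fbank)
```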
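The low-rank transformer replaces full-rank projection matrices with two thin factors. A sketch of the core factorization, with shapes chosen only for illustration:

```python
import numpy as np

class LowRankLinear:
    """Factor a (d_out x d_in) weight matrix as U @ V with rank r, cutting
    parameters from d_out*d_in down to r*(d_in + d_out)."""

    def __init__(self, d_in, d_out, rank, rng=None):
        rng = rng or np.random.default_rng()
        scale = 1.0 / np.sqrt(d_in)
        self.U = rng.normal(0.0, scale, size=(d_out, rank))
        self.V = rng.normal(0.0, scale, size=(rank, d_in))

    def __call__(self, x):
        return x @ self.V.T @ self.U.T  # (batch, d_in) -> (batch, d_out)

# A rank-64 factorization of a 512x512 projection: 65,536 weights instead of 262,144.
proj = LowRankLinear(512, 512, rank=64)
y = proj(np.random.randn(8, 512))
```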
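CTC-based VAD rests on the observation that the CTC blank label dominates in non-speech regions. A rough sketch of that intuition; the threshold and minimum run length are assumptions, and the paper's actual integration with streaming decoding is more involved:

```python
import numpy as np

def blank_based_vad(blank_probs, threshold=0.99, min_run=50):
    """Mark frames as non-speech when the CTC blank posterior stays above
    `threshold` for at least `min_run` consecutive frames."""
    is_blank = blank_probs > threshold
    speech = np.ones_like(is_blank, dtype=bool)
    run_start = None
    for t, b in enumerate(np.append(is_blank, False)):  # sentinel closes any open run
        if b and run_start is None:
            run_start = t
        elif not b and run_start is not None:
            if t - run_start >= min_run:
                speech[run_start:t] = False  # long blank run => non-speech
            run_start = None
    return speech  # True where speech is (probably) present

# Example with per-frame blank posteriors from a CTC model:
mask = blank_based_vad(np.random.rand(1000))
```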
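Finally, the word-embedding regularization from the fused-decoding paper can be written as a cosine-similarity term between decoder states and pre-trained target embeddings. This omits the paper's projection and fusion details and only sketches the loss term:

```python
import numpy as np

def embedding_regularization_loss(decoder_states, target_embeddings):
    """Cosine-similarity regularizer: push each decoder output state toward
    the pre-trained embedding of its target word. Maximizing cosine
    similarity corresponds to minimizing this negative mean."""
    h = decoder_states / np.linalg.norm(decoder_states, axis=-1, keepdims=True)
    e = target_embeddings / np.linalg.norm(target_embeddings, axis=-1, keepdims=True)
    return -np.mean(np.sum(h * e, axis=-1))
```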