S. Goldwater, T. L. Griffiths, and M. Johnson, A bayesian framework for word segmentation: Exploring the effects of context, Cognition, vol.112, issue.1, pp.21-54, 2009.

C. Lee and J. Glass, A nonparametric bayesian approach to acoustic model discovery, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol.1, pp.40-49, 2012.

L. Ondel, L. Burget, and J. Černocký, Variational inference for acoustic unit discovery, Procedia Computer Science, vol.81, pp.80-86, 2016.

D. Harwath, A. Torralba, and J. Glass, Unsupervised learning of spoken language with visual context, Advances in Neural Information Processing Systems, pp.1858-1866, 2016.

E. Dupoux, Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner, Cognition, vol.173, pp.43-59, 2018.
URL: https://hal.archives-ouvertes.fr/hal-01888694

M. Halle and K. Stevens, Speech recognition: A model and a program for research, IRE transactions on information theory, vol.8, issue.2, pp.155-159, 1962.

A. M. Liberman, F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy, Perception of the speech code, Psychological review, vol.74, issue.6, p.431, 1967.

S. Schneider, A. Baevski, R. Collobert, and M. Auli, wav2vec: Unsupervised pre-training for speech recognition, 2019.

S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, Learning problem-agnostic speech representations from multiple self-supervised tasks, 2019.

Y. Chung, W. Hsu, H. Tang, and J. Glass, An unsupervised autoregressive model for speech representation learning, 2019.

A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee, Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders, 2019.

D. Harwath, W. Hsu, and J. Glass, Learning hierarchical discrete linguistic units from visually-grounded speech, International Conference on Learning Representations, 2020.

W. Hsu, Y. Zhang, and J. Glass, Unsupervised learning of disentangled and interpretable representations from sequential data, Advances in neural information processing systems, pp.1878-1889, 2017.

S. Khurana, S. R. Joty, A. Ali, and J. Glass, A factorial deep markov model for unsupervised disentangled representation learning from speech, ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.6540-6544, 2019.

Y. Li and S. Mandt, Disentangled sequential autoencoder, 2018.

W. Hsu, Y. Zhang, and J. Glass, Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.16-23, 2017.

W. Hsu, H. Tang, and J. Glass, Unsupervised adaptation with interpretable disentangled representations for distant conversational speech recognition, 2018.

W. Grathwohl, K. Wang, J. Jacobsen, D. Duvenaud, M. Norouzi et al., Your classifier is secretly an energy based model and you should treat it like one, International Conference on Learning Representations, 2020.

R. Ranganath, S. Gerrish, and D. M. Blei, Black box variational inference, 2013.

R. G. Krishnan, U. Shalit, and D. Sontag, Structured inference networks for nonlinear state space models, Thirty-first aaai conference on artificial intelligence, 2017.

T. Hofmann, B. Schölkopf, and A. J. Smola, Kernel methods in machine learning, The annals of statistics, pp.1171-1220, 2008.

J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, Unsupervised speech representation learning using wavenet autoencoders, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, pp.2041-2053, 2019.

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz et al., Generating sentences from a continuous space, 2015.

A. Hannun, A. Lee, Q. Xu, and R. Collobert, Sequence-to-sequence speech recognition with time-depth separable convolutions, 2019.

A. Mohamed, D. Okhonko, and L. Zettlemoyer, Transformers with convolutional context for asr, 2019.

L. Maaløe, M. Fraccaro, V. Liévin, and O. Winther, Biva: A very deep hierarchy of latent variables for generative modeling, Advances in neural information processing systems, pp.6548-6558, 2019.

M. J. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta, Composing graphical models with neural networks for structured representations and fast inference, Advances in neural information processing systems, pp.2946-2954, 2016.

W. Lin, N. Hubacher, and M. E. Khan, Variational message passing with structured inference networks, 2018.

J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach et al., Hidden markov model variational autoencoder for acoustic unit discovery, pp.488-492, 2017.

D. P. Kingma and M. Welling, Auto-encoding variational bayes, 2013.

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, Proceedings of the 23rd international conference on Machine learning, pp.369-376, 2006.

Y. Tian, D. Krishnan, and P. Isola, Contrastive multiview coding, 2019.

O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. V. Oord, Data-efficient image recognition with contrastive predictive coding, 2019.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: an asr corpus based on public domain audio books, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5206-5210, 2015.

D. B. Paul and J. M. Baker, The design for the wall street journal-based csr corpus, Proceedings of the workshop on Speech and Natural Language, pp.357-362, 1992.

B. Uria, M. Côté, K. Gregor, I. Murray, and H. Larochelle, Neural autoregressive distribution estimation, The Journal of Machine Learning Research, vol.17, issue.1, pp.7184-7220, 2016.

A. V. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., Wavenet: A generative model for raw audio, 2016.

S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain et al., Samplernn: An unconditional end-to-end neural audio generation model, 2016.

S. Vasquez and M. Lewis, Melnet: A generative model for audio in the frequency domain, 2019.

R. Prenger, R. Valle, and B. Catanzaro, Waveglow: A flow-based generative network for speech synthesis, ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.3617-3621, 2019.