We propose a transformer encoder-decoder framework with a multi-objective training strategy that combines CTC and MLM objectives to learn contextual bidirectional speech representations. The resulting embeddings outperform those of comparable models on multiple datasets even before fine-tuning, and we introduce class attention as an efficient module for spoken language understanding. This article was authored by Quintin Mias, Marie Francine Mones, and Hugo Van Ham.
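To illustrate the multi-objective idea, the following is a minimal PyTorch sketch of jointly combining a CTC loss and an MLM (cross-entropy) loss. All shapes, the weighting hyperparameter `lam`, and the random dummy tensors are assumptions for illustration only; this is not the authors' actual model or training configuration.

```python
import torch
import torch.nn as nn

# Assumed dimensions (illustrative, not from the paper).
T, B, C = 50, 2, 32   # time steps, batch size, CTC vocab size (blank = 0)
L, V = 10, 100        # masked positions per utterance, MLM vocab size
S = 20                # target transcript length

torch.manual_seed(0)

# CTC branch: frame-level log-probabilities over the CTC vocabulary.
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, S))                 # dummy label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# MLM branch: logits at masked positions vs. true token ids.
mlm_logits = torch.randn(B * L, V)
mlm_targets = torch.randint(0, V, (B * L,))
mlm_loss = nn.CrossEntropyLoss()(mlm_logits, mlm_targets)

# Multi-objective loss: a weighted sum (lam is an assumed hyperparameter).
lam = 0.5
total_loss = lam * ctc_loss + (1 - lam) * mlm_loss
```

In practice the two heads would share the encoder, so gradients from both objectives shape the same contextual representations.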