Recently, there has been a surge of interest in using self-attention networks (SANs), based on the Transformer architecture, for natural language processing tasks. Large-scale, multi-purpose language models have also become increasingly popular for tasks such as speech recognition. Following these trends, researchers are exploring pre-trained models, trained on large amounts of speech data, for downstream tasks such as speaker verification. We propose a speaker embedding extractor based on SANs that obtains a discriminative speaker representation from variable-length speech utterances. Our approach achieves up to a 41% relative performance improvement over the naive SAN used in our previous work. Furthermore, we examine the training stability issues associated with SANs and suggest potential solutions to mitigate them. This article was authored by Pooyan Safari, Miquel India, and Javier Hernando.
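To illustrate the core idea of extracting a fixed-size speaker embedding from variable-length utterances, here is a minimal sketch of self-attention pooling over frame-level features. This is an assumption-laden illustration, not the authors' actual architecture: the function name `self_attention_pool` and the single learned attention vector `v` are hypothetical, and a real system would stack full self-attention encoder layers before pooling.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention_pool(frames, v):
    """Collapse a variable-length sequence of frame features (T, d)
    into a fixed-size utterance embedding (d,).

    frames: (T, d) frame-level features for one utterance
    v:      (d,)   hypothetical learned attention parameter vector
    """
    scores = frames @ v        # (T,) one relevance score per frame
    weights = softmax(scores)  # attention distribution over frames
    return weights @ frames    # attention-weighted average: shape (d,)

rng = np.random.default_rng(0)
d = 8
v = rng.standard_normal(d)     # stands in for a trained parameter
for T in (50, 120):            # utterances of different lengths
    emb = self_attention_pool(rng.standard_normal((T, d)), v)
    print(T, emb.shape)        # embedding size is d regardless of T
```

The key property shown is that the pooled embedding has the same dimensionality whatever the utterance length, which is what makes attention pooling a natural fit for speaker verification back-ends that expect fixed-size inputs.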