This paper proposes a system that dynamically locates and tracks sound sources using audio-visual information. The system matches pre-registered voice prints and facial features to locate a specific target, then tracks that target while suppressing other sound sources. This improves the accuracy of downstream tasks such as speech recognition, reducing word recognition errors by up to 14%. This article was authored by Jean Boucher, Lin Zhang, and Dong Qing Wang.
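The target-locating step described above can be illustrated with a minimal sketch: comparing a live speaker embedding against pre-registered voice-print embeddings by cosine similarity. The embeddings, names, and threshold below are hypothetical stand-ins (the paper does not specify the matching method); in practice the vectors would come from a speaker-embedding model.

```python
import numpy as np

# Hypothetical pre-registered voice-print embeddings, one per enrolled target.
# Real embeddings would be produced by a speaker-embedding model.
enrolled = {
    "alice": np.array([0.9, 0.1, 0.0]),
    "bob":   np.array([0.1, 0.8, 0.2]),
}

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, threshold=0.7):
    """Return the best-matching enrolled target, or None if no match
    clears the similarity threshold (i.e. treat it as an interfering source)."""
    best_name, best_score = None, -1.0
    for name, ref in enrolled.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# A query embedding close to "alice" should be identified as the target.
print(identify(np.array([0.85, 0.15, 0.05])))  # → alice
```

Sources that match no enrolled voice print fall below the threshold and can be ignored by the tracker, which is the "avoiding other sound sources" behavior the paper describes.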