Snap Machine Learning is a new library for fast training of generalized linear models. It combines innovations in algorithms and software, and benefits from optimized hardware architectures such as IBM Power and NVIDIA GPUs. It builds on recent advances in parallel and distributed learning to remove training time as a bottleneck for a range of applications. Workloads where training time can be critical include real-time learning, where frequent retraining of models is desired; ensemble learning, where a large number of models need to be trained and combined; and large-data applications, where even the training of simple models can be very time-consuming.

The three main innovations in Snap Machine Learning are new algorithms designed to run on GPUs, efficient use of sparse data structures, and the ability to scale out to many nodes while keeping communication to a minimum. The library has a flexible API: it can be used in a standalone mode via a Python interface, or in a distributed mode via an Apache Spark or MPI interface.

Using Snap Machine Learning on IBM Power and NVIDIA GPUs, we were able to train a logistic regression model on the Terabyte Click Logs dataset in under two minutes. The Terabyte Click Logs is a large online advertising dataset released by Criteo Labs to advance research in distributed machine learning. The dataset consists of a portion of Criteo's traffic over a period of 24 days; over 160 million training examples were collected each day, so the whole dataset contains over 4 billion labeled examples, where a label indicates whether a user clicked on an ad or not. The machine learning task is therefore to predict whether a user clicks on a particular ad. This workload is particularly well suited to Snap ML on IBM Power and NVIDIA GPUs.
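To make the click-prediction task concrete, here is a minimal NumPy sketch of logistic regression on synthetic click-log-style data. This is not Snap ML code, and the data is randomly generated for illustration; it only shows the shape of the problem (binary click/no-click labels, a linear model trained on the logistic loss) that Snap ML solves at terabyte scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for click-log data: each row is a feature vector for
# one ad impression; the label is 1 (click) or 0 (no click).
n, d = 5000, 20
w_true = rng.normal(size=d)                      # hypothetical "true" model
X = rng.normal(size=(n, d))
p = 1.0 / (1.0 + np.exp(-X @ w_true))
y = (rng.random(n) < p).astype(float)

def train_logreg(X, y, lr=0.1, epochs=50):
    """Full-batch gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))     # predicted click probability
        grad = X.T @ (preds - y) / len(y)        # gradient of the logistic loss
        w -= lr * grad
    return w

w = train_logreg(X, y)
preds = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(float)
accuracy = (preds == y).mean()
```

At terabyte scale the same objective is optimized, but with GPU-resident solvers, sparse data structures, and multi-node communication handled by the library.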
When the dataset is too large to fit in the aggregate GPU memory of the cluster, we have to use techniques for swapping data in and out of GPU memory. There are various ways to minimize the amount of data that has to be swapped, but in the end, for some applications, the time to copy data from the CPU to the GPU can still become a bottleneck, particularly over the PCIe interface. NVLink is a high-bandwidth interface, and we can leverage that additional bandwidth to effectively hide the copy time behind the computation time on the GPU. This means we can use the GPUs more effectively, which ultimately results in faster training.
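The idea of hiding copy time behind compute time can be illustrated with a small double-buffering sketch. This is not Snap ML's implementation; it simulates transfers and compute with timed sleeps in plain Python to show why prefetching the next data chunk while the current one is being processed beats copying and computing serially.

```python
import threading
import time

CHUNKS = 8
COPY_TIME = 0.02      # simulated host-to-device transfer per chunk
COMPUTE_TIME = 0.04   # simulated GPU compute per chunk

events = [threading.Event() for _ in range(CHUNKS)]

def copy_chunk(i):
    """Stand-in for an asynchronous PCIe/NVLink transfer of chunk i."""
    time.sleep(COPY_TIME)
    events[i].set()

# Pipelined schedule: while chunk i is being computed on, chunk i+1 is
# already being copied in, so the copy time is hidden behind compute.
start = time.perf_counter()
threading.Thread(target=copy_chunk, args=(0,)).start()
for i in range(CHUNKS):
    events[i].wait()                  # block until chunk i is resident
    if i + 1 < CHUNKS:                # prefetch the next chunk in parallel
        threading.Thread(target=copy_chunk, args=(i + 1,)).start()
    time.sleep(COMPUTE_TIME)          # compute on chunk i
pipelined = time.perf_counter() - start

# Serial baseline: copy a chunk, then compute on it, one after the other.
serial = CHUNKS * (COPY_TIME + COMPUTE_TIME)
```

With sufficient bandwidth, the pipelined schedule approaches one copy plus the total compute time, whereas the serial schedule pays the full copy cost for every chunk; higher-bandwidth links like NVLink make it easier for each prefetch to finish within a compute step.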