Hi, I'm Danny Wilson, Professor at the Big Data Institute at the University of Oxford. I'm going to tell you about the harmonic mean p R package. The package implements the harmonic mean p-value procedure, or HMP for short, which is a method for combining p-values from hypothesis tests. It addresses the problem of preserving statistical power while avoiding false signals when doing many tests.
In big data analysis, the number of tests can increase with sample size. Suppose you wanted to find genetic variants that affect a trait by testing for association. The number of genetic variants increases with sample size, and therefore the total number of tests increases with sample size. Every additional test increases the chances of discovering a true signal, but it also increases the risk of discovering false signals or artifacts. To avoid artifacts, statisticians demand stronger evidence as the number of tests increases. This is called multiple testing correction. Bonferroni correction is the best-known method. The problem is that multiple testing correction reduces the power to detect true signals as well as suppressing false signals.
The HMP tries to solve this problem. It tries to maintain good power to discover true signals while avoiding false signals by using a combined test. A combined test has two benefits. It combines information across individual tests, and it reduces the burden of multiple testing. For example, the HMP can increase the power to detect genetic associations with a trait. Individual tests are subject to a stringent threshold, represented by either of the two example dashed lines. But combining those p-values produces a more powerful test. For example, the HMP can test all genetic variants in a single chromosome for association with the trait. It works by taking the harmonic mean of the individual p-values in that chromosome.
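To make the idea of taking a harmonic mean of p-values concrete, here is a minimal Python sketch. It is not the package's code: the function name harmonic_mean_p and the five p-values are made up for illustration, and all tests are given equal weight. Comparing the raw harmonic mean directly with the target false positive rate is only an approximation; the package converts it to a properly calibrated combined p-value.

```python
def harmonic_mean_p(p_values):
    """Equally weighted harmonic mean of a list of p-values (illustrative helper)."""
    return len(p_values) / sum(1.0 / p for p in p_values)

# Made-up p-values, chosen only to illustrate the arithmetic.
p_values = [0.02, 0.03, 0.04, 0.05, 0.06]
alpha = 0.05  # target overall false positive rate

# Bonferroni correction: each individual test must beat alpha / number of tests.
bonferroni_threshold = alpha / len(p_values)
individually_significant = [p for p in p_values if p <= bonferroni_threshold]

# Combined test: a single harmonic mean for the whole group.
hmp = harmonic_mean_p(p_values)

print("Bonferroni threshold per test:", bonferroni_threshold)
print("Tests passing Bonferroni:", individually_significant)
print("Harmonic mean p-value:", round(hmp, 4))
```

In this toy example no single test passes the Bonferroni threshold of 0.01, yet the harmonic mean of the group is below 0.05, which is the kind of gain in power a combined test is meant to deliver.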
There are fewer chromosomes than genetic variants, so testing chromosomes reduces the number of tests. That's why, in this graph of minus log10 adjusted p-values, none of the individual genetic variants is significant, whereas combining them at the whole-chromosome level gives significant evidence of an association between variants on chromosome 12 and the trait of interest. So we reduce the burden of multiple testing, and that increases statistical power. The trade-off is that discoveries are made at a coarser scale: individual chromosomes instead of individual variants.
One of the cool things about the HMP, though, is that it's a valid post-hoc test. That means that you don't need to pre-determine the groups. You can combine the p-values in arbitrary ways, for example by gene, by pathway, or in sliding windows of varying sizes. It's even valid to find the smallest group of combined p-values that remains significant.
Let's see how the HMP is used in practice. This example is based on the vignette that you can find on the harmonic mean p website. The package is downloaded from CRAN by the install.packages command. Once installed, the package is loaded in the usual way using the library or require command.
Following the vignette, we can load example data. This data set is from a Nature Genetics paper studying neuroticism by Okbay and colleagues. The data comprise tests of all variants on human chromosome 12. The first column, RS, is a label for the genetic variant. The second, POS, is the position of the variant along chromosome 12. And the third, P, is the p-value testing the null hypothesis of no association between the genetic variant and neuroticism. We're going to test the combined null hypothesis of no association between any genetic variant on chromosome 12 and neuroticism. To do this, we need a few quantities. First, we need to know the total number of tests. Okbay and colleagues analysed about six and a half million variants.
Next, we need to know the number of tests specifically on chromosome 12. There were about 312,000 variants analysed on chromosome 12. Next, we need to specify the overall target false positive rate, for example 0.05. These quantities determine the significance threshold for the combined test. It's simply the overall false positive rate multiplied by the proportion of tests being combined. That gives a threshold for chromosome 12 of about 0.002. We perform the test in one line using the p.hmp function. We provide it with the p-values from chromosome 12 and the total number of tests. Finally, we compare the combined p-value with the relevant threshold. It's significant even though no individual test was significant on chromosome 12. So we conclude that there is an association somewhere on chromosome 12. The next step would be to narrow down the signal to the smallest group of combined p-values that remains significant on chromosome 12.
The theory of the HMP is interesting in and of itself. The test statistic is the harmonic mean p-value, originally proposed as a rule of thumb by I. J. Good in the 1950s. The combined p-value is the probability that the HMP would be as small as, or smaller than, the observed value if the combined null hypothesis were true. This involves an arithmetic mean of inverse p-values, which themselves follow individual heavy-tailed Pareto distributions. Sums of heavy-tailed random variables have special properties, including self-similarity in the tail. Therefore, the combined p-value is approximately equal to the HMP itself when it is small enough. And this property is robust to dependence, again when the HMP is small enough. This makes the HMP procedure more appropriate for combining dependent tests than Fisher's method, for example. The HMP procedure applies the generalized central limit theorem to obtain a combined p-value from the directly calculated HMP.
To learn more about the HMP, check out the paper and the harmonic mean p package on CRAN. Thanks for listening.
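The threshold arithmetic from the chromosome 12 example can be checked with a short Python sketch. The counts are the rounded figures quoted in the talk, not exact values, and the helper is_significant is a hypothetical name introduced only for illustration; it is not part of the package.

```python
# Rounded figures quoted in the talk (assumed, not exact counts).
total_tests = 6_500_000   # variants analysed genome-wide
chr12_tests = 312_000     # variants analysed on chromosome 12
alpha = 0.05              # overall target false positive rate

# Proportion of all tests that are being combined.
weight_chr12 = chr12_tests / total_tests

# Threshold for the combined test: overall rate times that proportion.
threshold_chr12 = alpha * weight_chr12

def is_significant(combined_p, threshold):
    """Hypothetical helper: compare a combined p-value with its threshold."""
    return combined_p <= threshold

print(f"Proportion of tests on chromosome 12: {weight_chr12:.3f}")
print(f"Threshold for chromosome 12: {threshold_chr12:.4f}")
```

With these rounded inputs the threshold comes out at 0.0024, consistent with the figure of about 0.002 given in the talk.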