### Editors

#### Editor-in-Chief

Thomas F. Lüscher

#### Deputy Editors

Bernard J. Gersh

Gerhard Hindricks

Ulf Landmesser

Frank Ruschitzka

William Wijns

# Statistical Guidelines

The application of adequate statistical methods is a prerequisite for publication in the *European Heart Journal* (EHJ) (for a basic statement see ‘Uniform Requirements for Manuscripts Submitted to Biomedical Journals’, *Ann Intern Med* 1997 **126**: 36-47). The rationale of the EHJ regarding the statistical methods applied is ‘*Be as simple as possible, but as sophisticated as needed*’. For example, clinical trials with their formalized framework have to meet more specific statistical standards than many pathophysiological studies. Below are summarised relevant points (and pitfalls) regarding study design, analysis and reporting. For studies with a sophisticated design, the collaboration of a professional statistician is recommended.

The design and data analysis should be briefly but clearly described in ‘Methods’. Please specify the software used, the significance level and whether testing is two-sided (which is the rule). If feasible, use standard methods. Whenever uncommon or new statistical methods are applied, a reference has to be given. For unusually complicated or innovative methods, it is recommended to provide a detailed description as ‘supplementary material’ for the interested reader.

### Design of Study

Most studies will start with some *a priori* hypotheses and the design of the study should be chosen in accordance with these goals. For example, the authors should indicate if the study is randomized or observational, prospective or retrospective, and involves single or multiple centers. Studies started to generate hypotheses should be described clearly as such. Both the population of interest and how a representative sample has been collected must be defined. Inclusion and exclusion criteria need to be specified, as must the way missing data are handled. Specify as few goals or endpoints as possible in order to avoid the multiple testing problem (see below). Studies should be done double-blind and assignments to groups or experiments should be randomized, if feasible. For clinical trials a power analysis to justify the number of subjects included is indispensable, but is also desirable for other studies. Exceptions are, for example, studies for rare diseases, where a rationale for the number of subjects is adequate. Post-hoc power analysis is to be avoided. For observational studies, the selection of subjects has to be carefully documented.
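As an illustration of the power analysis mentioned above, the following minimal sketch estimates the number of subjects per group for a two-sided comparison of two means, using the normal approximation. The function name `n_per_group`, the default power of 80% and the example values are assumptions chosen for illustration, not EHJ requirements; a professional statistician would typically refine such a calculation (for example with the t-distribution and an allowance for drop-outs).

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-sample comparison of means.

    delta: smallest difference in means considered relevant (assumed by the user)
    sd:    assumed common standard deviation of the outcome
    """
    if delta == 0 or sd <= 0:
        raise ValueError("delta must be non-zero and sd must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical example: detect a 5-unit difference with SD 10
print(n_per_group(delta=5, sd=10))
```

Under these assumptions, a higher desired power or a smaller relevant difference increases the required sample size.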

### Descriptive Statistics

For continuous data that are close to a normal distribution provide means and standard deviations (not standard errors). For non-normal, skewed data such as duration, BMI etc. give medians and boundaries of interquartile ranges. Do not use more than two relevant digits. Categorizing continuous data (e.g. into quartiles, quintiles) is discouraged. It leads to a loss of information, usually requires more complicated methods than continuous data do, and introduces demarcations which are valid only for this particular study.
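The choice between the two summaries can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `summarize` and the `skewed` flag are hypothetical, and the quartile definition used (`method="inclusive"`) is one of several conventions in use.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values, skewed=False):
    """Return mean and SD for near-normal data, or median and quartiles for skewed data."""
    if skewed:
        # Boundaries of the interquartile range (one common quartile convention)
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        return {"median": median(values), "q1": q1, "q3": q3}
    # Sample standard deviation (not the standard error)
    return {"mean": mean(values), "sd": stdev(values)}


# Hypothetical data
print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
print(summarize([1, 2, 3, 4, 5], skewed=True))
```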

Give absolute numbers and percentages for count data, in particular for small studies. When sample size is below about 200, percentages should be given without decimal places. Otherwise one decimal place is usually sufficient. Report only the relevant correlations, accompanied by a confidence interval.
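The reporting rule for count data can be written as a small helper. This is a sketch only: the function name `format_count` is hypothetical, and the threshold of 200 is taken from the guideline text above, which describes it as approximate.

```python
def format_count(k, n):
    """Format a count as 'k/n (pct%)', without decimals for samples below about 200."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n and n > 0")
    decimals = 0 if n < 200 else 1
    return f"{k}/{n} ({100 * k / n:.{decimals}f}%)"


# Hypothetical counts
print(format_count(12, 85))
print(format_count(45, 300))
```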

Instead of presenting large tables, presentation in graphic form is preferred to make it easier to grasp the key message. If quantitative information is provided in addition, it should be relegated to ‘supplementary material’. Figures need to be done with great care to get the optimum out of them:

- appropriate relation of x- to y-axis
- sufficiently thick lines
- information densely packed, but not too densely (e.g. not too many curves in one panel)
- consideration of whether to include the point (0, 0) in the graph
- use of colour instead of dot, dash-dot etc in order to help the reader to grasp the message quickly (there are no colour charges in the EHJ)
- the axis should not be exaggerated to artificially inflate a minor difference, e.g. by truncating the axis

Pay attention to psychophysiological principles when designing a good figure.

Kaplan-Meier curves: it is recommended to give the number of people at risk for different strata and different time points.
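A table of numbers at risk can be derived directly from the follow-up times. The sketch below assumes that each subject contributes one follow-up time (time to event or to censoring) and that a subject is at risk at time t if that follow-up time is at least t; the function name and the data are hypothetical.

```python
def numbers_at_risk(follow_up, time_points):
    """Count subjects still at risk in each stratum at each time point.

    follow_up:   dict mapping stratum name to a list of follow-up times
                 (time to event or censoring, whichever came first)
    time_points: times at which the number at risk is reported
    """
    return {
        stratum: [sum(1 for t in times if t >= tp) for tp in time_points]
        for stratum, times in follow_up.items()
    }


# Hypothetical follow-up times in months for two strata
table = numbers_at_risk(
    {"treatment": [2, 5, 7, 12], "control": [1, 3, 12, 12]},
    time_points=[0, 6, 12],
)
for stratum, counts in table.items():
    print(stratum, counts)
```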

### Model Building

Often, the goals necessitate a model building step. This can be a classical quantitative regression where the outcome is continuous and approximately normally distributed, and where the predictors are pretty straightforward. But more often the outcome is binary or time to event data, where logistic and survival models (including the Cox model) are used. Whatever the type of regression, the choice of predictors can be complex, yet is a crucial step in the analysis. Therefore, this step needs some elaboration in the manuscript.

In particular, authors should avoid automated stepwise procedures to select a set of predictors. Such procedures have a tendency to include spurious predictors or to miss influential predictors. Also the results are not transportable to similar data sets, since the set of predictors is random. Furthermore, bivariate screening of covariates is discouraged (Sun et al. 1996, *J Clin Epidemiol* **49**: 907-16). External clinical judgment is ideal for selection of predictors and results from the literature are also worthwhile. Otherwise model averaging or penalized maximum likelihood methods are considered appropriate model selection procedures. There are various statistical strategies to assess model choice and performance, bootstrapping being a prominent example. For more information see papers by Harrell and others in the statistical literature (e.g. *Stat Med* 1996, **15**:361-387). The authors should specify all variables initially considered as candidates for the model and the approach they adopted to derive the final model. The choice of confounders might be less critical than the choice of truly explanatory variables.

Various model assumptions need to be checked: for multiple linear regression residual analysis should be used to check for approximate normality and linearity. For Cox regression the proportionality assumption should be checked.

The modeling of repeated data – repeated within person – can be demanding. This occurs when there are one or more follow-ups, or when recordings are made at more than one location etc. These within-person data are correlated. Modeling such data would be easier if the correlations were all of the same size (‘compound symmetry assumption’). However, this is rarely the case, since correlations become lower with increasing distance (in time or location). In an ANOVA setting one has, in this case, to perform a Greenhouse-Geisser correction of p-values. For various designs there are other methods to analyze longitudinal data, which are not only statistically correct, but also offer more information about the development of some variable and group compared to a purely cross-sectional analysis. In the simple case of one follow-up, use the paired t-test or the Wilcoxon signed rank test instead of their two-sample analogs.
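For the simple case of one follow-up, the paired t-test works on the within-person differences rather than on the two sets of measurements separately. The following sketch computes only the test statistic and its degrees of freedom; the function name and data are hypothetical, and the p-value would be obtained from the t-distribution using a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline, follow_up):
    """Return (t, degrees of freedom) for paired measurements on the same subjects."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need two equally long series with at least two subjects")
    # Within-person differences remove the between-person variability
    diffs = [f - b for b, f in zip(baseline, follow_up)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # fails if all differences are identical
    return mean(diffs) / standard_error, n - 1


# Hypothetical measurements on four subjects
t, df = paired_t_statistic([10, 12, 14, 16], [11, 14, 15, 18])
print(round(t, 3), df)
```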

Clustered data are another setting which needs some care. Examples include a varying number of lesions under study per subject, or a varying number of family members when families are studied. Again, the within-cluster data have a correlation structure as for repeated data, but the number of repetitions is itself random. Possibilities to analyze such data include mixed linear modeling, with the cluster as a random effect, or more broadly generalized estimating equations. As for repeated data, a professional statistician should usually be involved with this type of study. In a case where one has 270 stents for 250 patients, a patient-wise analysis is pragmatically indicated, taking one stent per patient.

Proving significance of an association – even in a quite complicated model – does not imply causality of the relationship. This has to be inferred by other arguments.

Whenever data are collected with measurement error, the parameters reflecting the strength of association in a multivariable model become biased. Since these measurement errors inflate the variance, the associations (correlations, regression coefficients, odds-ratios etc) are weaker than they truly are. A typical variable with substantial measurement error is blood pressure.
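The weakening of associations can be made explicit in the simplest setting. As an illustrative assumption (not part of the guidelines), consider a simple linear regression with a single predictor $X$ that is observed as $W = X + U$, where the error $U$ has mean zero and is independent of $X$ and of the outcome. The slope estimated from $W$ then converges not to the true slope $\beta$ but to

$$
\beta^{*} = \lambda\,\beta, \qquad \lambda = \frac{\sigma_X^{2}}{\sigma_X^{2} + \sigma_U^{2}} \le 1 ,
$$

where $\sigma_X^{2}$ is the variance of the true predictor and $\sigma_U^{2}$ the variance of the measurement error. The larger the error variance, the stronger the attenuation. With several predictors the situation is more involved and is best discussed with a statistician.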

### Statistical Tests

Statistical tests offer a rational decision in case of uncertainty. This probabilistic statement is possible only at the price of formality. If your descriptive analysis is in favour of your hypothesis, but the p-value is not significant, you could not prove your case and you should concede this. Even with a p-value 0.05

There are several points to consider:

**Meeting assumptions for testing**: The t-test and many other tests assume approximately normally distributed data. This has to be checked from one’s own data or via the literature. Checking can be done graphically (box plots, histograms), by computing skewness and kurtosis or by applying the Shapiro-Wilk test, and there are other means to do so. In simple situations with non-normal data, one can use a non-parametric test for non-normal data (e.g. a Mann-Whitney test instead of the two-sample t-test). If this is not possible one should consider an appropriate transformation of the data to achieve approximate normality (the log-transformation is just one prominent example). Often the transformed scale offers advantages in terms of variability and/or interpretation. Please consult a professional statistician.
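Skewness and kurtosis, mentioned above as one way of checking normality, can be computed from the central moments. The sketch below uses the simple moment estimators; the function name is hypothetical, and statistical packages often apply small-sample corrections that give slightly different values. For normally distributed data both quantities are close to zero.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both about 0 for normal data)."""
    n = len(values)
    centre = fmean(values)
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    m4 = sum((x - centre) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical right-skewed data, e.g. durations
skew, kurt = skewness_and_excess_kurtosis([1, 1, 2, 2, 3, 4, 6, 15])
print(round(skew, 2), round(kurt, 2))
```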

**Selection of Chi-square vs. Fisher’s exact test (or other variants of exact tests for cross-tables)**: For categorical data, statistical testing is often done with a Chi-square test. However, for studies with small sample sizes, an exact test such as Fisher’s exact test should be used rather than a Chi-square test.
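For a 2×2 table, Fisher’s exact test can be computed directly from the hypergeometric distribution. The following is a minimal sketch of the two-sided version that sums the probabilities of all tables no more likely than the observed one (a common convention, assumed here); the function name is hypothetical and in practice a validated statistics package should be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum all tables that are at most as probable as the observed one
    p = sum(
        px for px in (prob(x) for x in range(lowest, highest + 1))
        if px <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p)


# Hypothetical small study: 8/10 events in one group, 1/6 in the other
print(round(fisher_exact_two_sided(8, 2, 1, 5), 4))
```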

**Presentation of p-values**: always give the numeric value, not just ‘ns’ (for example, p=0.14 and not ns). Give one leading digit for the p-values and, in the case of very small p-values, give p<0.0001.

**Confidence intervals**: the main results always need a confidence interval.
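As one example of a confidence interval for a main result, the sketch below computes the Wilson score interval for a proportion. The choice of the Wilson method and the function name are assumptions made for illustration; the guidelines do not prescribe a particular interval, and other results (means, hazard ratios etc.) need their own methods.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(events, n, level=0.95):
    """Wilson score confidence interval for a proportion events/n."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need 0 <= events <= n and n > 0")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = events / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# Hypothetical result: 10 events among 100 patients
low, high = wilson_interval(10, 100)
print(f"10% (95% CI {100 * low:.1f}% to {100 * high:.1f}%)")
```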

**p-value is not significant**: if the p-value does not reach significance, this is not proof that the scientific hypothesis is not correct. Other factors such as sample size, frequency of the given event or the relative effect size may explain the lack of statistical significance. Whenever a power calculation has been done, a probabilistic statement about the lack of effect can be made.

**Multiple testing**: If you test for more than one primary outcome or assess the risk of a number of polymorphisms, for example, you have to correct for error inflation due to multiple testing. The simplest procedure is a Bonferroni correction.
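The Bonferroni correction compares each p-value with the significance level divided by the number of tests or, equivalently, multiplies each p-value by the number of tests. A minimal sketch, with a hypothetical function name and hypothetical p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each test after Bonferroni correction."""
    m = len(p_values)
    return [(min(1.0, p * m), p <= alpha / m) for p in p_values]


# Hypothetical p-values for three primary outcomes
for adjusted, significant in bonferroni([0.01, 0.02, 0.04]):
    print(f"adjusted p={adjusted:.2f}, significant: {significant}")
```

The correction is simple but conservative, in particular when the number of tests is large or the tests are correlated.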

**One- or two-sided**: as a rule, testing should be performed two-sided.

**Comparison of p-values**: refrain from comparing the size of p-values when, for example, assessing differences for various groups. Such a comparison requires an additional formal test.

**Significance vs. relevance**: a statistically significant result does not imply clinical or scientific relevance even if the p-value is very low. This relevance is a subject-matter decision. In addition to significant p-values it is advisable to give a measure of effect size, in particular for large studies, where effects with little relevance may become statistically significant. The boundaries of a confidence interval may also be useful to assess relevance of potential effects.

**Only compare groups which are comparable**: If groups are unequal in relevant aspects such as age, some measure reflecting the group differences should be used when testing e.g. for therapeutic differences. This can be done by using covariates or a propensity score. When reporting a study that uses a propensity score, the Methods section should include a description of the type of propensity analysis (matching, stratification etc), what covariates were used in the propensity score and whether balance was achieved with the propensity score. In a randomized trial of sufficient size, imbalances between groups should not be an issue if the randomization worked as designed.
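Whether balance was achieved is often reported per covariate as a standardized mean difference between the groups. The sketch below is one possible way to compute it and is an assumption for illustration, not a method prescribed by the guidelines; the function name and the data are hypothetical. Values close to zero indicate good balance (an absolute value below about 0.1 is a frequently used convention).

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation of the two groups."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        raise ValueError("covariate has no variability in either group")
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical ages in two groups after matching
print(round(standardized_mean_difference([61, 64, 66, 70, 59], [60, 65, 67, 69, 59]), 3))
```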

### Special Topics and Guidelines

For some special topics, expert meetings have led to specialized guidelines. As a rule authors should adhere to these guidelines.

Clinical trials: please consult the CONSORT (CONsolidated Standards of Reporting Trials) statement. Check if you have attached the CONSORT checklist. It is advisable to make a copy of the protocol available.

When submitting a systematic review of clinical trials, please check that you adhere to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement.

When planning a study on diagnostics, please consult the STARD (STAndards for the Reporting of Diagnostic accuracy studies) statement. Authors should complete the STARD checklist, or at the latest cross-check their manuscript against it before submission.

Before submitting an epidemiological study, be sure that the STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) requirements are observed. The STROBE checklist is available here.
