Despite several recent meta-analyses on the topic, the comparative risk of hepatocellular carcinoma in patients with chronic hepatitis B (CHB) receiving entecavir (ETV) or tenofovir disoproxil fumarate (TDF) remains controversial. The controversy partly results from the arbitrary nature of significance levels, which leads to contradictory conclusions from very similar datasets. However, the use of observational data, which is prone to both within- and between-study heterogeneity of patient characteristics, adds further uncertainty. The asynchronous introduction of ETV and TDF in East Asia, where the majority of these studies have been conducted, further complicates analyses, as does the ensuing difference in follow-up time between ETV and TDF cohorts. Researchers conducting meta-analyses in this area must make many methodological decisions to mitigate bias but are ultimately limited by the methodologies of the included studies. It is therefore important for researchers, as well as the audience of published meta-analyses, to be aware of the quality of observational studies and meta-analyses in terms of patient characteristics, study design and statistical methodologies. In this review, we aim to help clinicians navigate the published meta-analyses on this topic and to provide researchers with recommendations for future work.
The primary manifestations of CHB are in the liver, although extrahepatic manifestations also contribute to the burden of disease. CHB is associated with the long-term risk of hepatocellular carcinoma (HCC), which is substantially reduced in patients who receive treatment with nucleos(t)ide analogue (NA) therapies.
However, the relative effectiveness of these therapies in reducing HCC risk in patients with CHB remains an outstanding question; between December 2019 and November 2020, 9 meta-analyses were published comparing HCC risk between ETV and TDF treatment (Table 1).
Despite the meta-analyses including similar primary studies (Table S1), they differ in the statistical significance of their results and their clinical interpretation. While some authors concluded that TDF is associated with a lower risk of HCC than ETV and should be the preferred treatment for CHB, others concluded that the risk of HCC is broadly similar between TDF and ETV. None have reported a higher risk of HCC with TDF than ETV. The conclusions of the meta-analyses are influenced by whether or not the results are statistically significant. However, the framework of statistical significance testing has been the subject of much debate in recent years, with calls for a change in statistical practice.
Statistical interpretation aside, a variety of reasons may contribute to the differing results from the meta-analyses. The lack of randomised studies on this topic leads to a reliance on observational studies, which introduce substantial heterogeneity into any meta-analysis. Differences between studies, such as in the inclusion/exclusion criteria, the duration of follow-up and the methodologies of analysis, pose challenges when performing aggregate meta-analysis. Differences within studies between patients who receive TDF and ETV also need to be accounted for; the earlier introduction of ETV in many countries and disparities in treatment and reimbursement guidelines could lead doctors to prescribe ETV and TDF to different groups of patients, who could have different underlying risks of HCC.
Differences in how adequately these sources of heterogeneity are addressed ultimately lead to divergent results between meta-analyses. This review therefore outlines the challenges that researchers face when assessing the comparative risk of HCC in patients with CHB and summarises how previous meta-analyses have addressed these difficulties. We hope this review can help the community navigate the literature around HCC risk in patients who receive treatment with NAs.
Interpretation of statistical significance testing results
Although Table 1 suggests conflicting results between the meta-analyses, with some reporting significant differences between TDF and ETV and others reporting non-significant differences, these conclusions rely on a discretionary framework of statistical significance testing, which has been the subject of much debate in recent years. As an example of the limitations of this framework, 3 of the meta-analyses reported p values between 0.04 and 0.041; had the significance threshold (which is set by convention at p <0.05 and is in fact arbitrary) been stricter (e.g. p <0.01), all of these meta-analyses could have concluded that there was a lack of evidence for a difference in HCC risk between the treatments.
While statistical significance testing can be useful, interpretation of significant or non-significant results must be made with caution.
Beyond differences in statistical significance, there is more agreement than disagreement in the results of these meta-analyses (Fig. 1). Nevertheless, this does not mean that all the meta-analyses are of equal quality; it thus remains important to understand the methodological challenges associated with performing these analyses.
Role of observational data
Randomised controlled trials (RCTs) are often considered the gold standard of evidence for comparing treatment efficacy. However, due to the low incidence rate of HCC in CHB patients, trials would have to recruit thousands of patients with a suitably long follow-up time to produce a meaningful result. For example, using the 5-year incidence rates published by Tseng et al. 2020 (3.44% and 3.39% for propensity matched ETV and TDF populations, respectively) to estimate the average probability of observing an HCC event, and taking the most conservative HR estimate from Table 1 (0.88), approximately 75,000 patients would need to be enrolled and followed for 5 years to achieve 90% power. Only 2 RCTs comparing HCC risk across ETV- and TDF-treated CHB patients have been published in the last 5 years, both with 144 weeks of follow-up; the first observed 5 HCC cases in 400 CHB patients, and the second observed 0 cases in 320 patients.
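The sample size quoted above can be approximately reproduced with a short calculation. The sketch below is an illustration only: the text does not state which formula was used, so the Schoenfeld approximation for the number of events in a two-arm survival comparison, a two-sided significance level of 0.05 and 1:1 allocation are all assumptions made here. The function names are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hr: float, alpha: float = 0.05, power: float = 0.90) -> int:
    """Events needed to detect `hr` with 1:1 allocation (Schoenfeld approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test (assumed)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hr) ** 2)


def required_patients(hr: float, event_probability: float, **kwargs) -> int:
    """Patients needed so that the expected number of events reaches the target."""
    return ceil(required_events(hr, **kwargs) / event_probability)


# Inputs quoted in the text: 5-year incidence of 3.44% (ETV) and 3.39% (TDF), HR = 0.88.
avg_event_probability = (0.0344 + 0.0339) / 2
events = required_events(0.88)
patients = required_patients(0.88, avg_event_probability)
print(f"events needed: {events}, patients needed: {patients}")
```

Under these assumptions the result is on the order of 75,000 patients, in line with the figure given in the text. The calculation also shows why the published RCTs, with a few hundred patients each, cannot be expected to detect a difference of this size.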
In contrast to the limited RCT evidence, observational studies may include many thousands of patients with multi-year follow-up, therefore tending to have a greater statistical power to detect differences between treatments. While many observational studies have been published comparing HCC incidence in patients treated with ETV and TDF, their results are not always consistent. Due to the nature of observational research, the studies have substantial heterogeneity. Table S2 summarises the key characteristics of the primary studies included in Choi 2020, Tseng 2020 and Cheung 2020; not only are there differences in the characteristics between each study, but there is also heterogeneity between the TDF and ETV arms within individual studies. This heterogeneity poses challenges when performing aggregate meta-analysis, as discussed in detail in the following sections.
Observational studies based on administrative healthcare data, such as insurance claims data, often lack the detailed clinical data that studies using medical records provide. Researchers can therefore only adjust for the heterogeneity they are aware of, and may miss potentially crucial unreported heterogeneity. Accordingly, meta-analyses often find that the smallest HRs are derived from claims database analyses. For example, Cheung et al. 2020 reported an adjusted HR of 0.63 from electronic databases, whereas the adjusted HR was 0.97 in cohort studies using clinical records.
The bias introduced by administrative claims databases is particularly noteworthy due to the large sample sizes of such studies, resulting in heavy weighting in meta-analyses. For example, the Choi et al. 2019 nationwide Korean administrative claims analysis had a total sample size of 24,156, resulting in >10% of the weighting when included in the meta-analyses listed in Table 1.
Any bias in the results from this study would therefore have a disproportionately large effect on the meta-analysis estimates.
Within-study heterogeneity in patient characteristics
Studies comparing the HCC incidence between ETV and TDF start with the null hypothesis that there is no difference between the 2 treatment arms. However, internal validity may be compromised because of an imbalance between the treatment arms in studies lacking randomisation, such as observational studies. When patients are non-randomly allocated to treatment, systematic differences in patient characteristics, and therefore baseline risk of HCC, can produce confounded study results (Table S2).
Such biases are particularly relevant when comparing ETV and TDF in observational studies, due to the asynchronous introduction of these medications. In East Asia, where most studies on this topic have been conducted, ETV was introduced around 2006, whereas TDF was not available in the region until 2011. Prior to the arrival of ETV, anti-HBV medications had limited potency, were susceptible to viral resistance and cross-resistance between medications, and often had to be given in combination. Patients initially receiving ETV included those with resistance to prior medications, who may therefore have had incomplete viral suppression and been at persistent risk of HCC while on ETV. Other patients may have received ETV and adefovir combination therapy, the majority of whom would have been switched to TDF when it became available; given the likely multidrug resistance mutations in these patients, they could be a particularly difficult group to treat in any TDF cohort. Therefore, the bias caused by the asynchronous introduction of TDF and ETV may be in either direction.
Analyses focusing on treatment-naïve patients may avoid unmeasured confounding related to prior treatment experience. Nonetheless, naïve patients with different risks of HCC may be directed to preferentially take TDF over ETV, or vice versa. For example, ETV may be advised in elderly patients or those with underlying renal or bone disease, groups who may have a higher baseline risk of developing HCC, which may bias results in favour of TDF.
These differences can also be seen in practice; of the 1,325 patients considered by Kim et al. 2018, ETV-treated patients were on average older than TDF-treated patients (52 years vs. 50 years) and had higher serum HBV DNA (6.4 vs. 6.0 log10 IU/ml; p <0.001), both of which are associated with a greater risk of HCC development.
Between-study heterogeneity in patient characteristics
In addition to being imbalanced within studies, the variables highlighted in Table 2 may also be imbalanced between studies (further detailed in Table S2). As the basic tenet of meta-analyses is homogeneity of primary studies, such imbalances introduce challenges, i) in preventing bias entering the meta-analysis, and ii) in appropriately interpreting the results of the meta-analysis. The degree of such heterogeneity in the estimates from meta-analyses can be expressed with the I² statistic. Table 1 reports the I² statistic for each meta-analysis; most meta-analyses had I² values in the range of 40-60%, indicating moderate-to-substantial heterogeneity.
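To make the I² statistic concrete, the following minimal sketch computes it from study-level log HRs and standard errors using Cochran's Q with inverse-variance weights. The input values are hypothetical and are not taken from any of the cited studies or meta-analyses; the function name is likewise illustrative.

```python
from math import log


def i_squared(log_hrs: list[float], std_errors: list[float]) -> float:
    """I² (in %) from study-level log HRs and their standard errors."""
    weights = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, log_hrs)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_hrs))
    df = len(log_hrs) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Hypothetical study estimates (NOT taken from the cited literature).
hypothetical_hrs = [0.60, 0.75, 0.95, 1.05, 0.80]
hypothetical_ses = [0.15, 0.20, 0.12, 0.18, 0.25]
i2 = i_squared([log(hr) for hr in hypothetical_hrs], hypothetical_ses)
print(f"I2 = {i2:.0f}%")
```

I² expresses the share of the observed variation in estimates that exceeds what chance alone would produce, so a value of 0% is returned whenever Q does not exceed its degrees of freedom.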
Taking one example of heterogeneity between studies, patients with cirrhosis are at higher risk of developing HCC than patients without cirrhosis, and thus the impact of antiviral treatment may be most clearly seen in studies on patients with cirrhosis. On the other hand, the benefit of antiviral therapy may be greatest early in the disease course, when reducing/halting viral replication prevents viral mutagenesis of host DNA, reduces the influence of viral proteins (especially HBx) on host DNA transcription and lessens inflammatory damage and thus fibrosis/cirrhosis development.
Given the complex relationship between HBV infection, cirrhosis, HCC risk and anti-HBV medication activity, it is questionable whether it is appropriate to aggregate estimates of HCC risk from studies with predominantly non-cirrhotic cohorts with those from studies with predominantly cirrhotic cohorts. Meta-analyses have previously addressed this complexity through sub-group analyses, yet differences in results between meta-analyses are apparent. In Tseng et al. 2020, the adjusted HR for patients with cirrhosis was 0.84 compared to 0.66 for patients without cirrhosis, implying a greater reduction in risk for TDF-treated patients without cirrhosis.
Meta-analyses including studies on patients with and without cirrhosis may not reflect the true HR of either group, and even in the case of sub-group analyses, interpretation may be difficult. Readers should take care to interpret sub-group results in the context of their confidence intervals – if these are wide or overlap (particularly the case in patients without cirrhosis, where there are few HCC events), it may indicate a lack of certainty in the conclusions that can be drawn.
Fig. 2 illustrates this problem of heterogeneity in meta-analyses. It considers 2 hypothetical meta-analyses, one with high between-study heterogeneity and one with low between-study heterogeneity. The high heterogeneity meta-analysis aggregates estimates from a mixture of patient groups (cirrhotic, non-cirrhotic and mixed), whereas the low heterogeneity meta-analysis aggregates estimates from a homogeneous pool, using only studies in patients without cirrhosis. The variation across different patient groups means the high heterogeneity meta-analysis relies on inconsistent estimates. As a result, the overall HR has high uncertainty (as represented by the wide confidence interval), resulting in a non-significant result. Combining heterogeneous patient groups may therefore reduce the precision of the overall estimate, potentially leading to false-negative (type 2) errors.
Without adjusting for differences in the data, the overall estimate generated by the high heterogeneity meta-analysis is difficult to interpret. It reflects neither the HR of patients without cirrhosis nor the HR of patients with cirrhosis. In contrast, the estimate of the low heterogeneity meta-analysis is easier to interpret, being specific to patients without cirrhosis; this estimate will have higher precision compared to the high heterogeneity analysis, resulting in greater certainty around the direction and magnitude of any effect. However, if meta-analyses focus excessively on homogeneous patient populations, the generalisability of the findings may be limited.
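The contrast between the two hypothetical meta-analyses can be sketched numerically. The example below pools two sets of invented HRs with a DerSimonian-Laird random-effects model; this estimator is an assumption chosen for illustration, as the text does not specify how the hypothetical analyses were pooled. All inputs are hypothetical and share the same standard error, so any difference in the width of the confidence interval comes from between-study variation alone.

```python
from math import exp, log, sqrt


def random_effects_pool(hrs, ses):
    """DerSimonian-Laird pooled HR with a 95% confidence interval."""
    ys = [log(hr) for hr in hrs]
    w = [1 / se ** 2 for se in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ys) - 1)) / c)  # between-study variance
    w_re = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se_pooled = sqrt(1 / sum(w_re))
    return (exp(pooled),
            exp(pooled - 1.96 * se_pooled),
            exp(pooled + 1.96 * se_pooled))


# Hypothetical inputs: identical standard errors, different spread of HRs.
ses = [0.15] * 6
mixed_population_hrs = [0.55, 0.65, 0.80, 0.95, 1.10, 1.25]  # cirrhotic, non-cirrhotic, mixed
non_cirrhotic_hrs = [0.74, 0.78, 0.80, 0.82, 0.84, 0.86]     # homogeneous pool

high = random_effects_pool(mixed_population_hrs, ses)
low = random_effects_pool(non_cirrhotic_hrs, ses)
for label, (hr, lower, upper) in [("high heterogeneity", high), ("low heterogeneity", low)]:
    print(f"{label}: HR {hr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With these invented inputs, the heterogeneous pool yields a confidence interval that spans 1, while the homogeneous pool yields a narrower interval that excludes it, even though the two pooled HRs are similar.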
Heterogeneity in follow-up time
The challenges discussed so far are relatively well-known problems associated with analysing observational data, and most of the meta-analyses employed some form of heterogeneity or quality assessment to examine the effects of imbalances in patient characteristics. However, a smaller number of these meta-analyses also addressed a challenge more specific to comparing ETV and TDF: within-study differences in follow-up time.
Due to the earlier introduction of ETV compared with TDF, longer follow-up durations are more often available for ETV-treated patients (Table S2). For example, in Kim et al. 2018, ETV-treated patients had a median follow-up duration of 66 months compared to 33 months for TDF-treated patients.
Assuming that TDF and ETV cohorts are perfectly matched at baseline (i.e. all patients are treatment-naïve, are similarly aged and have similar levels of fibrosis/cirrhosis), the longer follow-up time for ETV may result in differences in the cohorts as follow-up drops off substantially faster in the TDF arm (attrition bias). On the one hand, this may lead to an overall lower risk of HCC in the cohort with the longer follow-up, as long-term survivors form a greater proportion of the population and as the protective effect of anti-HBV treatment against HCC becomes more pronounced over time.
On the other hand, the longer follow-up may lead to more events being detected in the ETV group. Either way, comparing HCC incidence for a cohort with 66 months of follow-up vs. one with 33 months, even if baseline characteristics were similar, is akin to comparing 2 different cohorts; the cohort with longer follow-up is more likely to be enriched with patients having achieved HBeAg loss, HBsAg loss and regression of fibrosis/cirrhosis.
Research from the PAGE-B cohort (where the ETV and TDF sub-cohorts had 7.6 and 7.5 years of follow-up, respectively) demonstrated more frequent regression of cirrhosis in TDF-treated compared to ETV-treated patients at year 5 after treatment initiation (73.8% vs. 61.5% respectively, p = 0.038).
The suggestion of a difference in the impact of ETV and TDF on fibrosis/cirrhosis regression after 5 years, and the potential longer term benefits of this on HCC incidence, reinforces the idea that comparisons between cohorts with significantly differing follow-up times are inappropriate.
All published studies used a Cox proportional-hazards model to calculate HRs, which assumes that the relative risk of HCC between TDF and ETV groups remains constant throughout the study period. If this assumption held, differing follow-up durations would matter little; however, the assumption is rarely tested. Indeed, the meta-analyses by Li et al. 2020 and Tseng et al. 2020 found that in studies where ETV follow-up was ≥12 months longer than TDF follow-up, there was a significantly lower rate of HCC development in patients treated with TDF, whereas studies with more equal follow-up between arms reported no significant difference.
With the evidence suggesting that a difference in follow-up does bias results, researchers must employ additional methods to address follow-up differences between arms.
Potential technical solutions
The principal tools for researchers conducting observational studies to address the issue of within-study heterogeneity are propensity score matching (PSM) and covariate adjustment, both of which require individual patient data (IPD):
PSM selects similar (‘matched’) patients from each arm to include in the analysis, based on the baseline characteristics known to affect HCC risk. Patients whose characteristics are significantly different, for whom a ‘match’ cannot be made, are discarded. In doing so, the matched arms have near-balanced patient characteristics.
Covariate adjustment uses regression to adjust the estimate of HCC risk based on the imbalances in baseline characteristics between arms. In doing so, the adjusted estimates allow treatments to be compared as if there were no baseline differences in the chosen covariates.
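The matching step of PSM described above can be sketched as follows. This is a simplified illustration rather than the procedure used by any cited study: it assumes the propensity scores have already been estimated (typically by regressing treatment assignment on baseline covariates), and it uses greedy 1:1 nearest-neighbour matching with a caliper. The patient identifiers, scores and caliper width are hypothetical.

```python
def greedy_match(treated: dict[str, float], control: dict[str, float],
                 caliper: float = 0.05) -> list[tuple[str, str]]:
    """Greedy 1:1 nearest-neighbour matching on pre-computed propensity scores.

    Each treated patient is paired with the closest unused control patient.
    Patients whose nearest candidate lies outside `caliper` are discarded.
    """
    available = dict(control)  # copy so the caller's data is not modified
    pairs = []
    for pid, score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        best = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[best] - score) <= caliper:
            pairs.append((pid, best))
            del available[best]  # match without replacement
    return pairs


# Hypothetical propensity scores (probability of receiving TDF given baseline covariates).
tdf = {"T1": 0.30, "T2": 0.55, "T3": 0.90}
etv = {"E1": 0.32, "E2": 0.50, "E3": 0.58, "E4": 0.10}
matched = greedy_match(tdf, etv)
print(matched)
# → [('T1', 'E1'), ('T2', 'E3')]
```

Patient T3 has no ETV-treated counterpart within the caliper and is dropped, which shows how matching improves balance at the cost of sample size.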
By utilising PSM- or covariate-adjusted estimates, meta-analyses can reduce the impact of within-study heterogeneity (Table S2). Specifically, because patients on ETV might be at higher baseline risk of HCC than TDF-treated patients, unadjusted estimates from meta-analyses tend to produce HRs associating TDF with a smaller risk of HCC development relative to ETV. For example, Gu et al. 2020 calculated an unadjusted HR of 0.71, but an adjusted HR of 0.77, and Tseng et al. 2020 calculated an unadjusted HR of 0.75 but an adjusted HR of 0.88 (Table 1).
Estimates that have not been sourced from PSM- or covariate-adjusted populations are therefore at a greater risk of bias, and as a result are less appropriate for informing clinical practice, compared to estimates from PSM- or covariate-adjusted populations.
Nevertheless, using PSM- and covariate-adjusted estimates does not guarantee that the resulting estimates are robust, particularly if key variables were omitted (Table S2). For example, in Kim et al. 2019 only 9 variables were used for matching, and well-known predictors of HCC, such as HBV DNA levels and alanine aminotransferase levels, were not included.
As previously mentioned, this is a particular issue in studies using administrative claims databases, where clinical data on key covariates may not be available, meaning that resulting adjusted estimates may still be biased in an unpredictable direction. Accordingly, researchers should be judicious in including administrative claims database studies in the primary analysis, potentially saving them for inclusion in sensitivity analyses.
Heterogeneity in follow-up time
As a type of within-study heterogeneity, the difference in follow-up time can be accounted for by selecting comparable observation periods (for example, truncating ETV data) and ensuring censoring is similar across arms.
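As an illustration of truncating follow-up, the sketch below administratively censors every patient at a common horizon, so that events observed only because one arm was followed for longer no longer count. The record layout and function name are hypothetical; the 33-month horizon simply reuses the median TDF follow-up quoted earlier for Kim et al. 2018 and is not a recommended cut-off.

```python
def truncate_follow_up(records, horizon_months):
    """Administratively censor every patient at a common follow-up horizon.

    `records` is a list of (months_followed, hcc_event) tuples, where
    hcc_event is 1 if HCC was diagnosed at `months_followed` and 0 if the
    patient was censored. Follow-up beyond the horizon is cut off, and any
    event occurring after it is recoded as censored at the horizon.
    """
    truncated = []
    for months, event in records:
        if months > horizon_months:
            truncated.append((horizon_months, 0))
        else:
            truncated.append((months, event))
    return truncated


# Hypothetical ETV records, truncated to a 33-month horizon.
etv_records = [(66, 1), (20, 1), (48, 0), (30, 0)]
truncated_records = truncate_follow_up(etv_records, horizon_months=33)
print(truncated_records)
# → [(33, 0), (20, 1), (33, 0), (30, 0)]
```

Truncation discards information from the longer-followed arm, so it trades precision for comparability between arms.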
However, many studies included in the published meta-analyses did not account for follow-up time or did not include the treatment initiation date in the matching variables. As such, another approach for researchers conducting meta-analyses would be to collect the IPD from the studies and either select patients based on adequate follow-up, or match patients with similar follow-up across arms. These data could then be analysed using IPD meta-analysis techniques. Alternatively, researchers may choose to exclude studies with substantial differences in follow-up time between arms, but this may drastically limit the information captured in the meta-analysis.
The issue of follow-up time bias between ETV and TDF is also likely to decline as more research is published. Studies conducted in 2019 and 2020 are more likely than earlier studies to have access to datasets with comparable follow-up times of 5+ years for both ETV- and TDF-treated patients. Such durations are sufficiently long to ensure that follow-up time bias could have only a negligible effect on the study’s overall results, while also allowing for sufficient events to be observed to allow for a comparison between ETV and TDF.
However, with larger sample sizes and longer follow-up times comes the risk of results being statistically significant (due to the high statistical power) without being clinically meaningful, which is yet another reason to interpret statistical significance testing results with caution.
In principle, exclusion criteria should be used to filter out studies in considerably different populations, or with substantial heterogeneity in methodologies. For example, many of the meta-analyses excluded patients with HIV or HCV coinfection, who may be at a higher risk of adverse outcomes compared to the wider CHB population.
Gu et al. 2020 and Choi et al. 2020 also specified that only studies classed as high-quality would be included in the meta-analyses, defined as a Newcastle-Ottawa score of over 6, or a MINORS (Methodological Index for NOn-Randomized Studies) score of over 10, respectively.
However, researchers may opt to err on the side of inclusion over exclusion when making these decisions to maximise the sample size and thus the power of the meta-analysis. The effects of heterogeneity can also be assessed through sub-group analyses or meta-regression:
Sub-group analyses involve reapplying an analysis to a subset of studies based on a characteristic of interest (e.g. populations with cirrhosis vs. populations without cirrhosis), and then assessing whether the produced estimates are different between these sub-groups.
Meta-regression applies covariate adjustment at an aggregate level, using the characteristic of each individual study (e.g. the proportion of patients with cirrhosis) for adjustment, and testing whether any feature is significantly associated with the outcome of interest (i.e. the risk of HCC).
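A meta-regression with a single study-level covariate can be sketched as a weighted least-squares fit of the log HR on that covariate. The version below is a deliberately simplified, fixed-effect illustration using inverse-variance weights; it ignores residual between-study variance, which a full random-effects meta-regression would estimate. All study values are hypothetical.

```python
from math import log


def meta_regression(log_hrs, std_errors, covariate):
    """Weighted least-squares fit of log HR on one study-level covariate.

    Returns (intercept, slope) using inverse-variance weights.
    """
    w = [1 / se ** 2 for se in std_errors]
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, covariate)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, log_hrs)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, covariate, log_hrs))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, covariate))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical studies: proportion of patients with cirrhosis and the reported HR.
prop_cirrhosis = [0.10, 0.25, 0.40, 0.60, 0.80]
study_hrs = [0.65, 0.72, 0.80, 0.88, 0.95]
study_ses = [0.20, 0.15, 0.18, 0.12, 0.22]
intercept, slope = meta_regression([log(h) for h in study_hrs], study_ses, prop_cirrhosis)
print(f"log HR = {intercept:.2f} + {slope:.2f} x proportion with cirrhosis")
```

A positive slope in this invented example would mean that the HR moves towards 1 as the proportion of patients with cirrhosis rises. Because the covariate is a study-level average, such an association does not necessarily hold for individual patients.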
Tseng et al. 2020 provides a good example of extensive sub-group analyses, examining the relationship between HCC risk and: adjustment method (covariate vs. PSM), publication type (article vs. abstract), race (Asian vs. mixed/non-Asian), study start date (before 2011 vs. after 2011), study scale (international vs. single country and multicentre vs. single site), funding source (industry vs. non-industry) and study type (prospective vs. retrospective).
The strength of such an approach is that it can build a picture of how and why differences in observational studies might be contributing to the overall HR estimate. This can help to contextualise the overall HR estimate – particularly when attempting to understand whether any difference between drugs is attributable to the drugs or other factors – and provide HR estimates directly for patient groups of interest.
However, such sub-group and meta-regression analyses are usually performed on a small number of studies within each sub-group (or within each covariate category for meta-regression). Such analyses remain subject to uncertainty and may be prone to producing both false-positive and false-negative findings. They may produce false-negative results because there are too few patients to detect meaningful differences between sub-groups, and they may produce false-positive results due to, i) multiple testing and ii) confounded comparisons. For example, in the Tseng et al. 2020 sub-group analyses, the 3 industry-funded studies had lower HRs (HR = 0.68) than the 11 non-industry-funded studies (HR = 0.93).
However, 2 of the 3 industry-funded studies were based on administrative claims data and electronic health record databases, which may themselves be associated with lower HRs, thus the sub-group analysis examining the effect of funding may be confounded by differences in study design.
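The multiple-testing concern can be quantified with a simple calculation of the family-wise error rate, that is, the probability of at least one false-positive finding across several tests. The sketch assumes that the tests are independent and that each uses a threshold of 0.05; sub-group analyses drawn from the same set of studies are unlikely to be fully independent, so the figures are indicative only.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Illustrative numbers of sub-group comparisons (independence is assumed).
for n_tests in (1, 4, 8):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests} tests: {rate:.1%} chance of at least one false positive")
```

Under these assumptions, 8 sub-group comparisons would carry roughly a one-in-three chance of at least one spurious significant result.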
Despite 9 meta-analyses being published in short succession, there remains a lack of consensus as to whether ETV and TDF differ in their ability to reduce the long-term risk of HCC in patients with CHB. This is partly explained by the arbitrary nature of significance levels and challenges in interpretation of statistical significance, with Fig. 1 illustrating the overall agreement between the meta-analyses. However, not all of the meta-analyses are of equal quality: robust meta-analyses have used adjusted HRs and PSM cohorts, excluded low-quality studies and examined sources of heterogeneity through sub-group analyses and meta-regression. Yet a large degree of uncertainty remains even in these more robust meta-analyses. For example, bias may remain in many observational studies, even after PSM and covariate adjustment, particularly if key adjustment variables are not reported.
High-quality, multicentre RCTs would provide a yet unseen quality of evidence in the field; however, they are unlikely given the length of follow-up time and number of patients required. Researchers must instead focus on mitigating the biases that arise when working with observational data. This requires using clinical knowledge to assess which patient characteristics are biologically important, or have predictive utility, and selecting appropriate statistical methods to adjust for these variables in analyses. Throughout this process, there are many justifiable methodological decisions for researchers to make, and a range of plausible HR estimates and conclusions they could draw. That said, researchers are ultimately limited by the methodologies of the studies included in their meta-analyses. Simply performing more meta-analyses is unlikely to improve on the estimates from existing meta-analyses. Alternatively, an IPD meta-analysis would offer a more robust estimate, by allowing biases to be explicitly accounted for with consistent methodologies across all datasets. This approach has recently been used to gauge HCC risk in patients with chronic hepatitis C and to determine the utility of hepatitis B core-related antigen as a marker for high viral load in patients with CHB.
While other challenges, such as securing the agreement and ethics approval from sufficient studies, would arise, and an IPD meta-analysis would not address the potential lack of generalisability resulting from the predominance of studies conducted in East Asia, the methodology would nonetheless address many of the issues of the aggregate meta-analyses.
In the long term, there is a role for our professional societies, including the European Association for the Study of the Liver, in developing consensus-based checklists for conducting observational studies on long-term outcomes in the field of liver diseases. While existing checklists such as STROBE provide valuable guidance on the reporting of observational studies, there remains a need for greater alignment in the design of observational studies looking at the long-term risk of HCC to help mitigate some of the methodological challenges explored in this review. As the next wave of CHB treatments is on the horizon, let us learn from the experience of the past to improve the robustness of future research efforts.
Support for third-party medical writing services on this manuscript was provided by Gilead Sciences.
All authors provided input on the original ideas for the manuscript; all authors reviewed and provided critical feedback on the manuscript drafts and all authors have approved the submitted version.
Conflict of interest
Won-Mook Choi has no conflicts of interest to disclose. Terry Yip has served as an advisory committee member and a speaker for Gilead Sciences. Young-Suk Lim is an advisory board member of Bayer Healthcare and Gilead Sciences and receives investigator-initiated research funding from Bayer Healthcare and Gilead Sciences. Grace Wong has served as an advisory committee member for Gilead Sciences and Janssen, as a speaker for Abbott, Abbvie, Bristol-Myers Squibb, Echosens, Furui, Gilead Sciences, Janssen and Roche, and has received research grants from Gilead. W Ray Kim has served as an advisory committee member for Gilead Sciences, Inovio Pharmaceuticals and Roche.
Please refer to the accompanying ICMJE disclosure forms for further details.
The authors acknowledge Hattie Cant, Tristan Curteis and Ben Farrar, from Costello Medical, for medical writing and editorial assistance based on the authors’ input and direction.
The following is the supplementary data to this article: