Clinical performance comparators in audit and feedback: a review of theory and evidence

Background: Audit and feedback (A&F) is a common quality improvement strategy with highly variable effects on patient care. It is unclear how A&F effectiveness can be maximised. Since the core mechanism of action of A&F depends on drawing attention to a discrepancy between actual and desired performance, we aimed to understand current and best practices in the choice of performance comparator.
Methods: We described current choices for performance comparators by conducting a secondary review of randomised trials of A&F interventions and identified the associated mechanisms that might have implications for effective A&F by reviewing theories and empirical studies from a recent qualitative evidence synthesis.
Results: We found across 146 trials that feedback recipients’ performance was most frequently compared against the performance of others (benchmarks; 60.3%). Other comparators included recipients’ own performance over time (trends; 9.6%) and target standards (explicit targets; 11.0%), and 13% of trials used a combination of these options. In studies featuring benchmarks, 42% compared against mean performance. Eight (5.5%) trials provided a rationale for using a specific comparator. We distilled mechanisms of each comparator from 12 behavioural theories, 5 randomised trials, and 42 qualitative A&F studies.
Conclusion: Clinical performance comparators in published literature were poorly informed by theory and did not explicitly account for mechanisms reported in qualitative studies. Based on our review, we argue that there is considerable opportunity to improve the design of performance comparators by (1) providing tailored comparisons rather than benchmarking everyone against the mean, (2) limiting the number of comparators being displayed while providing more comparative information upon request to balance the feedback’s credibility and actionability, (3) providing performance trends but not trends alone, and (4) encouraging feedback recipients to set personal, explicit targets guided by relevant information.
Electronic supplementary material: The online version of this article (10.1186/s13012-019-0887-1) contains supplementary material, which is available to authorized users.


Introduction
Audit and feedback (A&F), a summary of clinical performance over a specified period of time, is one of the most widely applied quality improvement interventions in medical practice. A&F appears to be most successful if it is provided by a supervisor or colleague, more than once, both verbally and in writing, if baseline performance is low, and if it includes explicit targets and an action plan [1,2]. However, reported effects vary greatly across studies and little is known about how to enhance its effectiveness [3]. In order to advance the science of A&F, the field has called for theory-informed research on how to best design and deliver A&F interventions [4,5]. Numerous hypotheses and knowledge gaps have been proposed requiring further research to address outstanding uncertainty [5,6]. One area of uncertainty is the choice of performance comparator included in feedback reports.
Although it is feasible to provide clinical performance feedback without an explicit comparison [7,8], feedback is typically provided in the context of a performance comparator: a standard or benchmark to which the recipient's observed performance level can be compared. Comparators play an important role in helping feedback recipients to identify discrepancies between current and desirable practice [9] and improve self-assessments [10]. While most often performance is compared against the average of a peer group [11], many other potential comparators have been proposed in the literature. The choice of comparator may have important implications for what message is conveyed by the feedback, and therefore how recipients react to it [12]. For instance, if a physician's performance level has improved since the previous audit but remains well below national average, comparing against the physician's previous level would suggest that there is no need for change, whereas comparing against the national average would suggest the opposite. At the same time, existing psychological theories suggest that the mechanisms by which recipients respond to feedback are complex, making it less obvious that recipients adopt an 'externally imposed' performance comparator as a personal target [7,13]. Empirical studies show that, instead, recipients may reject feedback recommendations to pursue other levels of performance [14,15]. To date, little evidence informs A&F intervention designers about which comparators should be chosen under what circumstances and how they should be delivered to the recipients [5,16].
We aim to inform choices regarding performance comparators in A&F interventions and help identify causal mechanisms for change. Our objectives were to (1) describe choices for delivering clinical performance comparators in published A&F interventions and (2) identify the associated mechanisms from theories and empirical studies that might have implications for effective A&F.

Methods
To identify current choices for performance comparators, we examined all A&F interventions evaluated in the 146 unique trials included in the 2012 Cochrane review [1] and the 2017 systematic review of electronic A&F [2]. The Cochrane review spanned 1982-2011; the systematic review spanned 2006-2016. Both reviews included the databases Cochrane Central Register of Controlled Trials, MEDLINE, EMBASE, and CINAHL. We developed a data extraction sheet and guide in order to extract details about delivered comparators from all included studies. These details included what comparators were delivered, their origin, specific values delivered, and the rationale for their use. The guide and sheet were piloted by 2 reviewers (WG and BB) on 10 initial studies, followed by a second pilot on 10 additional studies; after each pilot, improvements to terms and definitions were made. WG and BB independently extracted the data; disagreements were resolved through discussion.
To identify the potential mechanisms associated with each of the different comparators that have implications for effective A&F, we reviewed existing behaviour change theories and evidence from empirical A&F studies. Candidate theories were identified from a systematic review of theories used in randomised trials of A&F [17], contact with experts, and a supplemental theory-focused literature search following the methodology detailed by Booth and Carroll [18] (Additional file 1). Empirical studies were the randomised trials included in the two reviews [1,2], and qualitative evaluation studies included in the systematic review and meta-synthesis that was recently undertaken by part of the study team [19]. We included theories and empirical studies if they described explanations of why, how, or when a behaviour may or may not occur as a result of the comparator choice within the context of receiving clinical performance feedback. From the included theories and randomised trials, we summarised relevant predictions and evidence. From the qualitative studies, we extracted and coded excerpts in batches using Framework Analysis [20] and Realistic Evaluation [21,22] (see details in [19]). We used an iterative process to formulate mechanisms for each comparator and refine and generalise across the included theories and empirical studies [23,24].
The consolidated results were discussed, refined, and agreed with the team. The 10-member study team has extensive expertise in designing and evaluating A&F interventions, behaviour change, implementation science, and health psychology. Three authors (HC, NI, JB) previously reviewed or have been involved in reviewing 140 randomised A&F trials [1,11], 3 authors (BB, SvdV, NP) reviewed 7 randomised trials of electronic A&F [2], and 4 authors (WG, BB, SvdV, NP) have reviewed 65 qualitative studies of A&F [19]. The team also included clinicians with experience as feedback recipients or feedback designers.
In the 'Results' section, we present the descriptions and frequency with which performance comparators have been used in randomised trials of A&F interventions, followed by the comparators' mechanisms supported by theory and empirical evidence.

Results
Table 1 summarises the key characteristics of the included 146 RCTs [1,2] and 65 qualitative evaluation studies [19] of A&F interventions. We found that 98 of the 146 (67.1%) included A&F interventions used performance comparators within feedback messages; the remaining 48 intervention trials either explicitly stated they did not use a comparator or did not mention it. Possible comparators included the performance achieved by other health professionals (benchmarks, n = 88; 60.3%), recipients' own historical performance (trends, n = 17; 9.6%), or target standards (explicit targets, n = 16; 11.0%). Several interventions used more than 1 type of comparator (n = 19; 13.0%). Only 8 (5.5%) trials reported a rationale for using their specified comparator. We included 12 theories relating to general feedback mechanisms [7,9,25], goal-setting [13], guideline adherence [26], psychology [27][28][29][30], and sociology [31][32][33], and incorporated empirical findings from 5 randomised controlled trials and 42 qualitative studies to help explain comparator mechanisms and their potential effects on clinical performance. Table 2 provides these mechanisms and their theoretical and empirical support. Table 3 shows the details and frequencies of the comparators delivered in A&F interventions.

Benchmarks
In 88 (60.3%) interventions, the feedback included benchmarks, i.e. comparisons of recipients' achieved performance against that of other health professionals or peers. Benchmarks could be characterised by the group of peers being compared against (reference group) and by how the group's performance was represented (summary statistic). We identified 7 theories, 5 trials, and 32 qualitative studies that suggested mechanisms relevant to benchmarking (Table 2). Although benchmarks in principle do not necessarily explicitly state what levels recipients are expected to achieve, they may be perceived as targets that recipients use for improvement. In fact, they can harness competition between recipients (Social Comparison Theory [31]) and motivate recipients to change behaviour if they see others behaving differently (Persuasion Theory [27] and Social Norms Theory [33]), trying to maintain their status in a group of high-performing clinicians (Reference Group Theory [32]). Recipients who observe that others are achieving a certain level of [...]

Table 2 Potential mechanisms and effects of each comparator, with theoretical and empirical support
Comparator: Benchmarks
[...] Theories (n = 4): Social Comparison Theory [31], Persuasion Theory [27], Social Norms Theory [33], Reference Group Theory [32]. Qualitative studies (n = 12): [34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52]. RCTs (n = 2): [53,54].
Debilitates feedback effectiveness by directing attention away from the performance task at hand (e.g. prescribing appropriate medication), which allows recipients to explain away potentially bad performance if overall performance is low. Theories (n = 1): Feedback Intervention Theory [7].
Induces both positive and negative emotions, depending on whether relative performance level is high or low respectively, by increasing competition through social influence.
Benchmarking against a reference group considered irrelevant or unfair by recipients (e.g. due to case-mix difference or inadequate statistical adjustment in outcome measures) inhibits feedback acceptance by decreasing credibility and perceived validity.
Benchmarking against values that reflect mean or median performance inhibits action by limiting recipients' perception of room for improvement (e.g. comparing against the mean only demonstrates discrepancies to half of recipients).
Benchmarking against high values (e.g. the 90th percentile) inhibits feedback acceptance by low performers if they consider the discrepancy too large and unachievable.
Benchmarking against identifiable individual peers may increase effectiveness because recipients can choose the most relevant peers for comparison and because it increases their sense of competition, knowing that their own performance is reported to others.
Benchmarking against identifiable individual peers inhibits feedback acceptance when recipients consider (semi)public reporting of their own performance inappropriate and a threat to their autonomy.
Multiple benchmarks (multiple groups or values, or individual peer scores) facilitate feedback acceptance by increasing credibility because they help recipients assess variation between professionals and judge whether potential discrepancies are clinically significant.
Multiple benchmarks allow recipients to make downward social comparisons (a defensive response to feel better about themselves) instead of upward social comparisons, which inhibits action.
Comparator: Trends
Facilitates action by decreasing the complexity in a way that helps recipients interpret and identify when clinical performance requires action, in particular if the reference period includes sufficient time points at regular intervals, dependent on the performance topic and the number of observations in each interval.
Increases the observability of the feedback intervention, which induces positive emotions by demonstrating how recipients' clinical performance has improved over time as a consequence of the actions they have taken; higher improvement rates are associated with higher satisfaction.
Facilitates acceptance of feedback by increasing its credibility because performance is measured during a reference period that includes multiple time points (e.g. to eliminate the possibility of one-time coincidentally low performance).
Comparator: Explicit targets
Facilitates action by reducing complexity of the feedback, making it easier for recipients to know what constitutes 'good performance' and therefore what requires a corrective response.
Targets from an external source that lacks power or credibility inhibit acceptance of negative feedback by inducing cognitive dissonance; recipients may respond by rejecting the target/feedback to resolve this dissonance and maintain the perception of self-integrity, rather than question their own competency as a clinician. Theories (n = 4): Ilgen et al. [25], Cabana et al. [26], Theory of Cognitive Dissonance [28], Self-affirmation Theory [29]. Qualitative studies (n = 2): [68,84].
Self-set targets (i.e. source is feedback recipients themselves) increase goal commitment and progress towards the target, but recipients may choose inappropriate targets (i.e. too low or unachievably high) to eliminate the discrepancy or because they do not know how to set targets.
Ambitious target values increase feedback effectiveness over simple targets as long as they are (considered) achievable.
Absolute target values are simpler (decreasing complexity) than relative targets but can become outdated when achieved by most recipients, which inhibits continuous quality improvement.
Relative targets based on benchmarking facilitate continuous quality improvement as they can be automatically adjusted when the group performance changes, but they also inhibit action because they create uncertainty for recipients as to which performance levels should be targeted.
Relative target values based on benchmarking inhibit feedback acceptance if recipients consider them unfair, in particular if performance is just below target and variation between peers is small and clinically insignificant. Qualitative studies (n = 2): [59,84].

Reference group
Benchmarks were typically drawn from the performance of peers in the same region (n = 39; 24.7%), state or province (n = 26; 17.8%), country (n = 21; 14.4%), or, in case of individualised feedback, other health professionals within the same unit, hospital, or department (n = 12; 8.2%). In 3 (2.1%) cases, benchmarks concerned similar-type peers such as only teaching hospitals or non-teaching hospitals. Finally, in 19 (13.0%) cases, comparisons to multiple peer groups were provided, such as the region and country, or only teaching hospitals and all hospitals in the province. Qualitative studies reported that recipients were more likely to accept the benchmark when they considered its reference group relevant and comparable [36,39,40,51,52,61,62,63], as also hypothesised by the Reference Group Theory [32]. This suggests that regional comparisons are typically preferred over national ones, and comparisons that differentiate between the type of peers may be more effective than those that do not. Conversely, recipients rejected feedback when they felt that the comparison was irrelevant or unfair, such as when they perceived inadequate case-mix adjustment or patient stratification [36,39,52,62,63].

Summary statistic
The most common benchmark value was the group mean (n = 37; 25.3%). Other summary statistics used were the mean of the top 10% peers (n = 7; 4.8%; also known as the achievable benchmark of care, or ABC benchmark, defined as the mean performance achieved by the top 10% best performers of the group [64]), the median (n = 6; 4.1%) or various other percentiles such as the 75th or 80th percentile (n = 6; 4.1%), and the recipient's rank or percentile rank in the group (n = 4; 2.7%).
In contrast to using a summary statistic as the value of a benchmark, feedback in 26 (17.8%) interventions presented the individual performance scores achieved by peers in the group, e.g. in a bar chart, histogram, or [...]
[...] peers, as these could more clearly demonstrate discrepancies between actual and desired performance for the majority of feedback recipients [65][66][67]. Benchmarking against the mean reveals such discrepancies to at most half of the recipients and may not lead to the desired intentions to achieve the highest standards of care (Control Theory [9]). This was also supported by several qualitative studies in which recipients were not prompted to improve because the performance was 'in the middle' [35,59,68], or recipients were dissatisfied by comparing against the mean because they did not consider it as being the gold standard [35,62]. In a randomised trial comparing two variations of benchmarks, Kiefe et al. [65] found that comparing to the top 10% of peers led to greater feedback effectiveness than comparing to the mean. However, Schneider et al. [66] found that identifying the top performers in the context of a quality circle did not improve the effectiveness of feedback. Consistent with Goal-setting Theory [13], some low performers considered such high benchmarks unachievable and questioned or disengaged from the feedback [35,62] and may have benefitted more from comparing to the mean. Feedback in three (2.1%) interventions presented individual peers' performance scores while making the identities of those peers visible to recipients. In two cases, this concerned all peers [69,70], whereas in the other, only the top performer was identified [66]. This approach may be effective as it allows recipients to choose the most relevant peers for comparison (Reference Group Theory [32]) and further increases their sense of competition knowing that their own performance is reported to others (Social Comparison Theory [31]). However, qualitative studies have reported that recipients experienced such open reporting as threatening and therefore preferred anonymous data [44,48,61,71,72].
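To make the difference between these summary statistics concrete, the following minimal Python sketch computes the group mean, the median, the mean of the top 10% of peers, and a recipient's percentile rank. It follows the simplified definition of the top-10% benchmark given above; the function names and the peer scores are hypothetical and serve only as illustration, and higher scores are assumed to mean better performance.

```python
import math
import statistics


def top_percent_mean(scores, top_percent=10):
    """Mean score of the best-performing `top_percent` of peers."""
    # Always include at least one peer, and round the group size up.
    k = max(1, math.ceil(len(scores) * top_percent / 100))
    return statistics.mean(sorted(scores, reverse=True)[:k])


def percentile_rank(scores, own_score):
    """Percentage of peers scoring at or below the recipient's own score."""
    return 100.0 * sum(s <= own_score for s in scores) / len(scores)


def benchmark_summary(scores, own_score):
    """Collect the summary statistics discussed in the text."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "top_10pct_mean": top_percent_mean(scores, 10),
        "percentile_rank": percentile_rank(scores, own_score),
    }


# Hypothetical guideline-adherence rates (%) for a group of 10 peers.
peer_scores = [52, 58, 61, 64, 66, 70, 73, 77, 85, 94]
summary = benchmark_summary(peer_scores, own_score=70)
```

In this made-up group, a recipient scoring 70 sits exactly at the mean and therefore sees no discrepancy against it, whereas the top-10% benchmark (94) shows a gap of 24 percentage points.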

Multiple benchmarks
Sixteen (11.0%) interventions used a combination of benchmarks, such as the mean and standard deviation, median and the top 10%, or peers' individual scores and interquartile range. Several qualitative studies have indicated that providing multiple benchmarks (that is, against multiple groups, multiple summary statistics, or peers' individual performance scores) may facilitate the credibility of feedback because it helps recipients assess variation between professionals and judge whether potential discrepancies are clinically significant [37,40,57,59,73,74]. However, it also increases the complexity of the feedback message, making it more difficult to understand whether performance requires attention or not as there are multiple values to which recipients can compare (Feedback Intervention Theory [7]). This allows recipients to make downward social comparisons, a defensive tendency in which they compare themselves against a group or individual that they consider 'worse off' in order to feel better about themselves (Social Comparison Theory [31]). In contrast, comparing themselves against a group or individual that they perceive as superior can facilitate recipients' self-evaluation and improvement [31].

Trends
Feedback in 17 (9.6%) interventions included trends, i.e. comparisons to recipients' own previously achieved performance over a specified period (reference period). We identified 2 theories and 12 qualitative studies that suggested mechanisms relevant to trends (Table 2). For example, Foster et al. [75] provided 1-time feedback at 6 months after the start of a multifaceted educational programme to increase adherence to asthma guidelines in which recipients' current performance was compared to that at baseline. Rantz et al. [76] provided feedback that included trends displayed as a line graph of recipients' performance over the previous 5 quarters. Trends allow recipients to monitor themselves and assess the rate of change in their performance over time. Feedback Intervention Theory [7] and theory on self-regulation [30] refer to this as velocity feedback and indicate that rapid rates of improvement lead to more goal achievement and satisfaction, whereas constant or delayed improvement rates ultimately lead to withdrawal. Empirical studies found that recipients who observed deteriorating performance were often prompted to take corrective action [37-39, 44, 46, 50, 51, 55, 77-83]. Upward trends made successful change observable to recipients, which promoted satisfaction and other positive emotions [44][45][46][77][78][79][80]. Feedback messages that include performance at multiple time points may also facilitate the credibility of the message if a single instance of low current performance would otherwise have been considered a 'snapshot' and explained away as chance or seasonal effects [39,45]. However, past performance does not clearly guide improvement: it tells recipients where they came from but not where they should end up. This may be 1 of the reasons that 13 of the 17 studies provided additional comparators (benchmarks or explicit targets).

Reference period
The reference period used to display trends, described by the number of time points and intervals of past performance, was typically consistent with the number of times and frequency with which feedback was provided. Most often, trends displayed quarterly (n = 7; 4.8%) or monthly (n = 4; 2.7%) performance; other variants were weekly (n = 2; 1.4%), biyearly (n = 2; 1.4%), or yearly (n = 1; 0.7%). While qualitative studies reported that recipients valued 'regular updates', the exact frequency preferred by recipients typically depended on the clinical topic and the number of observations (e.g. patients) available each audit [37,39,45,46,82,83].
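As an illustration of what velocity feedback over a reference period could look like numerically, the sketch below derives period-to-period changes and their average from a series of past performance scores. It is a hedged example rather than a method taken from the included trials: the function names and the quarterly values are hypothetical.

```python
def period_changes(series):
    """Change in performance between consecutive audit periods."""
    if len(series) < 2:
        raise ValueError("a trend needs at least two time points")
    return [later - earlier for earlier, later in zip(series, series[1:])]


def mean_velocity(series):
    """Average change per period across the whole reference period."""
    changes = period_changes(series)
    return sum(changes) / len(changes)


# Hypothetical performance (%) over the previous 5 quarters.
quarterly = [60, 63, 67, 66, 72]
changes = period_changes(quarterly)
velocity = mean_velocity(quarterly)
```

Here the series improves by 3 percentage points per quarter on average despite one small dip, which a single time point could not show. The sketch also reflects the limitation noted above: it says how fast performance is changing, not what level should be reached.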

Explicit targets
In 16 (11.0%) interventions, health professionals received feedback with an explicit target: a specific level of achievement that is explicitly expected. Targets could be characterised by the person or party setting the target (source) and the level it is set at (value). Seven theories and 6 qualitative studies suggested mechanisms relevant to targets (Table 2). The use of explicit targets reduces the complexity of feedback messages because it makes it easier for recipients to know what needs to be attained and whether a corrective response is necessary (Control Theory [9], Goal-setting Theory [13], Feedback Intervention Theory [7]). Two qualitative studies confirmed this [84,85]. Explicit targets can be based on expert opinion, healthcare policies, performance data (e.g. benchmarks or trends), or a combination of these. The main difference between explicit targets, benchmarks, and trends is that the latter 2, despite potentially revealing important discrepancies with desired practice, may not explicitly judge current performance, leaving it to recipients to determine whether their performance is acceptable or not.

Source
Targets were set by an external party (i.e. externally set targets; n = 11) or locally by feedback recipients themselves (i.e. self-set targets; n = 5); two interventions used both. External targets were set by an expert panel (n = 3; 2.1%), investigators (n = 5; 3.4%), or guidelines or government (n = 3; 2.1%). In one case (0.7%), the source was unclear. While powerful target-setting sources can influence recipients' decisions to take action, theory by Ilgen et al. [25] predicts that feedback from a source with low power or credibility is easily rejected. Cabana's model of guideline adherence [26] indicates that physicians may have various reasons for non-adherence to recommended targets, such as disagreement or lack of self-efficacy or outcome expectancy. Accepting a message indicating that performance is below a target requires recipients to acknowledge the fact that they are underperforming. However, this might conflict with the self-perception of being a capable and competent health professional, a situation referred to as cognitive dissonance (Theory of Cognitive Dissonance [28]). The theory states that recipients might find it easier to resolve this conflict by rejecting the externally imposed target, rather than question their own competency, even if the feedback holds compelling and meaningful information. Two qualitative studies reported similar responses by recipients due to cognitive dissonance [68,84]. Self-affirmation Theory [29] explains that such defensive responses arise, in part, from the motivation to maintain self-integrity. Affirmations of alternative domains of self-worth unrelated to the provoking threat (e.g. by also emphasising high performance on other care aspects) can help recipients deal with threatening information without resorting to a defensive response [29].
When feedback recipients set performance targets themselves (self-set targets), they are more likely to commit to and make progress towards the targets (Goal-setting Theory [13]). Qualitative studies have shown that feedback with self-set targets may decrease the consistency in clinical performance across recipients [85,86], in particular if the targets are not supported by an external information source (e.g. benchmarking). Furthermore, recipients might adapt their targets to performance to eliminate discrepancies rather than vice versa (Feedback Intervention Theory [7]).

Values
Ambitious targets are more effective than easy ones as long as they are achievable (Goal-setting Theory [13] and Feedback Intervention Theory [7]). However, it might prove difficult to define a single target that is perceived as both ambitious and achievable by all recipients of a feedback intervention. Six (4.1%) interventions used absolute targets, or criterion-referenced targets, which are typically determined at or before baseline and do not change over time. For example, in Sommers et al. [87], an expert panel set a specific target (between 80 and 90%) for each quality indicator. Rantz et al. [76] provided 2 explicit targets to distinguish between good and excellent performance (e.g. 16% vs 6% rate of falls). In another 6 (4.1%) interventions, the targets related to benchmarking against best practice. For example, in Goff et al. [88], researchers set explicit targets at the 80th percentile of participants' baseline performance. Finally, 3 (2.1%) interventions set targets based on trends. For example, Fairbrother et al. [89] awarded financial bonuses to recipients who achieved 20% improvement from baseline, and Curran et al. [90] fed back statistical process control charts with control limits determined by the unit's past performance to define out-of-control performance. With absolute targets, it is possible for all recipients to pass or fail depending on their achieved performance level, whereas with relative targets, by definition, discrepancies are only presented to a subset of recipients. Relative targets based on benchmarking may be considered unfair by recipients performing just below them, in particular when the distribution of performance scores is narrow and differences between health professionals are clinically insignificant [59,84]. Incremental targets demonstrate discrepancies to all recipients but may be unachievable when baseline performance is already high.
Absolute targets are very simple to understand, but can become outdated when achieved by most recipients and should be reset in response to changing performance levels to remain appropriate [91]. Relative targets based on benchmarking can be automatically adjusted when the provider group performance changes. This facilitates continuous quality improvement (i.e. targets increase as the group improves), but due to its changing nature, it also creates uncertainty for recipients as to which performance levels should be targeted to guide improvement efforts [72]. However, in the included studies, relative targets were all set once and did not change.
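The contrast between absolute, relative, and incremental targets can be sketched as follows. This is an illustrative example under stated assumptions, not a reconstruction of any included trial: the baseline scores are hypothetical, the relative target uses a nearest-rank 80th percentile, and "20% improvement" is interpreted as a relative increase over each recipient's own baseline on a 0-100% scale.

```python
import math


def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def below_target(baseline, target):
    """Recipients for whom the feedback would show a discrepancy with the target."""
    return sorted(name for name, score in baseline.items() if score < target)


def incremental_targets(baseline, improvement_percent=20):
    """Per-recipient targets defined relative to each recipient's own baseline."""
    return {
        name: score * (100 + improvement_percent) / 100
        for name, score in baseline.items()
    }


# Hypothetical baseline performance (%) for five recipients.
baseline = {"A": 52, "B": 61, "C": 70, "D": 77, "E": 94}

absolute_target = 80  # fixed, criterion-referenced value (assumed)
relative_target = nearest_rank_percentile(baseline.values(), 80)
own_targets = incremental_targets(baseline, 20)

# Incremental targets above 100% cannot be reached on this scale.
unachievable = sorted(name for name, target in own_targets.items() if target > 100)
```

With these made-up numbers, the absolute target of 80 flags four of five recipients and a stricter one could flag all five, the relative target (77) by construction flags only those below the 80th-percentile recipient, and the incremental target is out of reach for the recipient who already starts at 94.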

Discussion
In an effort to inform the design and delivery of more reliably effective A&F, we reviewed 146 randomised trials to identify choices for delivering clinical performance comparators. Ninety-eight (67.1%) included 1 or more comparators. Health professionals' performance was compared against the performance of others (benchmarks; 60.3%), the recipient's own historical performance (trends; 9.6%), expected standards of achievement (explicit targets; 11.0%), or a combination of these (13.0%). Only 8 trials (5.5%) stated a rationale for using the specific comparators. We identified 12 behavioural theories and evidence from 5 randomised trials and 42 qualitative studies from which we distilled explanations of the mechanisms through which different comparators may support quality improvement.

Comparison to existing literature
In a re-analysis of the earlier Cochrane review by Jamtvedt et al. [92] (118 trials), Hysong [93] found no effect of adding benchmarks to A&F, regardless of whether or not identities of peers were known to recipients. While our findings suggest that benchmarking should increase the effectiveness of A&F by harnessing the social dynamics between recipients, there remain unanswered questions with respect to how benchmarks could work best. In line with our results, two empirical studies of A&F [14,15] demonstrated that benchmarking against the mean and the top 10% of performers influences recipients' intentions to improve on quality indicators, even though these intentions are not always translated into effective action [94,95]. Still, study participants ignored some benchmarks because they were too high or the indicator lacked priority [14].
The effect of explicit targets has been previously investigated by Gardner et al. [96] in their re-analysis of the Jamtvedt review [92]. Gardner's results were inconclusive at the time because very few studies explicitly described their use of targets, but the 2012 update of the review [1] showed that target setting, in particular in combination with action planning, increased the effectiveness of A&F. The role of involving recipients in setting targets themselves remains uncertain in healthcare settings [97,98]. An empirical study [15] showed that recipients may set their targets regardless of any benchmarks or trends and (potentially unrealistically) high, even when confronted with benchmarks of the top 10% reflecting much lower standards [15].
Brehaut et al. [5] recently advocated a single comparator that effectively communicates the key message. While multiple comparators may indeed send complex and mixed messages to recipients, we found that well-considered and presented multiple comparators may be beneficial to the effectiveness of A&F [99]. This underlines the complexity of this area and the need for more research.

Implications for practice and research
Our findings are useful for guiding the design of A&F interventions with respect to the choice of performance comparator in feedback messages. We have identified a wide variety of comparators that may be included in feedback messages, as well as mechanisms and outcomes that potentially occur as a consequence of those comparators in terms of what message the feedback conveys (i.e. whether and how it reflects discrepancies with desirable practice), how recipients might respond, and ultimately the effectiveness of A&F. Many of the mechanisms we identified originate from behavioural science, which offers a large body of theoretical and empirical evidence not often taken into account by feedback designers [4,17]. The exact way in which a comparator modifies recipients' responses and the effectiveness of the intervention depends on various factors relating to the individual recipient or team, organisation, patient population, and/or clinical performance topic, in addition to whether and how the comparator reveals a discrepancy with current practice [19]. A&F designers should explicitly consider these factors and the mechanisms we presented and offer justification for their choice of comparator.
A single type of comparator that works for all recipients and for all care processes or outcomes targeted by the A&F intervention may not exist. Comparators should be designed to maximise feedback acceptance in the context of raising standards of care via multiple means. Based on our findings, we have four suggestions for choosing comparators.

Step away from benchmarking against the mean and consider tailored performance comparisons
Benchmarks work by leveraging the social dynamics between recipients, the main mechanisms of which have been described by Social Comparison Theory [31] and Reference Group Theory [32]. However, 42% of the A&F interventions included in this study that used benchmarking involved comparisons to the group mean. Theory predicts, and qualitative and quantitative evidence have demonstrated, that such comparisons are unlikely to raise performance levels comprehensively across feedback recipients. We recommend that recipients compare themselves to high-performing others that are both relevant and comparable to the recipient. However, if benchmarks are too high, they may be perceived as unachievable by low performers and lead to feedback rejection or other unintended consequences. For example, the authors of a recent A&F study to reduce high-risk prescribing in nursing homes felt that benchmarking against the top 10% might risk unintended discontinuation of appropriate medications and therefore compared against the top quartile instead [100]. A solution to this problem may lie in tailoring feedback messages to individual recipients or practices [12], for example by comparing low performers to the mean or median and others to the top 10%.
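As an illustration only, the tailoring rule suggested above could be sketched as follows. The function names, the use of the peer median as the cut-off for 'low performers', the definition of the top 10% benchmark as the mean score of the best-performing tenth of peers, and the example scores are all assumptions made for this sketch; they are not taken from any of the reviewed interventions. The sketch also assumes that higher scores indicate better performance.

```python
from statistics import mean, median


def top_decile_benchmark(scores):
    """Mean score of the best-performing 10% of recipients (at least one)."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, round(len(ranked) * 0.10))
    return mean(ranked[:n_top])


def tailored_comparator(own_score, peer_scores):
    """Return (label, value) of the benchmark shown to one recipient.

    Assumption: recipients scoring below the peer median are treated as
    low performers and compared to the median; all others are compared
    to the top 10% benchmark.
    """
    mid = median(peer_scores)
    if own_score < mid:
        return ("peer median", mid)
    return ("top 10% of peers", top_decile_benchmark(peer_scores))


# Hypothetical indicator scores (% of patients treated per guideline).
peers = [42, 55, 58, 61, 64, 67, 70, 74, 81, 93]
for score in (50, 72):
    label, value = tailored_comparator(score, peers)
    print(f"score {score}: compare against {label} = {value}")
```

In this hypothetical example, a recipient scoring 50 would be shown the peer median, whereas a recipient scoring 72 would be shown the top 10% benchmark. Other cut-offs (e.g. quartiles) could be substituted without changing the structure of the rule.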

Balance the credibility and actionability of the feedback message
Qualitative studies have found feedback credibility and actionability to be important characteristics that should be properly balanced when choosing comparators. Based on a single comparator, health professionals may explain negative feedback away as a coincidental 'snapshot' of low performance, or question the data quality or fairness of the comparison [101]. Offering multiple performance comparators may help recipients assess whether there are true discrepancies with desired practice. For example, trends reveal whether low performance was one-time or has been consistent over time, and multiple benchmarks (e.g. individual peer scores) indicate performance in light of the variation between health professionals. Although providing multiple comparators may therefore increase the credibility of the feedback, it also increases its complexity and cognitive load and might send mixed messages to recipients. For example, if a health professional's performance has improved over time but remains below the top 10% of practices, a feedback message suggesting that improvement is needed might be inconsistent with the professional's interpretation that 'the numbers are improving so no further change is necessary' [5]. Hence, feedback should be presented in a way that clearly presents the key message (i.e. improvement is recommended or not), limiting the amount of information (e.g. comparators) presented to increase actionability, while allowing recipients to view more detailed comparative information if desired to increase credibility.

Provide performance trends, but not trends alone
Trends enable recipients to monitor performance and progress over multiple time points, and many qualitative studies have shown that recipients are likely to act upon observed performance changes. In fact, Feedback Intervention Theory [7] and theory on self-regulation [30] suggest that the rate of performance change (i.e. velocity) may be a more important motivator for change than the distance between performance and a goal (i.e. discrepancy). Trends also increase the credibility of feedback and enable a quality improvement cycle in which recipients continuously self-assess their performance and decide whether or not to act. Trends therefore add substantial value to feedback and should be an explicit part of feedback messages. However, since trends only provide information about past performance and not about the goal, they should be accompanied by other comparators (i.e. a benchmark or explicit target) that provide explicit direction for further improvement.
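The distinction between discrepancy and velocity can be made concrete with a small sketch. The definitions used here (discrepancy as the target minus the most recent score, velocity as the average change per measurement period) and the example numbers are assumptions chosen for illustration; the cited theories do not prescribe a specific formula.

```python
def discrepancy(scores, target):
    """Gap between the goal and the most recent performance score."""
    return target - scores[-1]


def velocity(scores):
    """Average change in performance per measurement period."""
    if len(scores) < 2:
        raise ValueError("need at least two measurements to compute a trend")
    return (scores[-1] - scores[0]) / (len(scores) - 1)


# Hypothetical quarterly indicator scores and an explicit target.
quarterly = [60, 64, 69, 72]
target = 90

print(discrepancy(quarterly, target))  # 18
print(velocity(quarterly))  # 4.0
```

In this hypothetical case the recipient is improving by 4 points per quarter yet remains 18 points below the target. A trend shown alone would convey the first figure but not the second, which is why an accompanying benchmark or explicit target is needed.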

Encourage feedback recipients to set personal, explicit targets guided by relevant information
Goal-setting Theory [13], and various theories that extend it, predict that explicit targets reduce feedback complexity because they set specific, measurable goals. However, qualitative studies report that unless such externally set targets are set by a broadly recognised, credible authority (e.g. national guidelines) or are linked to financial incentives, accreditation, or penalties, they may not be acceptable to a subset of recipients. We therefore recommend that feedback recipients are encouraged to set their own targets, guided by relevant information drawn from guidelines, expert opinion, and performance data, to which explicit comparisons can be made in the feedback. Feedback providers can collaborate with recipients to ensure the appropriateness of targets. Although recipients may consequently pursue different targets, this approach also enables them to commit to self-chosen targets that are both achievable and appropriate for themselves, which reduces the chance of feedback rejection.

Strengths and limitations
To our knowledge, we are the first to have systematically considered existing relevant theories and empirical evidence to fill a key knowledge gap with regard to the use of clinical performance comparators in A&F interventions [4,6]. Few past studies have explicitly built on extant theory and previous research [17]. This work helps advance the science in the field by summarising the practical considerations for comparator choice in A&F design.
There are also several limitations. In using the 2012 Cochrane review of A&F and the 2017 systematic review of electronic A&F to identify current choices for performance comparators, we were limited to randomised controlled trials evaluated in a research setting. Other study designs, and A&F used in non-research routine healthcare settings, might have yielded other types and/or frequencies of performance comparators. In particular, because A&F in research settings likely emphasises performance improvement while routine A&F may focus more on performance monitoring, we expect that the comparators and mechanisms we identified are aimed more at activating recipients to improve practice than at only supporting recipients in assessing their performance. Another limitation is the quality of reporting and the lack of consistency in the terminology for comparators, particularly in the older studies [11,102]. One way in which this might have manifested is that it was often unclear to what extent performance comparators were delivered as explicit targets. For example, studies that used a particular benchmark may have added an explicit message that recipients were expected to achieve that standard, making the benchmark an explicit target as well, without this being reported as such in the paper. As a result, despite the prominence of targets in existing feedback theories [7,9,13], we have found limited evidence about the use of explicit targets.
Our review was limited to performance comparators at an aggregated level. When feedback is provided about individual patient cases, comparators at the patient-level may be included which allow feedback recipients to make performance comparisons for each patient [103]. We also did not explore the influence of the way in which comparators were displayed or represented in the feedback messages. Finally, we did not use meta-regression to examine and quantify the effects of each comparator because such an analysis would be vastly underpowered as a result of the large variety in comparator use across trials.

Unanswered questions and future research
Colquhoun et al. have generated a list of 313 theory-informed hypotheses that suggest conditions for more effective interventions, of which 26 relate to comparators [6]. Our research delivers some important pieces of the puzzle of how to design and deliver effective A&F, but many other pieces are still missing. To move the science forward, more of these hypotheses should be tested. Within the domain of performance comparators, theory-informed head-to-head trials comparing different types of comparators (e.g. [100,104]) are needed to help uncover successful comparators tested under similar conditions.

Conclusion
Published A&F interventions have typically used benchmarks, historic trends, and explicit targets as performance comparators. The choice of comparator seemed rarely motivated by theory or evidence, even though abundant literature about feedback mechanisms exists in theories from behavioural and social sciences and in empirical studies. Many interventions benchmarked against mean performance, which is unlikely to comprehensively raise the standards of care. There appears to be considerable opportunity to design better performance comparators to increase the effectiveness of A&F. Designers of A&F interventions need to explicitly consider the mechanisms of comparators and offer justification for their choice.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Authors' contributions
All authors contributed to the study conception and participated in critically appraising and revising the intellectual content of the manuscript. WG was primarily and BB secondarily responsible for the data extraction and the manuscript draft. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.