In 1990, the Society of Thoracic Surgeons (STS) developed a national cardiac surgical database in an effort to improve surgical quality. Over the past two decades, data from >95% of cardiac surgical procedures across the US have entered this database. Because of its size, the STS database provides a powerful way to measure the quality of a surgeon or a surgical program by comparing their average patient outcomes against national averages. More recently, the database has also been used to predict the risk that a given patient will experience a bad outcome. This prediction relies on statistical models that add up the impact of many different risk factors on the risk of death. Death after heart surgery is mostly caused by the trauma of the procedure leading to the failure of vital organs such as the lungs, liver, or kidneys, so the strongest predictors of operative death are those that signal these organ systems are vulnerable. Any surgeon asked to operate on a patient whose only problem is severe lung dysfunction immediately recognizes the risk. It is a far more challenging task to recognize the risk posed by a collection of modest risk factors, such as mild dysfunction in three or four organ systems in an elderly diabetic. Humans simply do not have the working-memory bandwidth to weigh the impact of many variables at once. A computer armed with the right statistical models is far more capable than the human mind of considering how these multiple variables influence surgical mortality, and the risk score it provides augments the surgical team’s ability to select appropriate cases that are not too high risk for a successful outcome.
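To make this concrete, consider the toy sketch below. It is a logistic-regression-style calculation, not the actual STS model; the risk factors and coefficients are invented purely for illustration. It shows how several modest risk factors, each unremarkable on its own, can compound into a mortality estimate well above baseline in a way that is hard to track in one's head.

```python
# Hypothetical sketch (NOT the actual STS model): a logistic-regression-style
# risk score in which several modest risk factors combine into one probability
# of operative mortality. All factor names and coefficients are invented.
import math

INTERCEPT = -4.6  # baseline log-odds, roughly 1% mortality with no risk factors
WEIGHTS = {
    "age_over_75": 0.6,
    "diabetes": 0.4,
    "mild_renal_dysfunction": 0.5,
    "mild_lung_dysfunction": 0.5,
    "reduced_ejection_fraction": 0.7,
}

def predicted_mortality(risk_factors):
    """Sum the log-odds of the risk factors present, then convert to a probability."""
    log_odds = INTERCEPT + sum(w for name, w in WEIGHTS.items() if risk_factors.get(name))
    return 1.0 / (1.0 + math.exp(-log_odds))

# An elderly diabetic with mild dysfunction in several organ systems:
patient = {"age_over_75": True, "diabetes": True,
           "mild_renal_dysfunction": True, "mild_lung_dysfunction": True}

print(f"{predicted_mortality({}):.1%}")       # ~1.0% with no risk factors
print(f"{predicted_mortality(patient):.1%}")  # modest factors compound to ~6.9%
```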
Despite these strengths, the STS risk calculator has important weaknesses. First, it is highly accurate at predicting that a surgeon operating on 100 patients with similar risk is likely to have 5 of them die, but it is far less able to discriminate exactly which 5 patients those will be. Second, it performs poorly at the extremes of the population, i.e., in very high-risk patients: the tail of the bell-shaped curve often contains too few patients on which to build a statistically valid model with a high level of discrimination. Third, several important risk factors are not included in the STS risk calculation, such as severe calcification of the aorta, a history of chest radiation, liver dysfunction, cognitive impairment, poor nutrition, frailty, pulmonary hypertension, and severe CHF as reflected by an elevated B-type natriuretic peptide. Because these factors independently increase surgical risk, the models in effect assume that such “unmeasured confounders” are present in high-risk patients even when they are not. Finally, operative mortality improves over time, particularly for high-risk patients, and the database models must be recalibrated to reflect this change; yet the STS online risk calculator used by clinicians for a bedside risk estimate is still based on the 2008 STS models, with no recalibration since that time. Evidence has shown that all of these issues cause the STS tool to overestimate mortality risk in high-risk cases.
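The first weakness is worth illustrating. The toy simulation below uses invented numbers, not STS data: 100 similar patients are all given the same 5% predicted risk. The prediction is well calibrated at the group level, yet the score offers no way to discriminate which individual patients will be among the deaths.

```python
# Illustrative simulation (invented numbers, not STS data): good calibration
# at the group level does not imply discrimination at the individual level.
import random

random.seed(0)
N_PATIENTS, PREDICTED_RISK = 100, 0.05  # every patient gets the same 5% score

# Simulate one series of 100 operations on similar-risk patients.
deaths = sum(random.random() < PREDICTED_RISK for _ in range(N_PATIENTS))
print(f"Expected deaths: {N_PATIENTS * PREDICTED_RISK:.0f}, observed deaths: {deaths}")

# The aggregate count lands near the prediction, but every individual carried
# the identical 5% score, so the model cannot single out the unlucky few.
```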
Based on the above, it is logical to conclude that the strengths of the STS database for risk prediction outweigh its weaknesses, except perhaps in one clinical scenario: using the STS online risk calculator to decide whether a patient falls above a high-risk score cutoff. When a patient’s score comes back low risk, our confidence in that estimate can be high and the patient can be confidently reassured. A score that comes back high risk, however, should be viewed skeptically, at least initially. Such an adverse assessment may well be accurate and provide useful information, but the known inaccuracies of the model in this patient subset oblige us to exercise due diligence, particularly when it leads to the conclusion that a patient is too high risk for surgery. We exercise that diligence by asking the following questions:
- Is the clinical team’s general impression of the patient’s risk favorable (i.e., does the patient “pass the eyeball test”)?
- Is the patient free from any important unmeasured risk factors?
- Can the typical approach to surgery be modified to reduce mortality risk?
- Are the patient and family highly motivated to accept risk?
When the answers to these questions are all “yes,” the risk score is likely an overestimate, and it is unfair to use an overestimate as the sole reason to exclude a patient from the benefit of a life-saving operation. Unfortunately, that is exactly the protocol of the Heart Team committee at our hospital: patients are excluded from surgical consideration if a single machine-generated estimate of mortality risk exceeds 8.0%. There is no opportunity for even a discussion of these cases, the very cases that benefit most from the judgment and experience of the multidisciplinary members of the cardiac program. According to our hospital CEO, a score >8.0% puts the final decision on autopilot, with no opportunity for change.
I recognize the tremendous value of computerized risk assessment as an aid to choosing high-risk cases wisely. But “the devil is in the details.” The damning problem with the rigid protocol employed by our hospital is not merely that it rests on a childlike understanding of how databases work. Worse, it needlessly pits the risk assessments of the STS calculator against human judgment, creating an imaginary conflict of machine vs. man straight out of The Terminator or The Matrix. One envisions our administrators preparing for the day when clinicians finally band together behind Schwarzenegger or Reeves to stop the STS machine from oppressing our judgment.
Instead of that comic-book scenario, perhaps we can learn from another high-reliability field struggling with its own man-vs.-machine dilemma: airline pilots and their use of autopilot. Autopilot improves overall airline safety, but some pilots cancel out its benefits by misusing it. Many crash investigations have documented the problems that arise when a pilot’s attitude toward autopilot is “set it and forget it.” The pilots of Asiana Airlines 214, Continental Connection 3407, and Aeroflot 593 all put their blind trust in this tool, standing idly by as it led them into a crash. If the machine says it’s so, it must be true. (Are you starting to see the analogy with the STS risk score?)
Both medicine and aviation would be better served by reframing their challenges with automation not as man vs. machine but as man plus machine. A high-performing team views the STS score and the autopilot as key teammates. Like any teammate, their automated outputs exist in part to challenge our judgments; it is equally our job to challenge theirs. Everyone, even the most brilliant teammate on earth, is fallible, and we are not being good teammates if we accept anything on blind faith.
Above all else, humans (not machines) get the final say. The problems that arise when that rule is not followed are bizarre and tragic. Boeing designed an automated flight-control system (MCAS) for its 737 MAX jets that could intervene without pilot input. Seemingly out of the blue, and acting on faulty sensor signals indicating an abnormal angle of attack that both flight crews could tell was incorrect, MCAS pitched two separate jets downward into the earth, killing everyone on board. Likewise, a recent risk score of >8% triggered an autopilot decision to exclude a salvageable patient from surgical consideration. Like a 737 MAX jet, our patient soon crashed from untreated coronary artery disease while our Heart Team remained willfully blind to the clear inaccuracies of the patient’s risk score.
If we recognize that all team members have their limitations, we will use the automated risk scores when they are likely to be accurate and engage in multidisciplinary debate about the best course of action when they aren’t.