'Troubling' study finds Google's kidney disease-predicting AI performs worse in women—and may not have a quick fix

Acute kidney failure is a common condition among people who have already been hospitalized with another critical condition. It develops rapidly, over the span of just a few days, and if left untreated can lead to permanent kidney damage or death.

That’s why Google’s DeepMind division and the U.S. Department of Veterans Affairs were met with such excitement in 2019 when they unveiled an artificial intelligence system that a study had shown could predict acute kidney injury up to two days in advance.

But another study published this month has found that the AI not only is less effective when applied to female patients, but also may not be easily fixed simply by adjusting its sample size and gender representation.

The DeepMind and VA researchers acknowledged this potential issue in the original study: “Female patients comprised 6.38% of patients in the dataset, and model performance was lower for this demographic,” they wrote at the time, though their findings were limited to patients in the earlier stages of acute kidney failure.

The more recent analysis of the deep learning AI, led by researchers from the University of Michigan, expanded on those findings, concluding that the model performed worse for female patients across all levels of acute kidney injury severity. Among stage 3 patients, for example, the AI achieved an area under the curve of about 84% when applied to male patients, but only 71% for female patients.
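For readers unfamiliar with the metric, area under the curve can be read as the chance that a model gives a higher risk score to a randomly chosen patient who develops the condition than to one who does not. The sketch below shows one way such a figure might be computed separately for each group. It is an illustration only, not the researchers' method: the function names `auc` and `auc_by_group` are hypothetical, and the toy records are made up and do not come from either study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_group(records):
    """Compute AUC separately for each group label.

    `records` is an iterable of (group, true_outcome, risk_score) tuples.
    """
    grouped = {}
    for group, label, score in records:
        labels, scores = grouped.setdefault(group, ([], []))
        labels.append(label)
        scores.append(score)
    return {g: auc(ls, ss) for g, (ls, ss) in grouped.items()}


# Made-up toy data for illustration only: (group, outcome, model score).
toy_records = [
    ("male", 1, 0.9), ("male", 1, 0.7), ("male", 0, 0.4), ("male", 0, 0.2),
    ("female", 1, 0.8), ("female", 1, 0.3), ("female", 0, 0.5), ("female", 0, 0.1),
]
result = auc_by_group(toy_records)
print(result)
```

In this toy data the model ranks every male patient correctly but misorders one female pair, which is the kind of gap a per-group breakdown is meant to surface.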

After adding data from more women into the set used to train the model, the discrepancies in its performance were “largely corrected” among a general population, the researchers wrote in this month’s study. But the issue persisted when the AI was applied to a cohort made up only of veterans, they found, despite the improved controls for gender.

In fact, the model performed even worse with the corrections in place: It reached an area under the curve of just under 83% for men with stage 3 acute kidney failure and only around 69% for their female counterparts.

That finding is “troubling,” the researchers wrote, since it points to greater underlying issues with the model than simple underrepresentation. Instead, they continued, “other factors such as practice patterns or patient characteristics for females treated at the VA may account for this difference.”

Based on their research, the former is more likely, they wrote: The study suggests that female veterans may be receiving significantly different treatment than men at VA facilities.

The model hasn’t yet been put into widespread use, and in a statement sent to Stat regarding the latest findings, the VA said it is “continuing to study various approaches before making a determination on the different models’ efficacy and/or suitability for any specific uses.”