The main goal of this project is to better understand the role drugs play within a critical hospital admission. This is so we can study if there may be a issue of over diagnosing or over prescribing drugs. This could occur due to many reasons such as doctor’s experience, patient characteristics, etc. To approach this problem, the MIMIC-III critical database was used as the primary data source.
Hypothesis: Medical treatment for patients is a risk just as much as it is a reward. Patients recieving medical treatment are at a greater risk of death, not just because of the serverity of the ailment, but from imperfect administrations.
Imperfect administrations may include:
# Get our library dependencies
library(RCurl)
## Loading required package: bitops
library("tidyverse")
## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The datasets were first downloaded locally, then lodaded into R because pulling them directly from the internet took considerable time to run.
The datasets can be found at:
ADMISSIONS - http://www.hpc.med.nyu.edu/~kannak01/MIMIC3/ PRESCRIPTIONS - http://www.hpc.med.nyu.edu/~kannak01/MIMIC3/ PATIENTS - http://www.hpc.med.nyu.edu/~kannak01/MIMIC3/ caregivers - https://github.com/kannan-kasthuri/kannan-kasthuri.github.io/blob/master/Datasets/MIMIC3/caregivers.csv chartevents_1 - https://github.com/kannan-kasthuri/kannan-kasthuri.github.io/blob/master/Datasets/MIMIC3/chartevents_1.csv chartevents_2 - https://github.com/kannan-kasthuri/kannan-kasthuri.github.io/blob/master/Datasets/MIMIC3/chartevents_2.csv chartevents_3 - https://github.com/kannan-kasthuri/kannan-kasthuri.github.io/blob/master/Datasets/MIMIC3/chartevents_3.csv
# Load our datasets
ADMS <- read.csv("C:/Users/vtrea/Desktop/ADMISSIONS.csv", header=TRUE, sep=",")
#ICU <- read.csv("C:/Users/vtrea/Desktop/ICUSTAYS.csv", header=TRUE, sep=",")
#INPUT <- read.csv("C:/Users/vtrea/Desktop/INPUTEVENTS_MV.csv", header=TRUE, sep=",")
PRES <- read.csv("C:/Users/vtrea/Desktop/PRESCRIPTIONS.csv", header=TRUE, sep=",")
#DRGCODES <- read.csv("C:/Users/vtrea/Desktop/drgcodes.csv", header=TRUE, sep=",")
To begin, we first want to isolate the answer to the question:
What causes the most deaths in the ICU?
The ADMISSIONS data from MIMIC-III provides information on a patient such as where they were admitted, what they were admitted for, and whether or not they died from their diagnosis or not. This data set is more useful than the PATIENTS dataset, because the PATIENTS data does not include any form of diagnosis, and we would have to join the dataset with another table in order to get that information. By using ADMISSIONS, for now we can omit that step and solely look at the outcomes per individual visit, not per patient, because a patient could have multiple visits and we want to isolate the diagnostic that led to the patient expiring on that specific visit. For repeat visitors, we can investigate to see whether or not there is a correlation between previous diagnostics and fatal diagnostics at a later admission.
ggplot(data = ADMS) +
geom_bar(mapping = aes(x = DEATHTIME != "", fill=ADMISSION_TYPE)) +
xlab("Patient Died after Admission") +
ylab("Count") +
guides(fill=guide_legend(title="Admission Type"))
From the above plot, we can see that there are about ~6500 patients that expired during admission, most of which unsurprisingly occur within the EMERGENCY admission.
Whenever admitted into a hospital, there are only two outcomes that can happen: you walk out better or you don’t. In a similar fashion we can categorize hospital diagnostics, where we only have tow categories:
In this way, let us categorize them as “Fatal” and “Non-Fatal” Diagnostics. When exploring the effects of drugs on hospital patients, we need t o take both into account, because specialized treatments differ greatly based on severity of the diagnostic.
There are thousands of levels in the diagnosis section, so in order to achieve a meaningful visualization, we use a scatterplot in order to try find any outliers to explore further. By using a scatterplot, we can view what cutoffs to apply to most of the data frequencies in order to get a more meaningful result, letting us see which diagnostics occur the most often.
DEATHADMS <- ADMS[ADMS[, "DEATHTIME"] != "",]
table <- table(DEATHADMS$DIAGNOSIS)
plot(c(table), xlab = "Diagnosis Index", ylab = "Count", main = "Fatal Diagnosis Distribution")
As we can see, we have a few outliers occuring at above the count of 50, so we can filter these out to achieve a better understanding of the main causes of fatalities within the facilities.
table <- table(DEATHADMS$DIAGNOSIS)
#str(table)
x <- c()
y <- c()
rowns <- rownames(table)
rowvals <- c()
for(row in table) {
rowvals <- c(rowvals, row)
}
for(index in 1:length(rowvals)) {
if(rowvals[index] > 60) {
x <- c(rowns[index], x)
y <- c(rowvals[index], y)
}
}
#x
#y
filtered <- matrix(, nrow = 0, ncol = 0)
for(diag in x) {
filtered <- rbind(DEATHADMS[factor(DEATHADMS[, "DIAGNOSIS"]) == diag,], filtered)
}
ggplot(data = filtered) +
geom_bar(mapping = aes(x = DIAGNOSIS, fill=ADMISSION_TYPE)) +
xlab("Patient Diagnosis") +
ylab("Count") +
guides(fill=guide_legend(title="Diagnosis")) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
Here we can now clearly see that most of the fatal diagnostics occur in the EMERGENCY admission, as expected, with hemorrhage, pneumonia, and sepsis being the top diagnostics. We can further filter these results to view only these top three.
table <- table(DEATHADMS$DIAGNOSIS)
#str(table)
x <- c()
y <- c()
rowns <- rownames(table)
rowvals <- c()
for(row in table) {
rowvals <- c(rowvals, row)
}
for(index in 1:length(rowvals)) {
if(rowvals[index] > 150) {
x <- c(rowns[index], x)
y <- c(rowvals[index], y)
}
}
#x
#y
fatalfilt <- matrix(, nrow = 0, ncol = 0)
for(diag in x) {
fatalfilt <- rbind(DEATHADMS[factor(DEATHADMS[, "DIAGNOSIS"]) == diag,], fatalfilt)
}
ggplot(data = fatalfilt) +
geom_bar(mapping = aes(x = DIAGNOSIS, fill=ADMISSION_TYPE)) +
xlab("Patient Diagnosis") +
ylab("Count") +
guides(fill=guide_legend(title="Diagnosis"))
All three diagnostics, hemorrhage, pneumonia, and sepsis have very similar results, with sepsis only slightly outnumbering the rest. For fatal diagnostics we will be using sepsis as it has the most data to work with, however with such a small discrepency any of the other two would be just as viable, however it should be noted that they would then have also different standard treatments.
To find the top most non-fatal diagnoses, we apply the same method. We first look at the scatterplot distribution in order to get an idea of the threshold to filter out diagnostics to get the top non-fatal diagnoses.
NODEATHADMS <- ADMS[ADMS[, "DEATHTIME"] == "",]
table <- table(NODEATHADMS$DIAGNOSIS)
plot(c(table), xlab = "Diagnosis Index", ylab = "Count", main = "Non-Fatal Diagnosis Distribution")
Then, applying our own filter we extract those who only have an occurence greater than 800.
table <- table(NODEATHADMS$DIAGNOSIS)
#str(table)
x <- c()
y <- c()
rowns <- rownames(table)
rowvals <- c()
for(row in table) {
rowvals <- c(rowvals, row)
}
for(index in 1:length(rowvals)) {
if(rowvals[index] > 800) {
x <- c(rowns[index], x)
y <- c(rowvals[index], y)
}
}
#x
#y
nonfatalfilt <- matrix(, nrow = 0, ncol = 0)
for(diag in x) {
nonfatalfilt <- rbind(NODEATHADMS[factor(NODEATHADMS[, "DIAGNOSIS"]) == diag,], nonfatalfilt)
}
ggplot(data = nonfatalfilt) +
geom_bar(mapping = aes(x = DIAGNOSIS, fill=ADMISSION_TYPE)) +
xlab("Patient Diagnosis") +
ylab("Count") +
guides(fill=guide_legend(title="Diagnosis")) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
Here we see a very interesting result. For the top non-fatal diagnostics, newborns greatly outnumber the rest. However, from further information from MIMIC, the newborn diagnostic is registered for a neonatal, and most, if not all, of the other diagnostic data is not on neonatals, so it would not be an accurate comparison. Thus, pneumonia or sepsis would make great viable candidate because they are both close to being the second most absolute survived diagnostic, and also appear as the most common fatal ones as well. In this case, we will pick pneumonia as it is less fatal than sepsis in general.
Now that we have the diagnostics we want to search into, sepsis and pneumonia, we can begin seeing if their death rates have a correlation with the rate at which patients are administered drugs.
In order to explore the correlations between patient deaths and drug usage, we need to join the prescription data with the admission data. Admissions contains information about the patient such as their ID and diagnostic, while precriptions includes what drugs they recieved, if any. According to the MIMIC website, these two databases are linked by the HADM_ID column.
ADMPRES <- merge(x = ADMS, y = PRES, by = "HADM_ID", all.x = TRUE, all.y = FALSE)
Performing a left outer join we make sure we keep patients who were admitted but did not recieve any drugs.
First, we want to explore tests to see whether or not death is related to the presence of a prescription using a Chi-square test. To do so, we need to find the number of people who had drugs and didn’t have drugs, then see how their numbers of deaths compare. To start this, we need remove all duplicate instances of patients from our set.
nodups <- ADMPRES[!duplicated(ADMPRES[, "SUBJECT_ID.x"]),]
By removing duplicates we can make sure we only take one entry per person so we don’t double count a death.
install.packages("plotrix")
Then we can create a 3d pie chart to visualize the distribution of people who had drugs versus people who didn’t recieve them
library(plotrix)
table <- table(is.na(nodups["DRUG"]))
#print(table)
pie3D(table, theta=pi/3, explode=0.2, labels=c("Drugs Used", "No Drugs Used"))
We can see that abouth a fifth of everyone who was admitted didn’t recieve any type of drug. Now we want to set up a matrix in ordeer to run association tests on drug usage and death. We create a 2x2 matrix and then run a chi squared test because we have a small matrix.
nodrug <- nodups[is.na(nodups[, "DRUG"]),]
drug <- nodups[!is.na(nodups[, "DRUG"]),]
death_nodrug <- nodrug[nodrug[, "DEATHTIME"] != "",]
death_drug <- drug[drug[, "DEATHTIME"] != "",]
nodeath_nodrug <- nodrug[nodrug[, "DEATHTIME"] == "",]
nodeath_drug <- drug[drug[, "DEATHTIME"] == "",]
#print(death_nodrug);
#print(death_drug);
#print(nodeath_nodrug);
#print(nodeath_drug);
# Make a data table and define row names and column names
tab <- matrix(c(nrow(death_drug),nrow(nodeath_drug),nrow(death_nodrug),nrow(nodeath_nodrug)),2,2)
rownames(tab)<-c("DEATH","NO DEATH")
colnames(tab)<-c("DRUG|","NO DRUG")
tab
## DRUG| NO DRUG
## DEATH 4446 549
## NO DEATH 34491 7034
# Run chi-square test
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 115.19, df = 1, p-value < 2.2e-16
Thus, from the small p-value we can infer that there is good association between patient deaths (DEATHTIME) and the administration of drugs (DRUG).
However, this result does not completely validate our assumption that prescription accidents are a cause of death. Our Chi squared test only tells us that more often in procedures that use drugs, a patient death occurs. This can be entireley explained by the fact that more risky procedures and life threatening ones would require drugs whereas lesser risk ones would not.
We can further explore our assumption by then testing the correlation both in life threatening and non life threatening cases by now narrowing down our association to the two types of diagnostics we outline earlier: fatal and non-fatal.
From earlier, we found that SEPSIS is the cause of most patient deaths, so from our filtered data frames without duplicates and on certain parameters, we isolate only those who are diagnosed with sepsis, then run a Chi-squared test to check correlation again.
death_nodrugf <- death_nodrug[death_nodrug[, "DIAGNOSIS"] == "SEPSIS",]
death_drugf <- death_drug[death_drug[, "DIAGNOSIS"] == "SEPSIS",]
nodeath_nodrugf <- nodeath_nodrug[nodeath_nodrug[, "DIAGNOSIS"] == "SEPSIS",]
nodeath_drugf <- nodeath_drug[nodeath_drug[, "DIAGNOSIS"] == "SEPSIS",]
#print(death_nodrugf);
#print(death_drugf);
#print(nodeath_nodrugf);
#print(nodeath_drugf);
# Make a data table and define row names and column names
tab <- matrix(c(nrow(death_drugf),nrow(nodeath_drugf),nrow(death_nodrugf),nrow(nodeath_nodrugf)),2,2)
rownames(tab)<-c("DEATH","NO DEATH")
colnames(tab)<-c("DRUG|","NO DRUG")
tab
## DRUG| NO DRUG
## DEATH 184 14
## NO DEATH 548 21
# Run chi-square test
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 3.1163, df = 1, p-value = 0.07751
As we can see, here the p-value has drastically changed, as is now above the 0.05 significance level, meaing we cannot reject the null hypothesis that the odds of dying with drugs are the same as the odds of dying without drugs in the case of fatal diagnostics.
Moving on the the cases of non-fatal diagnostics, we see that PNEUMONIAs are the highest number of admitted diagnosis that doesn’t lead to death. Similarly, we isolate these diagnostics from our filtered data, and run association tests.
death_nodrugn <- death_nodrug[death_nodrug[, "DIAGNOSIS"] == "PNEUMONIA",]
death_drugn <- death_drug[death_drug[, "DIAGNOSIS"] == "PNEUMONIA",]
nodeath_nodrugn <- nodeath_nodrug[nodeath_nodrug[, "DIAGNOSIS"] == "PNEUMONIA",]
nodeath_drugn <- nodeath_drug[nodeath_drug[, "DIAGNOSIS"] == "PNEUMONIA",]
#print(death_nodrugf);
#print(death_drugf);
#print(nodeath_nodrugf);
#print(nodeath_drugf);
# Make a data table and define row names and column names
tab <- matrix(c(nrow(death_drugn),nrow(nodeath_drugn),nrow(death_nodrugn),nrow(nodeath_nodrugn)),2,2)
rownames(tab)<-c("DEATH","NO DEATH")
colnames(tab)<-c("DRUG|","NO DRUG")
tab
## DRUG| NO DRUG
## DEATH 195 18
## NO DEATH 775 35
# Run chi-square test
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 5.0446, df = 1, p-value = 0.0247
Here we can see that the while the p-value has incresed, it still remains below the 0.05 level, meaning we can reject the null hypothesis that states the odds of dying with drugs are the same as the odds of dying without drugs.
As we have seen, from this section we explored if for our hypothesis, there truly exists an association between the presence of drug admission and patient mortality. From the data as a whole, running a Chi-squared test yields an extremely low p-value on the whole data set, which would indicate some association is present. Because this is a critical care database, there are two extrema we should evaluate in order to assert our results: checking the most extreme diagnostics, the ones that produce the highest mortalities, and the ones that produce the lowest mortalities.
Through our investigation of sepsis, we found that when it comes to highest rate of mortalities, we actually cannot accept our original hypothesis, because our sepsis test resulted in a p-value far above the level of significane. This may be due to a few reasons:
Running our tests again, we can see that our p-value for pneumonia, the second highest cause of mortality rates, is under the significance level of 0.05. However, we again run into the issue where our category for now drugs is extremely low, due to the nature of the ailment.
Due to these conditions, it would be wise to re-evaluate our hypothesis. Rather than taking a broad stance on the use of drugs in all cases, we should narrow our scope down to better identify the use of drugs where it is less needed as a cause for patient expiry.
Hypothesis: Patients recieving medical treatment in cases of non-fatal and common diagnoses are at a greater risk of death, not just because of the serverity of the ailment, but from imperfect administrations.
Now that we have established al ink between drug administration and deaths, we want to look into why such an administration would occur. The first place to start looking would be at the person who administers the drug: the caregiver.
In order to pair up the caregiver data with the patient data, we also need to use the chartevents data, which tells us which caregiver attended the patient during a specific event. Chartevents joins on caregivers by the caregiver id, cgid, while admissions joins on the chartevents data on the patient’s HADM_ID.
In order to get the caregiver data joined on the admission / prescription data, we need to first link it to the CHARTEVENTS data on CGID. From there, the joined CHARTEVENTS data can be joined on ADMISSION data on HADM_ID.
CARES <- read.csv("C:/Users/vtrea/Desktop/caregivers.csv", header=TRUE, sep=",")
chart1 <- read.csv("C:/Users/vtrea/Desktop/chartevents_1.csv", header=TRUE, sep=",")
chart2 <- read.csv("C:/Users/vtrea/Desktop/chartevents_2.csv", header=TRUE, sep=",")
chart3 <- read.csv("C:/Users/vtrea/Desktop/chartevents_3.csv", header=TRUE, sep=",")
CHARTS <- rbind(chart1, rbind(chart2, chart3))
# Using our "nodups" variable, we join the data on the data frame defined by us containing admitted patients and whether or not they recieved a drug during their stay, limited only to a unique patient.
#str(CARES)
caregivers <- merge(x = CHARTS, y = CARES, by = "cgid")
caregivers <- caregivers[!duplicated(caregivers[, "hadm_id"]),]
#str(caregivers)
#str(nodups)
#In caregivers data, HADM_ID is written as hadm_id, so we can't merge the two together
colnames(caregivers)[4] <- "HADM_ID"
patients <- merge(x = caregivers, y = nodups, by = "HADM_ID")
patients <- patients[!duplicated(patients[, "subject_id"]),]
nodrugpats <- patients[is.na(patients[, "DRUG"]),]
drugpats <- patients[!is.na(patients[, "DRUG"]),]
#levels(drugpats[, "label"])
#print(patients)
Now with this data comes the question, what could be a factor in a caregiver’s decision to give a patient a drug? Here comes into play the caregiver’s knowledge: does a caregiver know enough about a diagnosis to be able to tell if a drug that has been prescribed to a patient is in the correct form i.e. is it the right type of drug or the right amount? In the data set, we have a column named “label”, which is the caregiver’s title, and we can use this to make assumptions about his/her experience in the field.
With this new data, we have only 100 total observations, so to gather the most data and to make understanding the 80 levels of caregiver labels, we will consolidate them all into only 3 levels: MD, PhD, and other, where MD and PhD show forms of higher education whilst other will be the rest.
However, taking a look at the distribution of the data we do not expect to get any useful results.
ggplot(data = nodrugpats) +
geom_bar(mapping = aes(x = label, fill=ADMISSION_TYPE)) +
xlab("Caregiver Type") +
ylab("Count") +
guides(fill=guide_legend(title="Admission Type"))
ggplot(data = drugpats) +
geom_bar(mapping = aes(x = label, fill=ADMISSION_TYPE)) +
xlab("Caregiver Type") +
ylab("Count") +
guides(fill=guide_legend(title="Admission Type"))
In any case, the first distribution having patients that recieved no drugs and had a caregiver would not be extremely useful anyway, since we do not expect most patients who did not recieve drugs to have a caregiver to administer them. However for the second distribution, drugs and caregivers, we can see there is an insufficient spread of data to be able to come up with a conclusive hypothesis about relating caregiver status and the safe use of drugs for a patient.
With the previous test being inconclusive, we must change our hypothesis and possibly explore alternative reasons to fatalities including drugs within non fatal diagnostics.
Hypothesis: In cases where patients were diagnosed with common diagnoses that have low numbers of mortality, patients were more likely to expire if they are administered drugs. Of these instances, there are potential subgroups relating to a patient such as religion or insurance, that may or may not affect them rejecting drugs in certain cases or recieving additional.
One of the most influential topics when it comes to beliefs is Religion. A person’s religion, if devout, impacts what a person does and believes. Many religions have different ideas about physical, spiritual, and mental energies, and how each of them should be taken care of. Some of these, such as Christian Scientists, do not believe in medical treatment and could therefore ask to reject or minimize any such treatments, with the possibility of fatal consequence, especially if prescribed an antibiotic after treatment and religious belief led to giving up on treatment quicker.
Since the PATIENT MIMIC 3 data set contains and expire flag that indicates if death ocurred even outside of the hospital, such a hypothesis would be valid.
First, let’s begin by assessing the distribution of religion. By plotting it first as a scatterplot, we don’t run the risk of overcrowding the data as we would in a simple bar plot, and can pick out the outliers to analyze.
PATS <- read.csv("C:/Users/vtrea/Desktop/PATIENTS.csv", header=TRUE, sep=",")
names(PATS)[2] <- "SUBJECT_ID.x"
#print(nodups)
# nodups contains the column name SUBJET_ID.x, and must so PATS must be changed from SUBJECT_ID in order to join
patsInfo <- merge(x = PATS, y = nodups, by = "SUBJECT_ID.x")
table <- table(patsInfo["RELIGION"])
plot(c(table), xlab = "Religion Index", ylab = "Count", main = "Religion Distribution")
Now we filter the counts to get all major religions and plot their counts.
table <- table(patsInfo$RELIGION)
#str(table)
x <- c()
y <- c()
rowns <- rownames(table)
rowvals <- c()
for(row in table) {
rowvals <- c(rowvals, row)
}
for(index in 1:length(rowvals)) {
if(rowvals[index] > 2500) {
x <- c(rowns[index], x)
y <- c(rowvals[index], y)
}
}
#x
#y
topRel <- c()
relig <- matrix(, nrow = 0, ncol = 0)
for(r in x) {
topRel <- c(topRel, r)
relig <- rbind(patsInfo[factor(patsInfo[, "RELIGION"]) == r,], relig)
}
ggplot(data = relig) +
geom_bar(mapping = aes(x = RELIGION, fill=ADMISSION_TYPE)) +
xlab("Religion") +
ylab("Count") +
guides(fill=guide_legend(title="Admission Type")) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
install.packages("reshape")
table <- matrix(nrow = 0, ncol = 3)
colnames(table) <- c("Total", "No Drug", "Drug")
for(r in topRel) {
df <- relig[relig[, "RELIGION"] == r,]
df <- df[df[, "DIAGNOSIS"] == "PNEUMONIA",]
nodrugr <- df[is.na(df[, "DRUG"]),]
drugr <- df[!is.na(df[, "DRUG"]),]
table <- rbind(table, c(nrow(df), nrow(nodrugr), nrow(drugr)))
}
rownames(table) <- topRel
Names <- topRel
data=data.frame(cbind(table), Names)
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
# melt the data frame for plotting
data.m <- melt(data, id.vars='Names')
# plot everything
ggplot(data.m, aes(Names, value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
#data
par(mfrow = c(2,3))
for(i in seq(1 : length(topRel))) {
pie(c(data[i,]$No.Drug, data[i,]$Drug), labels = c("No Drug", "Drug"), col=rainbow(2), main =topRel[i])
}
We need to narrow our data down if we are to make any reasonable assumptions. As it stands, a similar percentage of all people of a certain religion tend to not take drugs. However, this comparison is not accurate yet because it is across all diagnoses, and not just one. So we first must reduce our set in order to check for dependence. Let’s go back to our non-fatal diagnosis: PNEUMONIA.
table <- matrix(nrow = 0, ncol = 2)
colnames(table) <- c("Total", "Deaths")
for(r in topRel) {
df <- relig[relig[, "RELIGION"] == r,]
df <- df[df[, "DIAGNOSIS"] == "PNEUMONIA",]
nodrugr <- df[is.na(df[, "DRUG"]),]
drugr <- df[!is.na(df[, "DRUG"]),]
drugdeaths <- drugr[drugr[, "EXPIRE_FLAG"] == 1,]
nodrugdeaths <- nodrugr[nodrugr[, "EXPIRE_FLAG"] == 1,]
table <- rbind(table, c(nrow(drugr), nrow(drugdeaths)))
#table <- rbind(table, c(nrow(nodrugr), nrow(nodrugdeaths)))
}
rowNames <- c()
for(r in topRel) {
#rowNames <- c(rowNames, paste(r, "Drugs"), paste(r, "No Drugs"))
rowNames <- c(rowNames, paste(r, "Drugs"))
}
rownames(table) <- rowNames
table
## Total Deaths
## UNOBTAINABLE Drugs 87 61
## PROTESTANT QUAKER Drugs 112 46
## NOT SPECIFIED Drugs 209 113
## JEWISH Drugs 153 102
## CATHOLIC Drugs 329 193
fisher.test(table)
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 0.1442
## alternative hypothesis: two.sided
Now we see that the p-value is high, meaning that we don’t reject the null hypothesis that Religion and Drug usage are dependent on each other. This means a person’s religion does not influence their decision in taking prescribed medication.
With the correlation between drug usage in non fatal diagnosis and death, through isolating the main religions of people admitted and treated we looked to see if there was a causation due to a person’s religion. After comparing drug intake on the same diagnosis, PNEUMONIA, through a fisher test it’s concluded that a person’s religious view does not in fact impact the likelihood for him to take drug during critical care. This can be explained by a few reasons:
If a person’s religion doesn’t dictate how they act, are there any other cultural significances that could?
A person’s ethnicity may be reflective of the cultural ideas spawned in that area. In the Orient, there is a lot more emphasis on traditional medicinal techniques opposed to the Western modernization.Perhaps by analyzing the admitted patients’ ethnicities, we can find a correlation between certain ethnic groups/beliefs and drug usage.
By first making a scatterplot we can view the distribution of ethnicities.
table <- table(patsInfo["ETHNICITY"])
plot(c(table), xlab = "Ethnic Index", ylab = "Count", main = "Ethnic Distribution")
Filtering out only the occurnces of over 1000, we can get the top 5 ethnic groups admitted.
table <- table(patsInfo$ETHNICITY)
#str(table)
x <- c()
y <- c()
rowns <- rownames(table)
rowvals <- c()
for(row in table) {
rowvals <- c(rowvals, row)
}
for(index in 1:length(rowvals)) {
if(rowvals[index] > 1000) {
x <- c(rowns[index], x)
y <- c(rowvals[index], y)
}
}
#x
#y
topEth <- c()
eth <- matrix(, nrow = 0, ncol = 0)
for(r in x) {
topEth <- c(topEth, r)
eth <- rbind(patsInfo[factor(patsInfo[, "ETHNICITY"]) == r,], eth)
}
ggplot(data = eth) +
geom_bar(mapping = aes(x = ETHNICITY, fill=ADMISSION_TYPE)) +
xlab("Ethnicity") +
ylab("Count") +
guides(fill=guide_legend(title="Admission Type")) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
Using this data, we can melt the multiple bar plots to view the total number of each ethnicity, and the number of drugs and no drug usages relative to each other.
table <- matrix(nrow = 0, ncol = 3)
colnames(table) <- c("Total", "No Drug", "Drug")
for(r in topEth) {
df <- eth[eth[, "ETHNICITY"] == r,]
df <- df[df[, "DIAGNOSIS"] == "PNEUMONIA",]
nodrugr <- df[is.na(df[, "DRUG"]),]
drugr <- df[!is.na(df[, "DRUG"]),]
table <- rbind(table, c(nrow(df), nrow(nodrugr), nrow(drugr)))
}
rownames(table) <- topEth
Names <- topEth
data=data.frame(cbind(table), Names)
# melt the data frame for plotting
data.m <- melt(data, id.vars='Names')
# plot everything
ggplot(data.m, aes(Names, value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
#data
par(mfrow = c(2,3))
for(i in seq(1 : length(topEth))) {
pie(c(data[i,]$No.Drug, data[i,]$Drug), labels = c("No Drug", "Drug"), col=rainbow(2), main =topEth[i])
}
INterestingly enough, most ethnicities have similar ratios of taking drugs, with the exception of those ethnicities which are otherwise unknown or not falling into a main category.
table <- matrix(nrow = 0, ncol = 2)
table2 <- matrix(nrow = 0, ncol = 2)
colnames(table) <- c("Total", "|Deaths")
for(r in topEth) {
df <- eth[eth[, "ETHNICITY"] == r,]
df <- df[df[, "DIAGNOSIS"] == "PNEUMONIA",]
nodrugr <- df[is.na(df[, "DRUG"]),]
drugr <- df[!is.na(df[, "DRUG"]),]
drugdeaths <- drugr[drugr[, "EXPIRE_FLAG"] == 1,]
nodrugdeaths <- nodrugr[nodrugr[, "EXPIRE_FLAG"] == 1,]
table <- rbind(table, c(nrow(drugr), nrow(drugdeaths)))
table2 <- rbind(table2, c(nrow(nodrugr), nrow(nodrugdeaths)))
}
rowNames <- c()
rowNames2 <- c()
for(r in topEth) {
rowNames2 <- c(rowNames2, paste(r, "No Drugs"))
rowNames <- c(rowNames, paste(r, "Drugs"))
}
rownames(table) <- rowNames
rownames(table2) <- rowNames2
table[, c(1, 2)]
## Total |Deaths
## WHITE Drugs 729 438
## UNKNOWN/NOT SPECIFIED Drugs 30 19
## OTHER Drugs 15 8
## HISPANIC OR LATINO Drugs 26 14
## BLACK/AFRICAN AMERICAN Drugs 90 37
## ASIAN Drugs 31 22
fisher.test(table[, c(1, 2)])
##
## Fisher's Exact Test for Count Data
##
## data: table[, c(1, 2)]
## p-value = 0.5184
## alternative hypothesis: two.sided
Unsurprisingly, with such similarities in the ratios of drug usage there isn’t a clear dependency on ethnicity to dictate drug usage. However, the “uknown” and “other” categories were drastically different, hinting that there may still be some areas in which traditional medicines are valued over modern medicines from hospitals.
Other reasons why Ethnicity might not have an impact on drug usage:
If the use of drugs isn’t necessarily a choice in the hospital setting, what else could be indicative?
The sad truth is that not all people have the same standards of living, many are unforunate enough to be put into a setting where they aren’t as economically well of as the rest of the population. One metric we can try get a glimpse of this in is the patient’s insurance. Those with private insurance and don’t need government assistance are generally better off than those without. Using this, we can try determine if a person’s use of drugs is purely a choice, often made by the hospital, or are some unfortunate enough to not be able to accept it.
We can first see if there is any distinguishable difference between people’s differences in insurance: do people of different ethnicities have vastly different insurances?
par(mfrow = c(2,3))
for(r in topEth) {
df <- eth[eth[, "ETHNICITY"] == r,]
df <- df[df[, "DIAGNOSIS"] == "PNEUMONIA",]
table <- table(df$INSURANCE)
pie(table, col=rainbow(length(table)), main=r)
}
The main difference in insurances occurs that blacks have a higher percent of government insurance and hispanics have a higher percent of medicaid. These observations shouldn’t skew the data.
Now, we can run an examination on solely the insurance data to find out if different insurance types offer different types of drug treatments.
table <- matrix(nrow = 0, ncol = 2)
colnames(table) <- c("Total", "Deaths")
insurances <- levels(patsInfo$INSURANCE)
for(r in insurances) {
df <- eth[eth[, "INSURANCE"] == r,]
df <- df[df[, "DIAGNOSIS"] == "PNEUMONIA",]
nodrugr <- df[is.na(df[, "DRUG"]),]
drugr <- df[!is.na(df[, "DRUG"]),]
drugdeaths <- drugr[drugr[, "EXPIRE_FLAG"] == 1,]
nodrugdeaths <- nodrugr[nodrugr[, "EXPIRE_FLAG"] == 1,]
table <- rbind(table, c(nrow(drugr), nrow(drugdeaths)))
}
rowNames <- c()
for(r in insurances) {
#rowNames <- c(rowNames, paste(r, "Drugs"), paste(r, "No Drugs"))
rowNames <- c(rowNames, paste(r, "Drugs"))
}
rownames(table) <- rowNames
table
## Total Deaths
## Government Drugs 18 0
## Medicaid Drugs 70 36
## Medicare Drugs 647 431
## Private Drugs 183 71
## Self Pay Drugs 3 0
fisher.test(table)
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 4.675e-06
## alternative hypothesis: two.sided
Running a fisher test we see that we have a miniscule p-value, meaning that we reject the null hypothesis that the odds of having drugs on different types of insurances are the same.
Seeing that ethnicities did not bear any strong correlation with drug usage, it is not surprising to see that insurances within ethnicities are rather evenly spread even when there appears to be a correlation between insurance type and the administration of drugs.
We can see their percentages by plotting them in a pie chart:
par(mfrow = c(2,3))
pie(c(table[1,2], table[1,1] - table[1,2]), labels=c("Deaths", "Non Deaths"), col=rainbow(2), main="Government")
pie(c(table[2,2], table[2,1] - table[2,2]), labels=c("Deaths", "Non Deaths"), col=rainbow(2), main="Medicaid")
pie(c(table[3,2], table[3,1] - table[3,2]), labels=c("Deaths", "Non Deaths"), col=rainbow(2), main="Medicare")
pie(c(table[4,2], table[4,1] - table[4,2]), labels=c("Deaths", "Non Deaths"), col=rainbow(2), main="Private")
pie(c(table[5,2], table[5,1] - table[5,2]), labels=c("Deaths", "Non Deaths"), col=rainbow(2), main="Self Pay")
As we can see, Self Pay and Government have no deaths, however this should be taken lightly because there are almost no data points to begin with. Of the three remaining, Medicare has the highest rate of Deaths when it comes to the use of drugs in treatment, with medicaid and private following.
This agrees with our assumption that those who have private insurances are usually better well off, able to afford any care they might need. With the other forms of insurance, there can be complications with securing the proper amount of treatment leading to mortalities possibly due to the incorrect use of drugs.
However, we shouldn’t bar the deaths occuring with drugs only to insurance type and economic status of a patient, there is yet one important category we can explore: age. Medicare is primarily focused towards those above the age of 65, so we must examine this aspect of insurance as well.
#Takes a while to install
install.packages("eeptools")
First, let’s take a look at the age distribution using gender as a way to split the data into multiple bucekts. Unforunately, there is no clumn simply labeled “age”, but MIMIC provides us a way to create it. According to the page, a patient’s age is given by the difference between their first admission time and their date of birth. Therefore, we can obtain age by using a table which is made up from the join of admissions, which contains the admission time, and patients, which contains the date of birth. Using these, we can calculate the age of the patients, put into a vector and then bind the vector as a column to the end of our data set.
library(eeptools)
subset <- patsInfo[patsInfo[, "DIAGNOSIS"] == "PNEUMONIA",]
allAges <- c()
for(r in 1:nrow(subset)) {
x <- c(as.Date(subset[r, "DOB"]), as.Date(subset[r, "ADMITTIME"]))
ages <- age_calc(x[1], x[2], units = "years")
allAges <- c(allAges, ages)
}
withAge <- cbind(subset, allAges)
withAge <- withAge[withAge[,"allAges"] > 10,]
withAge <- withAge[withAge[,"allAges"] < 100,]
boxplot(allAges~GENDER,data=withAge, main="Age Distribution",
xlab="Gender", ylab="Age")
We can see that actually most of our patient data occurs for people who are later in life rather than those who are young. If the distribution is spread heavily between old and young, we could be seeing drugs as a cause of death in those who are older and not younger, which could have to do with simply the way their body reacts to the medicine.
To understand these ages better, we should first reduce the age column into three distinct age groups: 20-39, 40-59, and 60+.
ageGroups <- c("20-39 drugs", "40-59 drugs", "60+ drugs")
g1 <- 0
g2 <- 0
g3 <- 0
g1d <- 0
g2d <- 0
g3d <- 0
withAgeD <- withAge[!is.na(withAge[, "DRUG"]),]
for(r in 1:nrow(withAgeD)) {
if(withAgeD[r, "allAges"] >= 60) {
g3 <- g3 + 1
if(withAgeD[r, "EXPIRE_FLAG"] == "1") {
g3d <- g3d + 1
}
} else if(withAgeD[r, "allAges"] >= 40) {
g2 <- g2 + 1
if(withAgeD[r, "EXPIRE_FLAG"] == "1") {
g2d <- g2d + 1
}
} else if(withAgeD[r, "allAges"] >= 20) {
g1 <- g1 + 1
if(withAgeD[r, "EXPIRE_FLAG"] == "1") {
g1d <- g1d + 1
}
}
}
matrix <- rbind(c(g1, g1d), c(g2, g2d), c(g3, g3d))
rownames(matrix) <- ageGroups
colnames(matrix) <- c("Total", "Deaths")
ageGroups2 <- c("20-39 no drugs", "40-59 no drugs", "60+ no drugs")
ng1 <- 0
ng2 <- 0
ng3 <- 0
ng1d <- 0
ng2d <- 0
ng3d <- 0
nwithAgeD <- withAge[is.na(withAge[, "DRUG"]),]
for(r in 1:nrow(nwithAgeD)) {
if(nwithAgeD[r, "allAges"] >= 60) {
ng3 <- ng3 + 1
if(nwithAgeD[r, "EXPIRE_FLAG"] == "1") {
ng3d <- ng3d + 1
}
} else if(nwithAgeD[r, "allAges"] >= 40) {
ng2 <- ng2 + 1
if(nwithAgeD[r, "EXPIRE_FLAG"] == "1") {
ng2d <- ng2d + 1
}
} else if(nwithAgeD[r, "allAges"] >= 20) {
ng1 <- ng1 + 1
if(nwithAgeD[r, "EXPIRE_FLAG"] == "1") {
ng1d <- ng1d + 1
}
}
}
matrix2 <- rbind(c(ng1, ng1d), c(ng2, ng2d), c(ng3, ng3d))
rownames(matrix2) <- ageGroups2
colnames(matrix2) <- c("Total", "Deaths")
In doing so, we can then see the ratios of deaths containing drugs of each age gropu more easily.
par(mfrow = c(2,3))
pie(c(ng1, g1), labels = c("No Drug", "Drug"), col=rainbow(2), main ="20-39")
pie(c(ng2, g2), labels = c("No Drug", "Drug"), col=rainbow(2), main ="40-59")
pie(c(ng3, g3), labels = c("No Drug", "Drug"), col=rainbow(2), main ="60+")
pie(c(ng1d, g1d), labels = c("No Drug", "Drug"), col=rainbow(2), main ="20-39 Deaths")
pie(c(ng2d, g2d), labels = c("No Drug", "Drug"), col=rainbow(2), main ="40-59 Deaths")
pie(c(ng3d, g3d), labels = c("No Drug", "Drug"), col=rainbow(2), main ="60+ Deaths")
Interestingly enough, across all age groups we can see that generally the same percent of people who are diagnosed use drugs versus not using drugs. Of those that died, the younger population seems to have a higher percent of deaths when not using drugs.
matrix
## Total Deaths
## 20-39 drugs 50 9
## 40-59 drugs 214 87
## 60+ drugs 584 382
fisher.test(matrix)
##
## Fisher's Exact Test for Count Data
##
## data: matrix
## p-value = 5.905e-06
## alternative hypothesis: two.sided
Testing our matrix which contains the number of people who were treated with drugs and the number that died as a result of the diagnostic, we get a ver ysmall p-value, indicating that there is a relationship between the age group and death including drugs.
Our results point that age is a factor for people who have died from a diagnostic and were administered drugs. Currently, it is not particularly clear whether as a person gets older, drugs contribute to the cause of death, or just that ailment severity increases when getting older. In order to further examine this, we can try compare percentages of drug treatments in those who did not pass.
g1 <- 0
g2 <- 0
g3 <- 0
g1nd <- 0
g2nd <- 0
g3nd <- 0
withAgeD <- withAge[!is.na(withAge[, "DRUG"]),]
for(r in 1:nrow(withAgeD)) {
if(withAgeD[r, "allAges"] >= 60) {
g3 <- g3 + 1
if(withAgeD[r, "EXPIRE_FLAG"] == "0") {
g3nd <- g3nd + 1
}
} else if(withAgeD[r, "allAges"] >= 40) {
g2 <- g2 + 1
if(withAgeD[r, "EXPIRE_FLAG"] == "0") {
g2nd <- g2nd + 1
}
} else if(withAgeD[r, "allAges"] >= 20) {
g1 <- g1 + 1
if(withAgeD[r, "EXPIRE_FLAG"] == "0") {
g1nd <- g1nd + 1
}
}
}
ng1 <- 0
ng2 <- 0
ng3 <- 0
ng1nd <- 0
ng2nd <- 0
ng3nd <- 0
nwithAgeD <- withAge[is.na(withAge[, "DRUG"]),]
for(r in 1:nrow(nwithAgeD)) {
if(nwithAgeD[r, "allAges"] >= 60) {
ng3 <- ng3 + 1
if(nwithAgeD[r, "EXPIRE_FLAG"] == "0") {
ng3nd <- ng3nd + 1
}
} else if(nwithAgeD[r, "allAges"] >= 40) {
ng2 <- ng2 + 1
if(nwithAgeD[r, "EXPIRE_FLAG"] == "0") {
ng2nd <- ng2nd + 1
}
} else if(nwithAgeD[r, "allAges"] >= 20) {
ng1 <- ng1 + 1
if(nwithAgeD[r, "EXPIRE_FLAG"] == "0") {
ng1nd <- ng1nd + 1
}
}
}
In doing so, we can then see the ratios of deaths containing drugs of each age group more easily.
par(mfrow = c(1,3))
pie(c(ng1nd, g1nd), labels = c("No Drug", "Drug"), col=rainbow(2), main ="20-39 No Deaths")
pie(c(ng2nd, g2nd), labels = c("No Drug", "Drug"), col=rainbow(2), main ="40-59 No Deaths")
pie(c(ng3nd, g3nd), labels = c("No Drug", "Drug"), col=rainbow(2), main ="60+ No Deaths")
print(paste("Chance of Surviving 20-39", ng1nd*100/ng1))
## [1] "Chance of Surviving 20-39 66.6666666666667"
print(paste("Chance of Surviving 40-59", ng2nd*100/ng2))
## [1] "Chance of Surviving 40-59 9.09090909090909"
print(paste("Chance of Surviving 60+", ng3nd*100/ng3))
## [1] "Chance of Surviving 60+ 15.1515151515152"
Our graph tells us that middle aged people survived more when they had drugs, however we see an increase in cases in older people (60+) where they survived without the use of any drugs. We see as age goes from young to middle aged, that the chance to survive actually decreases, with those who survived almost all having taken drugs for their illness, which would normally point to mean that as age increases, the body actually needs drug administrations to help it survive, which is not what our hypothesis states. However, if this were true we would see an increase in the ratio of drug usage in those who survived when going from middle aged to old groups while the percent of being cured would go down further, because as the body gets weaker, and drug strength remains constant, we wouldn’t expect as many people to make it. We see quite the opposite however, that more people survive when older and more do so without the use of drugs.
Why is this?
In our results from earlier about the percent of deaths and how they either were or were not administered drugs, we saw that younger people had a higher percent of death in those that were below 60 than above it. This could be due to the fact that people may think younger patients have a higher chance of surviving, because they are younger they are more “full of life” and might not treat them with drugs properly, possibly treating them with too little. As people get older, they expect the chance of death to be higher because of their weakened state, and take more precaution in administering the right kind of drugs because their state is more fragile and responsive.
Hypothesis: Medical treatment for patients is a risk just as much as it is a reward. Patients recieving medical treatment for otherwise recoverable diagnoses are at a greater risk of death, not just because of the serverity of the ailment, but from imperfect administrations that may occur as a consequence of a patient’s happenstance.
Our above analysis was conducted using two of the most prevalent diagnostics: PNEUMONIA, and SEPSIS. In our analysis, we treated SEPSIS as a primary cause of fatalities, and PNEUMONIA as a primary diagnosis that could be recovered from without intense intervention. bY using repeated assocation tests, we found which certain circumstances led patients to be in this heightened state of danger. The two most prevalent results were:
Due to these results, we can accept our hypothesis that there is in fact some inaccurate drug treatments that may increase the chances of patient expiry.
THe usefulness of this statement is to show the precision with which a patient must be treated with. Even though drugs are created to help fix ailments, abusing them can prove to be fatal as well. With bacteria also becoming more resistant to drugs, and different types of drugs being needed to be created every year to combat them, it’s important to be aware that there is no fixed solution on treating illnesses. A lot of research should go not only into how to fight something, but in how to defend against it.
There are a few areas of further work that could be done to better understand the circumstances surrounding the adequate administration of drugs:
In order to validate these reasonings, other tests would also have to be conducted. Most notably, they must be tested on other diagnoses as well. When using PNEUMONIA as the main diagnostic for recoverable diagnoses, I was at first skeptical for finding any cases where a patient survived without the administration of drugs being that pneumonia also has a bacterial strain that can be deadly. In order to better validate with pneumonia, two separate tests would have to be conducted: using viral pneumonia and bacterial pneumonia. However, in the case of bacterial pneumonia, such data that would show cases without drug usage might not be possible since a standard treatment of bacterial pneumonia is the administration of antibiotics. Then, to get a better idea for the use of drugs, one would have to go deeper into just what types of drugs are administered, their quantities, and the patient’s current conditions upon recieving the treatment.
Conversely, other tests for validation can be performed on diagnoses where drugs can be seen as purely optional (except for in extreme cases of course) such as more physical rather than biological accidents. In all cases, further anaylsis could be conducted on the effect drugs can have on a person’s chemical levels that can be related to death.