As a part of final project for the Data Visualization course, I would like to provide graphical analysis of the hospital charges data for 30 most common outpatient and 100 most common inpatient services at various hospitals across the USA.
The data is present at the following URL:
https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3
The main aim of this project is to develop a data visualization, which can be used to explore treatments data set.
This document explains the data cleaning and pre-processing steps performed to make the data suitable for visualizations using d3.js. At the end of this process, we will create a series of .csv files, which can be readily analyzed by various graphical tools.
Before you knit this rmd, make sure that you installed the following R packages:
knitr
stringr
dplyr
RDSTK
Also when you generate HTML document using this RMD, the process may run for some time, since one of the code blocks given below will fetch the latitude and longitude of all hospital addresses using an API, and this code block runs for some tome (5 minutes approximately).
I downloaded this file, to a location in my computer, and read the file’s contents into a data frame.
#setwd("C:/Users/Sekhar/Documents/R Programs/Data Visualization/Project")
df <- read.csv("Hospital_Charges.csv")
df <- data.frame(df)
Here are the variable names:
names(df)
## [1] "DRG.Definition"
## [2] "Provider.Id"
## [3] "Provider.Name"
## [4] "Provider.Street.Address"
## [5] "Provider.City"
## [6] "Provider.State"
## [7] "Provider.Zip.Code"
## [8] "Hospital.Referral.Region.Description"
## [9] "Total.Discharges"
## [10] "Average.Covered.Charges"
## [11] "Average.Total.Payments"
## [12] "Average.Medicare.Payments"
The variables description is given below:
variable <- c("DRG Definition",
"Provider Id",
"Provider Name",
"Provider Street Address",
"Provider City",
"Provider State",
"Provider Zip Code",
"Hospital Referral Region Description",
"Total Discharges" ,
"Average Covered Charges",
"Average Total Payments" ,
"Average Medicare Payments"
)
description <- c(
"Treatment name"
,"Hospital ID"
,"Hospital Name"
,"Hospital Address"
,"Hospital City"
,"Hospital State"
,"Hospital ZIP Code"
,"Unknown"
,"Number of discharges"
,"Average Billed to insurance"
,"Average Payments made by the patient"
,"Average Payments made by the Medicare"
)
df_1 <- data.frame(variable, description)
library(knitr)
kable(df_1)
variable | description |
---|---|
DRG Definition | Treatment name |
Provider Id | Hospital ID |
Provider Name | Hospital Name |
Provider Street Address | Hospital Address |
Provider City | Hospital City |
Provider State | Hospital State |
Provider Zip Code | Hospital ZIP Code |
Hospital Referral Region Description | Unknown |
Total Discharges | Number of discharges |
Average Covered Charges | Average Billed to insurance |
Average Total Payments | Average Payments made by the patient |
Average Medicare Payments | Average Payments made by the Medicare |
We will use the variable “Average Covered Charges” to represent the Average amount charged by a hospital for a specific treatment. We will ignore the other average variables, since “Average Covered Charges” correctly reflects the treatment cost.
Finally we will just consider the following variables:
c("DRG Definition",
"Provider Id",
"Provider Name",
"Provider Street Address",
"Provider City",
"Provider State",
"Provider Zip Code",
"Average Covered Charges"
)
## [1] "DRG Definition" "Provider Id"
## [3] "Provider Name" "Provider Street Address"
## [5] "Provider City" "Provider State"
## [7] "Provider Zip Code" "Average Covered Charges"
We will rename these variables to the following:
variable <- c("DRG Definition",
"Provider Id",
"Provider Name",
"Provider Street Address",
"Provider City",
"Provider State",
"Provider Zip Code",
"Average Covered Charges"
)
renamed_variable <- c("Treatment","Hospital_ID", "Hospital_Name", "Hospital_Street_Address",
"Hospital_City","Hospital_State","Hospital_ZIP","Charges")
var_map <- data.frame(variable,renamed_variable)
kable(var_map)
variable | renamed_variable |
---|---|
DRG Definition | Treatment |
Provider Id | Hospital_ID |
Provider Name | Hospital_Name |
Provider Street Address | Hospital_Street_Address |
Provider City | Hospital_City |
Provider State | Hospital_State |
Provider Zip Code | Hospital_ZIP |
Average Covered Charges | Charges |
Let us modify the data frame to ignore the unwanted variables:
df <- df[,c(1,2,3,4,5,6,7,10)]
Rename the variables:
names(df) <- c("Treatment","Hospital_ID", "Hospital_Name", "Hospital_Street_Address",
"Hospital_City","Hospital_State","Hospital_ZIP","Charges")
Let us display a sample rows from the data frame:
head(df)
## Treatment Hospital_ID
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10023
## Hospital_Name Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Hospital_City Hospital_State Hospital_ZIP Charges
## 1 DOTHAN AL 36301 $32963.07
## 2 BOAZ AL 35957 $15131.85
## 3 FLORENCE AL 35631 $37560.37
## 4 BIRMINGHAM AL 35235 $13998.28
## 5 ALABASTER AL 35007 $31633.27
## 6 MONTGOMERY AL 36116 $16920.79
In the above display , the Treatment variable has the treatment code and description combined. We will separate this into two parts: the “Treatment_ID” and “Treatment_Description”.
l <- strsplit(as.vector(df$Treatment)," - ")
df$Treatment_ID <- do.call(rbind,l)[,1]
df$Treatment_Description <- do.call(rbind,l)[,2]
df <- df[,c(2,9,10,3:8)]
df$Charges <- as.numeric(substring(df$Charges,2))
df$Treatment_ID <- as.numeric(df$Treatment_ID)
head(df)
## Hospital_ID Treatment_ID Treatment_Description
## 1 10001 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2 10005 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3 10006 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4 10011 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5 10016 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6 10023 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## Hospital_Name Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Hospital_City Hospital_State Hospital_ZIP Charges
## 1 DOTHAN AL 36301 32963.07
## 2 BOAZ AL 35957 15131.85
## 3 FLORENCE AL 35631 37560.37
## 4 BIRMINGHAM AL 35235 13998.28
## 5 ALABASTER AL 35007 31633.27
## 6 MONTGOMERY AL 36116 16920.79
Some ZIP codes in the data set have just 4 digits (since the ZIP codes beginning with 0 (such as 08536) are represented as numbers, and the leading 0 is dropped). We need to prefix zero to such zip codes:
correct_ZIP <- function(x)
{
l <- nchar(x)
if(l < 5)
{
return (paste(paste(rep("0",5-l),sep=""),x,sep=""))
}
else
return (x)
}
df$Corrected_ZIP <- sapply(df$Hospital_ZIP,FUN="correct_ZIP")
df <- data.frame(df)
Let us create another variable in the data frame, to contain the complete hospital’s address:
#df$Address <- do.call(paste, c(df[c("Hospital_Name", "Hospital_Street_Address", "Hospital_City","Hospital_State","Corrected_ZIP")], sep = ","))
df$Address <- do.call(paste, c(df[c("Hospital_Street_Address", "Hospital_City","Hospital_State","Corrected_ZIP")], sep = ","))
df$Alt_Address <- do.call(paste, c(df[c("Hospital_Name", "Hospital_City","Hospital_State","Corrected_ZIP")], sep = ","))
library(stringr)
df$Address <- str_replace(df$Address,"&"," ")
df$Alt_Address <- str_replace(df$Alt_Address,"&"," ")
df$Address <- str_replace(df$Address,"#"," ")
df$Alt_Address <- str_replace(df$Alt_Address,"#"," ")
#gsub()
head(df)
## Hospital_ID Treatment_ID Treatment_Description
## 1 10001 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2 10005 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3 10006 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4 10011 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5 10016 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6 10023 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## Hospital_Name Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Hospital_City Hospital_State Hospital_ZIP Charges Corrected_ZIP
## 1 DOTHAN AL 36301 32963.07 36301
## 2 BOAZ AL 35957 15131.85 35957
## 3 FLORENCE AL 35631 37560.37 35631
## 4 BIRMINGHAM AL 35235 13998.28 35235
## 5 ALABASTER AL 35007 31633.27 35007
## 6 MONTGOMERY AL 36116 16920.79 36116
## Address
## 1 1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2 2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3 205 MARENGO STREET,FLORENCE,AL,35631
## 4 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5 1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6 2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
## Alt_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301
## 2 MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631
## 4 ST VINCENT'S EAST,BIRMINGHAM,AL,35235
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116
Let us add latitude and longitude variables for each of the hospital’s address to the data frame.
We have duplicate addresses in the data frame (since a hospital is represented many times in the data frame, for different treatments):
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
distinct_address <- df %>%
distinct(Hospital_ID, Address)
head(distinct_address)
## Hospital_ID Treatment_ID Treatment_Description
## 1 10001 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2 10005 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3 10006 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4 10011 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5 10016 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6 10023 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## Hospital_Name Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Hospital_City Hospital_State Hospital_ZIP Charges Corrected_ZIP
## 1 DOTHAN AL 36301 32963.07 36301
## 2 BOAZ AL 35957 15131.85 35957
## 3 FLORENCE AL 35631 37560.37 35631
## 4 BIRMINGHAM AL 35235 13998.28 35235
## 5 ALABASTER AL 35007 31633.27 35007
## 6 MONTGOMERY AL 36116 16920.79 36116
## Address
## 1 1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2 2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3 205 MARENGO STREET,FLORENCE,AL,35631
## 4 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5 1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6 2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
## Alt_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301
## 2 MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631
## 4 ST VINCENT'S EAST,BIRMINGHAM,AL,35235
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116
distinct_address <- distinct_address[c("Hospital_ID", "Address","Alt_Address")]
distinct_address[grepl("box",distinct_address$Address,ignore.case=TRUE),]$Address <-
distinct_address[grepl("box",distinct_address$Address,ignore.case=TRUE),]$Alt_Address
head(distinct_address)
## Hospital_ID Address
## 1 10001 1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2 10005 2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3 10006 205 MARENGO STREET,FLORENCE,AL,35631
## 4 10011 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5 10016 1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6 10023 2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
## Alt_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301
## 2 MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631
## 4 ST VINCENT'S EAST,BIRMINGHAM,AL,35235
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116
So the \(distinct\mbox{_}address\) data frame contains the addresses of all the hospitals in the data frame, represented only once.
We will use the http://www.datasciencetoolkit.org API to get the latitude and longitude details of the hospital’s address. The given R code below will run for a while, since we have 3337 distinct addresses. This code uses the function street2coordinates of RDSTK package (to query http://www.datasciencetoolkit.org).
#install.packages("RDSTK")
library(RDSTK)
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## Loading required package: rjson
## Loading required package: RCurl
## Loading required package: bitops
lat <- vector()
lon <- vector()
print("The progress of the job is displayed below:")
## [1] "The progress of the job is displayed below:"
for(i in 1:nrow(distinct_address))
{
r <- tryCatch(street2coordinates(distinct_address$Address[i],session=getCurlHandle()),error=function(e) NULL)
if(is.null(r))
{
r <- tryCatch(street2coordinates(distinct_address$Alt_Address[i],session=getCurlHandle()),error=function(e) NULL)
if(is.null(r)){
lat[i] <- 0
lon[i] <- 0
}
else{
if(nrow(r) == 0){
lat[i] <- 0
lon[i] <- 0
}
else {
lat[i] <- r$latitude
lon[i] <- r$longitude
}
}
}
else{
if(nrow(r) == 0)
{
lat[i] <- 0
lon[i] <- 0
}
else{
lat[i] <- r$latitude
lon[i] <- r$longitude
}
}
if(i%%100 == 0)
print(i)
}
## [1] 100
## [1] 200
## [1] 300
## [1] 400
## [1] 500
## [1] 600
## [1] 700
## [1] 800
## [1] 900
## [1] 1000
## [1] 1100
## [1] 1200
## [1] 1300
## [1] 1400
## [1] 1500
## [1] 1600
## [1] 1700
## [1] 1800
## [1] 1900
## [1] 2000
## [1] 2100
## [1] 2200
## [1] 2300
## [1] 2400
## [1] 2500
## [1] 2600
## [1] 2700
## [1] 2800
## [1] 2900
## [1] 3000
## [1] 3100
## [1] 3200
## [1] 3300
distinct_address$Latitude <- lat
distinct_address$Longitude <- lon
head(distinct_address)
## Hospital_ID Address
## 1 10001 1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2 10005 2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3 10006 205 MARENGO STREET,FLORENCE,AL,35631
## 4 10011 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5 10016 1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6 10023 2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
## Alt_Address Latitude Longitude
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301 31.21545 -85.36149
## 2 MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957 34.22156 -86.15946
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631 34.79516 -87.68506
## 4 ST VINCENT'S EAST,BIRMINGHAM,AL,35235 33.58676 -86.68096
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007 33.24504 -86.81598
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116 32.32724 -86.27519
As per the logic of the program, if we did not find the latitude and longitude of an address, then we populated the latitude and longitude with zero values. Let us identify how many such rows we have.
nrow(distinct_address[(distinct_address$Longitude == 0 & distinct_address$Latitude == 0),])
## [1] 12
So we have 12 rows, which do not have latitude and longitude. We will ignore such rows when we do geographical analysis using maps.
Let us append the latitude and longitude information to the addresses in the data frame that has all the hospital charges:
df <- inner_join(df,distinct_address,by="Hospital_ID" )
df <- df[,c(1:12,15,16)]
names(df)[c(3,11:12)] <- c("Treatment_Description","Address","Alt_ddress")
head(df)
## Hospital_ID Treatment_ID Treatment_Description
## 1 10001 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2 10005 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3 10006 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4 10011 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5 10016 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6 10023 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## Hospital_Name Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Hospital_City Hospital_State Hospital_ZIP Charges Corrected_ZIP
## 1 DOTHAN AL 36301 32963.07 36301
## 2 BOAZ AL 35957 15131.85 35957
## 3 FLORENCE AL 35631 37560.37 35631
## 4 BIRMINGHAM AL 35235 13998.28 35235
## 5 ALABASTER AL 35007 31633.27 35007
## 6 MONTGOMERY AL 36116 16920.79 36116
## Address
## 1 1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2 2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3 205 MARENGO STREET,FLORENCE,AL,35631
## 4 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5 1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6 2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
## Alt_ddress Latitude Longitude
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301 31.21545 -85.36149
## 2 MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957 34.22156 -86.15946
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631 34.79516 -87.68506
## 4 ST VINCENT'S EAST,BIRMINGHAM,AL,35235 33.58676 -86.68096
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007 33.24504 -86.81598
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116 32.32724 -86.27519
Let us get the percentile ranking of charges grouped by treatment:
df <- df %>%
group_by(Treatment_ID) %>%
mutate(percentile = rank(Charges)/length(Charges))
df_1 <- data.frame(Treatment_ID=df$Treatment_ID,Charges=df$Charges)
df_1 <- df_1 %>%
group_by(Treatment_ID) %>%
mutate(rank = rank(Charges))
max(df_1$rank/length(df_1$rank)*100)
## [1] 100
typeof(df)
## [1] "list"
Let us save this to a file:
write.csv(df,file="Treatment_modified.csv",row.names=FALSE)
Let us identify how many rows of the main data frame has Latitude and Longitude as 0:
nrow(df[(df$Latitude == 0 & df$Longitude == 0),])
## [1] 517
Hence we have to eliminate 517 rows when we do geographical data analysis using maps. This count of records (with 0 latitude and longitude) is not significant, when compared to the number of records in the actual data set 163065
We will create another file, that has distinct treatment names, and their codes. This file will be used later to populate drop down lists in the d3 or googlevis code.
head(df)
## Source: local data frame [6 x 15]
## Groups: Treatment_ID
##
## Hospital_ID Treatment_ID Treatment_Description
## 1 10001 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2 10005 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3 10006 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4 10011 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5 10016 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6 10023 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## Variables not shown: Hospital_Name (fctr), Hospital_Street_Address (fctr),
## Hospital_City (fctr), Hospital_State (fctr), Hospital_ZIP (int), Charges
## (dbl), Corrected_ZIP (chr), Address (chr), Alt_ddress (chr), Latitude
## (dbl), Longitude (dbl), percentile (dbl)
distinct_treatment <- df %>%
distinct(Treatment_ID,Treatment_Description)
distinct_treatment <- distinct_treatment[,c("Treatment_ID","Treatment_Description")]
distinct_treatment
## Source: local data frame [100 x 2]
## Groups: Treatment_ID
##
## Treatment_ID Treatment_Description
## 1 39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2 57 DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC
## 3 64 INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC
## 4 65 INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC
## 5 66 INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC
## 6 69 TRANSIENT ISCHEMIA
## 7 74 CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC
## 8 101 SEIZURES W/O MCC
## 9 149 DYSEQUILIBRIUM
## 10 176 PULMONARY EMBOLISM W/O MCC
## .. ... ...
distinct_treatment <- distinct_treatment[order(distinct_treatment$Treatment_Description),]
Let us save this to a file:
write.csv(distinct_treatment,file="Distinct_treatments.csv",row.names=FALSE)
This data cleaning method has produced 2 .csv files (distinct_treatment.csv and Treatment_modified.csv). We will use d3.js to implement a GUI which helps us to answer the following questions:
Is there any difference in the cost of a treatment charged by various hospitals across the USA?“.
Get the hospitals list which are within a chosen range (% of the cost) of a selected hospital. The user should be able to get the list of hospitals which are within \(\pm\) a specific percentage of the cost of a treatment at a selected hospital.