Introduction

As a part of final project for the Data Visualization course, I would like to provide graphical analysis of the hospital charges data for 30 most common outpatient and 100 most common inpatient services at various hospitals across the USA.

The data is present at the following URL:

https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3

The main aim of this project is to develop a data visualization, which can be used to explore treatments data set.

This document explains the data cleaning and pre-processing steps performed to make the data suitable for visualizations using d3.js. At the end of this process, we will create a series of .csv files, which can be readily analyzed by various graphical tools.

Before you knit this rmd, make sure that you installed the following R packages:

Also when you generate HTML document using this RMD, the process may run for some time, since one of the code blocks given below will fetch the latitude and longitude of all hospital addresses using an API, and this code block runs for some tome (5 minutes approximately).

I downloaded this file, to a location in my computer, and read the file’s contents into a data frame.

#setwd("C:/Users/Sekhar/Documents/R Programs/Data Visualization/Project")

df <- read.csv("Hospital_Charges.csv")
df <- data.frame(df)

Data cleaning

Here are the variable names:

names(df)
##  [1] "DRG.Definition"                      
##  [2] "Provider.Id"                         
##  [3] "Provider.Name"                       
##  [4] "Provider.Street.Address"             
##  [5] "Provider.City"                       
##  [6] "Provider.State"                      
##  [7] "Provider.Zip.Code"                   
##  [8] "Hospital.Referral.Region.Description"
##  [9] "Total.Discharges"                    
## [10] "Average.Covered.Charges"             
## [11] "Average.Total.Payments"              
## [12] "Average.Medicare.Payments"

The variables description is given below:

variable <- c("DRG Definition",
"Provider Id",
"Provider Name",
"Provider Street Address",
"Provider City",
"Provider State",
"Provider Zip Code",
"Hospital Referral Region Description",
 "Total Discharges" ,
 "Average Covered Charges", 
 "Average Total Payments" ,
"Average Medicare Payments"
)

description <- c(
"Treatment name"
,"Hospital ID"
,"Hospital Name"
,"Hospital Address"
,"Hospital City"
,"Hospital State"
,"Hospital ZIP Code"
,"Unknown"
,"Number of discharges"
,"Average Billed to insurance"
,"Average Payments made by the patient"
,"Average Payments made by the Medicare"
  )


df_1 <- data.frame(variable, description)
library(knitr)
kable(df_1)
variable description
DRG Definition Treatment name
Provider Id Hospital ID
Provider Name Hospital Name
Provider Street Address Hospital Address
Provider City Hospital City
Provider State Hospital State
Provider Zip Code Hospital ZIP Code
Hospital Referral Region Description Unknown
Total Discharges Number of discharges
Average Covered Charges Average Billed to insurance
Average Total Payments Average Payments made by the patient
Average Medicare Payments Average Payments made by the Medicare

We will use the variable “Average Covered Charges” to represent the Average amount charged by a hospital for a specific treatment. We will ignore the other average variables, since “Average Covered Charges” correctly reflects the treatment cost.

Finally we will just consider the following variables:

c("DRG Definition",
"Provider Id",
"Provider Name",
"Provider Street Address",
"Provider City",
"Provider State",
"Provider Zip Code",
 "Average Covered Charges" 
)
## [1] "DRG Definition"          "Provider Id"            
## [3] "Provider Name"           "Provider Street Address"
## [5] "Provider City"           "Provider State"         
## [7] "Provider Zip Code"       "Average Covered Charges"

We will rename these variables to the following:

variable <- c("DRG Definition",
"Provider Id",
"Provider Name",
"Provider Street Address",
"Provider City",
"Provider State",
"Provider Zip Code",
 "Average Covered Charges" 
)

renamed_variable <- c("Treatment","Hospital_ID", "Hospital_Name", "Hospital_Street_Address",
                      "Hospital_City","Hospital_State","Hospital_ZIP","Charges")

var_map <- data.frame(variable,renamed_variable)
kable(var_map)
variable renamed_variable
DRG Definition Treatment
Provider Id Hospital_ID
Provider Name Hospital_Name
Provider Street Address Hospital_Street_Address
Provider City Hospital_City
Provider State Hospital_State
Provider Zip Code Hospital_ZIP
Average Covered Charges Charges

Let us modify the data frame to ignore the unwanted variables:

df <- df[,c(1,2,3,4,5,6,7,10)]

Rename the variables:

names(df) <- c("Treatment","Hospital_ID", "Hospital_Name", "Hospital_Street_Address",
                      "Hospital_City","Hospital_State","Hospital_ZIP","Charges")

Let us display a sample rows from the data frame:

head(df)
##                                  Treatment Hospital_ID
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10023
##                      Hospital_Name    Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER     1108 ROSS CLARK CIRCLE
## 2    MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3   ELIZA COFFEE MEMORIAL HOSPITAL         205 MARENGO STREET
## 4                ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5    SHELBY BAPTIST MEDICAL CENTER    1000 FIRST STREET NORTH
## 6     BAPTIST MEDICAL CENTER SOUTH  2105 EAST SOUTH BOULEVARD
##   Hospital_City Hospital_State Hospital_ZIP   Charges
## 1        DOTHAN             AL        36301 $32963.07
## 2          BOAZ             AL        35957 $15131.85
## 3      FLORENCE             AL        35631 $37560.37
## 4    BIRMINGHAM             AL        35235 $13998.28
## 5     ALABASTER             AL        35007 $31633.27
## 6    MONTGOMERY             AL        36116 $16920.79

In the above display , the Treatment variable has the treatment code and description combined. We will separate this into two parts: the “Treatment_ID” and “Treatment_Description”.

l <- strsplit(as.vector(df$Treatment)," - ")
df$Treatment_ID <- do.call(rbind,l)[,1]
df$Treatment_Description <- do.call(rbind,l)[,2]

df <- df[,c(2,9,10,3:8)]
df$Charges <- as.numeric(substring(df$Charges,2))
df$Treatment_ID <- as.numeric(df$Treatment_ID)

head(df)
##   Hospital_ID Treatment_ID              Treatment_Description
## 1       10001           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2       10005           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3       10006           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4       10011           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5       10016           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6       10023           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
##                      Hospital_Name    Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER     1108 ROSS CLARK CIRCLE
## 2    MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3   ELIZA COFFEE MEMORIAL HOSPITAL         205 MARENGO STREET
## 4                ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5    SHELBY BAPTIST MEDICAL CENTER    1000 FIRST STREET NORTH
## 6     BAPTIST MEDICAL CENTER SOUTH  2105 EAST SOUTH BOULEVARD
##   Hospital_City Hospital_State Hospital_ZIP  Charges
## 1        DOTHAN             AL        36301 32963.07
## 2          BOAZ             AL        35957 15131.85
## 3      FLORENCE             AL        35631 37560.37
## 4    BIRMINGHAM             AL        35235 13998.28
## 5     ALABASTER             AL        35007 31633.27
## 6    MONTGOMERY             AL        36116 16920.79

Tackling with ZIP Codes

Some ZIP codes in the data set have just 4 digits (since the ZIP codes beginning with 0 (such as 08536) are represented as numbers, and the leading 0 is dropped). We need to prefix zero to such zip codes:

correct_ZIP <- function(x)
{
   l <- nchar(x)
   if(l < 5)
    {
      return (paste(paste(rep("0",5-l),sep=""),x,sep=""))
    }
    else
      return (x)
}


df$Corrected_ZIP <- sapply(df$Hospital_ZIP,FUN="correct_ZIP")
df <- data.frame(df)

Tackling with addresses

Let us create another variable in the data frame, to contain the complete hospital’s address:

#df$Address <- do.call(paste, c(df[c("Hospital_Name", "Hospital_Street_Address", "Hospital_City","Hospital_State","Corrected_ZIP")], sep = ","))

df$Address <- do.call(paste, c(df[c("Hospital_Street_Address", "Hospital_City","Hospital_State","Corrected_ZIP")], sep = ","))

df$Alt_Address <- do.call(paste, c(df[c("Hospital_Name", "Hospital_City","Hospital_State","Corrected_ZIP")], sep = ","))

library(stringr)

df$Address <- str_replace(df$Address,"&"," ")
df$Alt_Address <- str_replace(df$Alt_Address,"&"," ")

df$Address <- str_replace(df$Address,"#"," ")
df$Alt_Address <- str_replace(df$Alt_Address,"#"," ")


#gsub()
head(df)
##   Hospital_ID Treatment_ID              Treatment_Description
## 1       10001           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2       10005           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3       10006           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4       10011           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5       10016           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6       10023           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
##                      Hospital_Name    Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER     1108 ROSS CLARK CIRCLE
## 2    MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3   ELIZA COFFEE MEMORIAL HOSPITAL         205 MARENGO STREET
## 4                ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5    SHELBY BAPTIST MEDICAL CENTER    1000 FIRST STREET NORTH
## 6     BAPTIST MEDICAL CENTER SOUTH  2105 EAST SOUTH BOULEVARD
##   Hospital_City Hospital_State Hospital_ZIP  Charges Corrected_ZIP
## 1        DOTHAN             AL        36301 32963.07         36301
## 2          BOAZ             AL        35957 15131.85         35957
## 3      FLORENCE             AL        35631 37560.37         35631
## 4    BIRMINGHAM             AL        35235 13998.28         35235
## 5     ALABASTER             AL        35007 31633.27         35007
## 6    MONTGOMERY             AL        36116 16920.79         36116
##                                          Address
## 1         1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2       2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3           205 MARENGO STREET,FLORENCE,AL,35631
## 4 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5     1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6  2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
##                                        Alt_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301
## 2      MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631
## 4            ST VINCENT'S EAST,BIRMINGHAM,AL,35235
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116

Adding latitude and longitude variables

Let us add latitude and longitude variables for each of the hospital’s address to the data frame.

We have duplicate addresses in the data frame (since a hospital is represented many times in the data frame, for different treatments):

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
distinct_address <- df %>%
  distinct(Hospital_ID, Address)
head(distinct_address)
##   Hospital_ID Treatment_ID              Treatment_Description
## 1       10001           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2       10005           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3       10006           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4       10011           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5       10016           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6       10023           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
##                      Hospital_Name    Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER     1108 ROSS CLARK CIRCLE
## 2    MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3   ELIZA COFFEE MEMORIAL HOSPITAL         205 MARENGO STREET
## 4                ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5    SHELBY BAPTIST MEDICAL CENTER    1000 FIRST STREET NORTH
## 6     BAPTIST MEDICAL CENTER SOUTH  2105 EAST SOUTH BOULEVARD
##   Hospital_City Hospital_State Hospital_ZIP  Charges Corrected_ZIP
## 1        DOTHAN             AL        36301 32963.07         36301
## 2          BOAZ             AL        35957 15131.85         35957
## 3      FLORENCE             AL        35631 37560.37         35631
## 4    BIRMINGHAM             AL        35235 13998.28         35235
## 5     ALABASTER             AL        35007 31633.27         35007
## 6    MONTGOMERY             AL        36116 16920.79         36116
##                                          Address
## 1         1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2       2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3           205 MARENGO STREET,FLORENCE,AL,35631
## 4 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5     1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6  2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
##                                        Alt_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301
## 2      MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631
## 4            ST VINCENT'S EAST,BIRMINGHAM,AL,35235
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116
distinct_address <- distinct_address[c("Hospital_ID", "Address","Alt_Address")]

distinct_address[grepl("box",distinct_address$Address,ignore.case=TRUE),]$Address <- 
  distinct_address[grepl("box",distinct_address$Address,ignore.case=TRUE),]$Alt_Address

head(distinct_address)
##   Hospital_ID                                        Address
## 1       10001         1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2       10005       2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3       10006           205 MARENGO STREET,FLORENCE,AL,35631
## 4       10011 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5       10016     1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6       10023  2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
##                                        Alt_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301
## 2      MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631
## 4            ST VINCENT'S EAST,BIRMINGHAM,AL,35235
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116

So the \(distinct\mbox{_}address\) data frame contains the addresses of all the hospitals in the data frame, represented only once.

We will use the http://www.datasciencetoolkit.org API to get the latitude and longitude details of the hospital’s address. The given R code below will run for a while, since we have 3337 distinct addresses. This code uses the function street2coordinates of RDSTK package (to query http://www.datasciencetoolkit.org).

#install.packages("RDSTK")
library(RDSTK)
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## Loading required package: rjson
## Loading required package: RCurl
## Loading required package: bitops
lat <- vector()
lon <- vector()

print("The progress of the job is displayed below:")
## [1] "The progress of the job is displayed below:"
for(i in 1:nrow(distinct_address))
{
  
  r <- tryCatch(street2coordinates(distinct_address$Address[i],session=getCurlHandle()),error=function(e) NULL)
  if(is.null(r))
    {
      r <- tryCatch(street2coordinates(distinct_address$Alt_Address[i],session=getCurlHandle()),error=function(e) NULL)
          if(is.null(r)){
              lat[i] <- 0
               lon[i] <- 0
              }
           else{
               if(nrow(r) == 0){
                 lat[i] <- 0
                 lon[i] <- 0
               }
               else {
                 
               
               lat[i] <- r$latitude
               lon[i] <- r$longitude
                 }
           }
    }
  else{
    if(nrow(r) == 0)
      {
         lat[i] <- 0
         lon[i] <- 0
      }
    else{
  lat[i] <- r$latitude
  lon[i] <- r$longitude
  }
  }
  if(i%%100 == 0)
    print(i)

}
## [1] 100
## [1] 200
## [1] 300
## [1] 400
## [1] 500
## [1] 600
## [1] 700
## [1] 800
## [1] 900
## [1] 1000
## [1] 1100
## [1] 1200
## [1] 1300
## [1] 1400
## [1] 1500
## [1] 1600
## [1] 1700
## [1] 1800
## [1] 1900
## [1] 2000
## [1] 2100
## [1] 2200
## [1] 2300
## [1] 2400
## [1] 2500
## [1] 2600
## [1] 2700
## [1] 2800
## [1] 2900
## [1] 3000
## [1] 3100
## [1] 3200
## [1] 3300
distinct_address$Latitude <- lat
distinct_address$Longitude <- lon

head(distinct_address)
##   Hospital_ID                                        Address
## 1       10001         1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2       10005       2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3       10006           205 MARENGO STREET,FLORENCE,AL,35631
## 4       10011 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5       10016     1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6       10023  2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
##                                        Alt_Address Latitude Longitude
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301 31.21545 -85.36149
## 2      MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957 34.22156 -86.15946
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631 34.79516 -87.68506
## 4            ST VINCENT'S EAST,BIRMINGHAM,AL,35235 33.58676 -86.68096
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007 33.24504 -86.81598
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116 32.32724 -86.27519

As per the logic of the program, if we did not find the latitude and longitude of an address, then we populated the latitude and longitude with zero values. Let us identify how many such rows we have.

nrow(distinct_address[(distinct_address$Longitude == 0 & distinct_address$Latitude == 0),])
## [1] 12

So we have 12 rows, which do not have latitude and longitude. We will ignore such rows when we do geographical analysis using maps.

Let us append the latitude and longitude information to the addresses in the data frame that has all the hospital charges:

df <- inner_join(df,distinct_address,by="Hospital_ID" )
df <- df[,c(1:12,15,16)]
names(df)[c(3,11:12)] <- c("Treatment_Description","Address","Alt_ddress")
head(df)
##   Hospital_ID Treatment_ID              Treatment_Description
## 1       10001           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2       10005           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3       10006           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4       10011           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5       10016           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6       10023           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
##                      Hospital_Name    Hospital_Street_Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER     1108 ROSS CLARK CIRCLE
## 2    MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3   ELIZA COFFEE MEMORIAL HOSPITAL         205 MARENGO STREET
## 4                ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5    SHELBY BAPTIST MEDICAL CENTER    1000 FIRST STREET NORTH
## 6     BAPTIST MEDICAL CENTER SOUTH  2105 EAST SOUTH BOULEVARD
##   Hospital_City Hospital_State Hospital_ZIP  Charges Corrected_ZIP
## 1        DOTHAN             AL        36301 32963.07         36301
## 2          BOAZ             AL        35957 15131.85         35957
## 3      FLORENCE             AL        35631 37560.37         35631
## 4    BIRMINGHAM             AL        35235 13998.28         35235
## 5     ALABASTER             AL        35007 31633.27         35007
## 6    MONTGOMERY             AL        36116 16920.79         36116
##                                          Address
## 1         1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301
## 2       2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957
## 3           205 MARENGO STREET,FLORENCE,AL,35631
## 4 50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,AL,35235
## 5     1000 FIRST STREET NORTH,ALABASTER,AL,35007
## 6  2105 EAST SOUTH BOULEVARD,MONTGOMERY,AL,36116
##                                         Alt_ddress Latitude Longitude
## 1 SOUTHEAST ALABAMA MEDICAL CENTER,DOTHAN,AL,36301 31.21545 -85.36149
## 2      MARSHALL MEDICAL CENTER SOUTH,BOAZ,AL,35957 34.22156 -86.15946
## 3 ELIZA COFFEE MEMORIAL HOSPITAL,FLORENCE,AL,35631 34.79516 -87.68506
## 4            ST VINCENT'S EAST,BIRMINGHAM,AL,35235 33.58676 -86.68096
## 5 SHELBY BAPTIST MEDICAL CENTER,ALABASTER,AL,35007 33.24504 -86.81598
## 6 BAPTIST MEDICAL CENTER SOUTH,MONTGOMERY,AL,36116 32.32724 -86.27519

Let us get the percentile ranking of charges grouped by treatment:

df <- df %>%
  group_by(Treatment_ID) %>%
  mutate(percentile = rank(Charges)/length(Charges))


df_1 <- data.frame(Treatment_ID=df$Treatment_ID,Charges=df$Charges)
df_1 <- df_1 %>%
  group_by(Treatment_ID) %>%
  mutate(rank = rank(Charges)) 

max(df_1$rank/length(df_1$rank)*100)
## [1] 100
typeof(df)
## [1] "list"

Let us save this to a file:

write.csv(df,file="Treatment_modified.csv",row.names=FALSE)

Let us identify how many rows of the main data frame has Latitude and Longitude as 0:

nrow(df[(df$Latitude == 0 & df$Longitude == 0),])
## [1] 517

Hence we have to eliminate 517 rows when we do geographical data analysis using maps. This count of records (with 0 latitude and longitude) is not significant, when compared to the number of records in the actual data set 163065

We will create another file, that has distinct treatment names, and their codes. This file will be used later to populate drop down lists in the d3 or googlevis code.

head(df)
## Source: local data frame [6 x 15]
## Groups: Treatment_ID
## 
##   Hospital_ID Treatment_ID              Treatment_Description
## 1       10001           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2       10005           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 3       10006           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 4       10011           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 5       10016           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## 6       10023           39 EXTRACRANIAL PROCEDURES W/O CC/MCC
## Variables not shown: Hospital_Name (fctr), Hospital_Street_Address (fctr),
##   Hospital_City (fctr), Hospital_State (fctr), Hospital_ZIP (int), Charges
##   (dbl), Corrected_ZIP (chr), Address (chr), Alt_ddress (chr), Latitude
##   (dbl), Longitude (dbl), percentile (dbl)
distinct_treatment <- df %>%
  distinct(Treatment_ID,Treatment_Description)

distinct_treatment <- distinct_treatment[,c("Treatment_ID","Treatment_Description")]
distinct_treatment
## Source: local data frame [100 x 2]
## Groups: Treatment_ID
## 
##    Treatment_ID                                     Treatment_Description
## 1            39                        EXTRACRANIAL PROCEDURES W/O CC/MCC
## 2            57             DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC
## 3            64      INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC
## 4            65       INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC
## 5            66 INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC
## 6            69                                        TRANSIENT ISCHEMIA
## 7            74              CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC
## 8           101                                          SEIZURES W/O MCC
## 9           149                                            DYSEQUILIBRIUM
## 10          176                                PULMONARY EMBOLISM W/O MCC
## ..          ...                                                       ...
distinct_treatment <- distinct_treatment[order(distinct_treatment$Treatment_Description),]

Let us save this to a file:

write.csv(distinct_treatment,file="Distinct_treatments.csv",row.names=FALSE)

Conclusion

This data cleaning method has produced 2 .csv files (distinct_treatment.csv and Treatment_modified.csv). We will use d3.js to implement a GUI which helps us to answer the following questions: