Spring-2017/Data607Project2.Rmd at master · ann2014/Spring-2017 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
---
title: "Data 607 Project 2"
author: "Ann Liu-Ferrara"
date: "March 8, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Project 2

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You're encouraged to use a "wide" structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

Perform the analysis requested in the discussion item.

Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Data Source 1: Hospital Readmissions Reduction Program

Background: In October 2012, CMS began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio, by dividing a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.

```{r echo = TRUE}
library(tidyr)
library(ggplot2)
library(plyr)
library(dplyr)
library(tibble)

# Medicaid inpatient readmission reduce program data
# Source: https://data.medicare.gov/Hospital-Compare/Hospital-Readmissions-Reduction-Program/9n3s-kdb3#SaveAs

dataMed <- read.csv('Hospital_Readmissions_Reduction_Program.csv',
     na.strings = c("Not Available", "NA", "Too Few to Report"),
        header = TRUE, stringsAsFactors = FALSE)
dim(dataMed)
str(dataMed)
head(dataMed, 3)
tail(dataMed, 3)

as_tibble(dataMed)

# create ratio of number of readmission to total of discharges
df <- dataMed %>%
        select(Hospital.Name, Measure.Name, Number.of.Discharges, Number.of.Readmissions, State) %>%
            mutate(ratio.Readmi.dischar = round(Number.of.Readmissions / Number.of.Discharges, 2))

# Readmissions by procedure
ggplot(df, aes(x = Measure.Name, y = Number.of.Readmissions)) +
  geom_boxplot() + coord_flip()

# Ratio of readmission to discharge by procedure
ggplot(df, aes(x = Measure.Name, y = ratio.Readmi.dischar)) +
  geom_boxplot() + coord_flip()

# top 5 hospitals with the highest ratio of readmission to discharge
top_df <- df %>%
        dplyr::group_by(Hospital.Name) %>%
          dplyr::summarise(ratio = round(sum(Number.of.Readmissions, na.rm = TRUE)/sum(Number.of.Discharges, na.rm = TRUE), 2)) %>%
            arrange(desc(ratio))  %>%
              top_n(5)

ggplot(top_df, aes(reorder(Hospital.Name, ratio), ratio)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip()


```

### Conclusion on Data set 1: the analysis aims at finding the top 5 hospitals that have the highest readmission ratio, also visualizing the readmissions by procedure and by readmission rate by procedure.