In your political science class, you have been discussing American politics with your professor and classmates. You are writing an assignment on American presidents and need to do some research. The file uspresidents.csv contains information about forty-five American presidents from George Washington to Donald Trump. The dataset contains information such as the schools they attended, their birth and death dates, the dates they entered and left office, and their party affiliations. In this question you will perform exploratory data analysis on this dataset in R. Give your R code and relevant output for each of the questions below.
Load the dataset in R. Since this question concerns the lengths of the presidents’ terms in office, and their party affiliations, create a new data frame called namesterms that only contains the presidents’ names, the dates when they entered and left office, and their party affiliations. Use head(namesterms) and show the first 6 rows of namesterms.
uspresidents <- read.csv("uspresidents.csv")
# Keep only the name, the dates in office, and the party affiliation
namesterms <- uspresidents[, c("President", "Took.office", "Left.office", "Party")]
head(namesterms)
| # | President | Took.office | Left.office | Party |
|---|---|---|---|---|
| 1 | George Washington | 4/30/1789 | 3/4/1797 | Independent |
| 2 | John Adams | 3/4/1797 | 3/4/1801 | Federalist Party |
| 3 | Thomas Jefferson | 3/4/1801 | 3/4/1809 | Democratic-Republican Party |
| 4 | James Madison | 3/4/1809 | 3/4/1817 | Democratic-Republican Party |
| 5 | James Monroe | 3/4/1817 | 3/4/1825 | Democratic-Republican Party |
| 6 | John Quincy Adams | 3/4/1825 | 3/4/1829 | Democratic-Republican Party |
Part of your assignment will concern the lengths of the presidents’ time in office, in days. Using namesterms, calculate the length of each president’s time in office, and then create an appropriate visual summary of the distribution of the lengths of the presidents’ stay in office.
# Parse the date columns (stored as month/day/year text) into Date objects
dateFormat <- "%m/%d/%Y"
namesterms$Took.office <- as.Date(namesterms$Took.office, format = dateFormat)
namesterms$Left.office <- as.Date(namesterms$Left.office, format = dateFormat)
# Subtracting two Dates gives the time in office in days
namesterms$Duration.in.office <- as.numeric(namesterms$Left.office - namesterms$Took.office)
hist(
  namesterms$Duration.in.office,
  main = "Duration in Office",
  xlab = "Days", ylab = "Frequency"
)
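As a sanity check on the date arithmetic, the following sketch recomputes the durations for the six presidents shown in the head(namesterms) output, with the dates typed in by hand rather than read from uspresidents.csv. The variable names took, left, and days_in_office are illustrative and do not appear in the assignment.

```r
# Dates copied by hand from the first six rows of the output above
took <- c("4/30/1789", "3/4/1797", "3/4/1801", "3/4/1809", "3/4/1817", "3/4/1825")
left <- c("3/4/1797", "3/4/1801", "3/4/1809", "3/4/1817", "3/4/1825", "3/4/1829")

date_format <- "%m/%d/%Y"

# Date - Date yields a difftime in days; as.numeric() drops the units
days_in_office <- as.numeric(
  as.Date(left, format = date_format) - as.Date(took, format = date_format)
)

print(days_in_office)
# 2865 1460 2922 2922 2922 1461
```

A full four-year term is 1,461 days when it contains a leap day and two terms are 2,922 days, which is where the two peaks in the histogram sit. John Adams comes out at 1,460 days because 1800 was not a leap year.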
Now that you have a plot, you ought to describe the distribution in the body of your assignment. As part of your report, describe four characteristics of the histogram, or the data used to create the histogram.
The histogram shows two clear peaks, so the distribution is bimodal. The first peak, near 1,500 days, corresponds to presidents who served a single four-year term; the second, at about double that, corresponds to presidents who completed a second term. The first peak is taller, which suggests that more presidents served one term than two.
There is also a single outlier at the far right of the plot: Franklin D. Roosevelt (FDR), who was elected to more than two terms. After his presidency, the Constitution was amended to limit presidents to two terms. Finally, the distribution has a tail on the left, because a few presidents served very short terms.
As another part of your assignment, you also want to compare the term lengths for Democratic and Republican presidents. Create an appropriate plot for this comparison, and summarize your plot in a couple of sentences.
# Use the same 100-day bins for both parties so the counts are comparable
common_breaks <- seq(0, max(namesterms$Duration.in.office, na.rm = TRUE) + 100, 100)
# Bin each party's durations without drawing the histograms
republican <- hist(
  namesterms$Duration.in.office[namesterms$Party == "Republican Party"],
  breaks = common_breaks,
  plot = FALSE
)
democrat <- hist(
  namesterms$Duration.in.office[namesterms$Party == "Democratic Party"],
  breaks = common_breaks,
  plot = FALSE
)
# One row per party; barplot() stacks the rows within each bin
counts_matrix <- rbind(republican$counts, democrat$counts)
barplot(
  counts_matrix,
  col = c(rgb(1, 0, 0), rgb(0, 0, 1)),  # red = Republican, blue = Democrat
  main = "Duration in Office by Party",
  xlab = "Days",
  ylab = "Frequency",
  legend.text = c("Republican", "Democrat"),
  args.legend = list(x = "topright"),
  names.arg = common_breaks[-length(common_breaks)]
)
Both Republicans and Democrats cluster around two points (one full term and two full terms). The outlier on the far right, near three terms, is a Democratic president. Democrats are more heavily represented in the two main peaks, while Republican durations more often fall between the full-term marks, meaning those presidents left office partway through a term.
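A numeric summary by party can back up a comparison like this one. The sketch below uses a small toy data frame whose values are invented for illustration and are not rows from uspresidents.csv; only the column names Party and Duration.in.office and the two party labels follow the code above.

```r
# Toy values for illustration only; NOT taken from uspresidents.csv
toy <- data.frame(
  Party = c("Republican Party", "Republican Party", "Republican Party",
            "Democratic Party", "Democratic Party", "Democratic Party"),
  Duration.in.office = c(1461, 2922, 900, 1461, 2922, 2922)
)

# Median days in office within each party
party_medians <- tapply(toy$Duration.in.office, toy$Party, median)

# Number of presidents in each party, useful context for the medians
party_counts <- table(toy$Party)

print(party_medians)
print(party_counts)
```

On the real data, the same two calls could be applied to the rows of namesterms whose Party is one of the two labels. A side-by-side boxplot, for example boxplot(Duration.in.office ~ Party, data = toy), is another reasonable plot for this kind of comparison.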
Consider the random variable
Derive following. Use first two moments (
Derive the cumulative distribution function (CDF) of
Now suppose we have
Using part (b), and the independence of
Find the PDF of
A standard deck of cards has four suits (hearts, diamonds, clubs, spades), each containing 13 cards: Ace, 2, 3, ... , 10, Jack, Queen, King. Hearts and diamonds are red; spades and clubs are black.
The binomial distribution has a pmf: $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$ for $x = 0, 1, \dots, n$,
and has an expected value $E(X) = np$.
Suppose we randomly choose one card from each suit (so 4 cards in total). Let
What is the probability
Give the pmf of
Suppose you win the face value of the card when the card is 2, 3, ... , 10 (value equals the number on the card), you win nothing when the card is a face card (Jack, Queen, King), you win 20 if the card is a black Ace, and you win 30 if the card is a red Ace. Let
Give the pmf of
Consider the transit schedules for buses 99 and R4. Suppose the number of 99 buses per hour is normally distributed with mean 12 and variance 2, and the number of R4 buses per hour is normally distributed with mean 10 and variance 4. It is also known that the two bus counts are independent of each other. Suppose we observe the number of buses for each of the 99 and R4 during the peak morning rush of 8-9am.
For each part, define the appropriate random variable(s) and explicitly write out their distributions. Specify any assumptions used and where each is used in your work.
We observe the 1-hour block (8-9am) every weekday (M-F) of 1 week. What is the probability that we see at most 60 R4 buses?
For each hour we observe, we look at the difference between the two buses. What is the probability that for each hour the difference is within 3?
What is the expected number of days in 1 week (7 days) that we observe the difference is within 3 for a particular hour (say, 8-9am) in each day?
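One way to set up these three parts in R is sketched below. It rests on assumptions that go beyond the problem statement: counts on different days are taken to be independent and identically distributed, "at most 60" is read as the total over the five weekday hours, and "within 3" is read as an absolute difference of at most 3 in a single hour. All variable names are illustrative.

```r
# Given in the problem: hourly counts are normal and independent of each other
mu_99 <- 12; var_99 <- 2   # bus 99
mu_r4 <- 10; var_r4 <- 4   # bus R4

# (a) Total R4 count over 5 weekday hours.
#     Assumption: days are independent, so the sum is N(5 * 10, 5 * 4).
p_at_most_60 <- pnorm(60, mean = 5 * mu_r4, sd = sqrt(5 * var_r4))

# (b) Difference in one hour: D = X99 - XR4 ~ N(12 - 10, 2 + 4),
#     using independence of the two buses to add the variances.
mu_d <- mu_99 - mu_r4
sd_d <- sqrt(var_99 + var_r4)
p_within_3 <- pnorm(3, mean = mu_d, sd = sd_d) - pnorm(-3, mean = mu_d, sd = sd_d)

# (c) Number of days (out of 7) with the difference within 3.
#     Assumption: days are independent, so the count is Binomial(7, p_within_3)
#     and its expected value is 7 * p_within_3.
expected_days <- 7 * p_within_3

print(round(c(a = p_at_most_60, b = p_within_3, c = expected_days), 4))
```

Under these assumptions the values come out to roughly 0.99 for part (a), 0.64 for part (b), and about 4.5 days for part (c). A different reading of "within 3" or of "at most 60" would change the setup, so the written answer should state which interpretation is used.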
A laboratory operates a machine that must be inspected every morning.
Once a failure occurs, the machine is sent for repair. The repair time (in days) follows an Exponential distribution with rate parameter
Define:
What is the distribution of
Find the mean and variance of
Compute