Eroxl's Notes
Written Assignment 1 (STAT 254)

Question 1

In your political science class, you have been discussing American politics with your professor and classmates. You are writing an assignment on American presidents and need to do some research. The file uspresidents.csv contains information about forty-five American presidents from George Washington to Donald Trump. The dataset contains information such as the schools they attended, their birth and death dates, the dates they entered and left office, and their party affiliations. In this question you will perform exploratory data analysis on this dataset in R. Give your R code & relevant output to each of the questions below.

(a)

Load the dataset in R. Since this question concerns the lengths of the presidents’ terms in office, and their party affiliations, create a new data frame called namesterms that only contains the presidents’ names, the dates when they entered and left office, and their party affiliations. Use head(namesterms) and show the first 6 rows of namesterms.

namesterms <- read.csv("uspresidents.csv")

head(namesterms)
  President         Colleges                              Birth City   Birth State
1 George Washington The College of William and Mary       Mount Vernon VA
2 John Adams        Harvard University                    Quincy       MA
3 Thomas Jefferson  The College of William and Mary       Shadwell     VA
4 James Madison     Princeton University                  Port Conway  VA
5 James Monroe      The College of William and Mary       Monroe Hall  VA
6 John Quincy Adams Harvard University, Leiden University Quincy       MA
  Birth Date Death City      Death State Death Date
1 2/22/1732  Mount Vernon    VA          12/14/1799
2 10/30/1735 Quincy          MA          7/4/1826
3 4/13/1743  Charlottesville VA          7/4/1826
4 3/16/1751  Orange          VA          6/28/1836
5 4/28/1758  New York        NY          7/4/1831
6 7/11/1767  Washington      DC          2/23/1848
  Took Office Left Office Party
1 4/30/1789   3/4/1797    Independent
2 3/4/1797    3/4/1801    Federalist Party
3 3/4/1801    3/4/1809    Democratic-Republican Party
4 3/4/1809    3/4/1817    Democratic-Republican Party
5 3/4/1817    3/4/1825    Democratic-Republican Party
6 3/4/1825    3/4/1829    Democratic-Republican Party
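
The output above still contains every column, while the question asks for only the names, office dates, and party. A minimal sketch of the subsetting step follows. It uses a two-row stand-in data frame copied from the head() output instead of reading the file, and it assumes the column names President, Took.office, Left.office, and Party (the latter three follow the code used later in these notes; the real names depend on the CSV header).

```r
# Stand-in for read.csv("uspresidents.csv"): two rows copied from the head() output.
# Column names are assumptions and may differ from the actual file.
presidents <- data.frame(
  President = c("George Washington", "John Adams"),
  Colleges = c("The College of William and Mary", "Harvard University"),
  Took.office = c("4/30/1789", "3/4/1797"),
  Left.office = c("3/4/1797", "3/4/1801"),
  Party = c("Independent", "Federalist Party")
)

# Keep only the columns the question asks for.
keep <- c("President", "Took.office", "Left.office", "Party")
namesterms <- presidents[, keep]

head(namesterms)
```

With the real file, the same `keep` vector would be applied to the data frame returned by read.csv.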

(b)

Part of your assignment will concern the lengths of the presidents’ time in office, in days. Using namesterms, calculate the length of each president’s time in office, and then create an appropriate visual summary of the distribution of the lengths of the presidents’ stay in office.

# Dates in the file are written month/day/year, e.g. 4/30/1789.
dateFormat <- "%m/%d/%Y"

# Convert the office dates from text to Date objects.
namesterms$Took.office <- as.Date(namesterms$Took.office, format = dateFormat)
namesterms$Left.office <- as.Date(namesterms$Left.office, format = dateFormat)

# Subtracting two Dates gives the difference in days.
namesterms$Duration.in.office <- as.numeric(namesterms$Left.office - namesterms$Took.office)

hist(
	namesterms$Duration.in.office,
	main = "Duration in Office",
	xlab = "Days", ylab = "Frequency"
)

Question 1 Output 1 - Written Assignment 1 (STAT 254).png

(c)

Now that you have a plot, you ought to describe the distribution in the body of your assignment. As part of your report, describe four characteristics of the histogram, or the data used to create the histogram.

The histogram shows two clear peaks, suggesting the distribution is bimodal. The first peak, near 1,500 days, corresponds to the presidents who served only one term. The second peak, at about double that, is likely the group of presidents who served a full second term. The first peak is taller, suggesting that more presidents served just one term than two.
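
As a rough check on where the two peaks should fall, the length of one and two full terms can be computed from dates that appear in the head() output (John Adams served one term, Thomas Jefferson two). This is only a sketch; the variable names are illustrative.

```r
date_format <- "%m/%d/%Y"

# One full term: John Adams, 3/4/1797 to 3/4/1801.
one_term <- as.numeric(
  as.Date("3/4/1801", format = date_format) - as.Date("3/4/1797", format = date_format)
)

# Two full terms: Thomas Jefferson, 3/4/1801 to 3/4/1809.
two_terms <- as.numeric(
  as.Date("3/4/1809", format = date_format) - as.Date("3/4/1801", format = date_format)
)

print(c(one_term = one_term, two_terms = two_terms))
```

Both values sit close to the peak locations described above, with two terms almost exactly double one term.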

There is also a single outlier at the far right of the histogram: Franklin D. Roosevelt (FDR). After his presidency, the Constitution was amended (the Twenty-second Amendment) so that no one could be elected president more than twice. The distribution also appears skewed to the left, with a tail made up of the few presidents who served very short terms.

(d)

As another part of your assignment, you also want to compare the term lengths for Democratic and Republican presidents. Create an appropriate plot for this comparison, and summarize your plot in a couple of sentences.

# Shared 100-day bins so both parties are counted on the same scale.
common_breaks <- seq(0, max(namesterms$Duration.in.office, na.rm = TRUE) + 100, 100)

# Bin each party's durations without drawing the histograms.
republican <- hist(
	namesterms$Duration.in.office[namesterms$Party == "Republican Party"],
	breaks = common_breaks,
	plot = FALSE
)

democrat <- hist(
	namesterms$Duration.in.office[namesterms$Party == "Democratic Party"],
	breaks = common_breaks,
	plot = FALSE
)

# One row of bin counts per party.
counts_matrix <- rbind(republican$counts, democrat$counts)

# Stacked bars (red = Republican, blue = Democrat), labelled by each bin's lower edge.
barplot(
	counts_matrix,
	col = c(rgb(1, 0, 0), rgb(0, 0, 1)),
	main = "Duration in Office by Party",
	xlab = "Days",
	ylab = "Frequency",
	legend.text = c("Republican", "Democrat"),
	args.legend = list(x = "topright"),
	names.arg = common_breaks[-length(common_breaks)]
)

Question 1 Output 2 - Written Assignment 1 (STAT 254).png

Both Republicans and Democrats cluster around two points (one full term and two full terms). The outlier on the far right, near three terms, was a Democratic president. Democrats are more represented within the two main peaks, while Republicans more often have durations that fall between full terms, meaning they left office partway through a term.

Question 2

Consider the random variable uniformly distributed on the interval . That is, and has probability density function:

(a)

Derive the following. Find the first two moments ( and ) and use them to find the variance.

(i)

Find

(ii)

Find

(iii)

Find
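
The interval and symbols did not survive in the question text above, so the following is only a sketch under an assumed setup: $X \sim \text{Uniform}(a, b)$ with density $f(x) = 1/(b-a)$ on $[a, b]$. The names $X$, $a$, and $b$ are assumptions; if the assignment uses a specific interval, substitute its endpoints.

```latex
\begin{align*}
E[X] &= \int_a^b \frac{x}{b-a}\,dx = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2} \\
E[X^2] &= \int_a^b \frac{x^2}{b-a}\,dx = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3} \\
\mathrm{Var}(X) &= E[X^2] - \left(E[X]\right)^2
  = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4}
  = \frac{(b-a)^2}{12}
\end{align*}
```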

(b)

Derive the cumulative distribution function (CDF) of .
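
A sketch of the CDF, again assuming $X \sim \text{Uniform}(a, b)$ because the interval is missing from the text above ($X$, $a$, and $b$ are assumed names). The CDF is the integral of the density up to $x$:

```latex
F(x) = P(X \le x) = \int_a^x \frac{1}{b-a}\,dt =
\begin{cases}
0, & x < a \\[4pt]
\dfrac{x-a}{b-a}, & a \le x \le b \\[4pt]
1, & x > b
\end{cases}
```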

(c)

Now suppose we have independent and identically distributed uniform random variables on . That is .

(i)

Using part (b), and the independence of , find the CDF of the random variable .

(ii)

Find the PDF of
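
The random variable in part (c) is missing from the text above. A common version of this question asks about the maximum, so this sketch assumes $Y = \max(X_1, \ldots, X_n)$ with $X_i \sim \text{Uniform}(a, b)$ independent; the symbols $Y$, $a$, $b$ and the choice of the maximum are all assumptions. Independence is what lets the joint probability factor into a product.

```latex
\begin{align*}
F_Y(y) &= P(X_1 \le y, \ldots, X_n \le y)
        = \prod_{i=1}^{n} P(X_i \le y)
        = \left(\frac{y-a}{b-a}\right)^{n}, \quad a \le y \le b \\
f_Y(y) &= \frac{d}{dy} F_Y(y)
        = \frac{n\,(y-a)^{n-1}}{(b-a)^{n}}, \quad a \le y \le b
\end{align*}
```

If the question instead concerns the minimum, the same argument applies to $P(Y > y)$, giving $F_Y(y) = 1 - \left(1 - \frac{y-a}{b-a}\right)^n$.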

Question 3

A standard deck of cards has four suits (hearts, diamonds, clubs, spades), each containing 13 cards: Ace, 2, 3, ... , 10, Jack, Queen, King. Hearts and diamonds are red; spades and clubs are black.

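The binomial distribution with parameters $n$ and $p$ has a pmf:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

and has an expected value $E[X] = np$ and a variance $\mathrm{Var}(X) = np(1-p)$.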

(a)

Suppose we randomly choose one card from each suit (so 4 cards in total). Let be the number of queens or kings among the 4 cards.

(i)

What is the probability of success on each trial?

(ii)

Give the pmf of .

(iii)

Compute and .
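
A minimal sketch for part (a), assuming the count (called X here, since the symbol is missing above) is binomial and that the two quantities to compute are its mean and variance. Each suit contains one queen and one king among its 13 cards, so the success probability per draw is 2/13, and the four draws come from different suits, so they are independent.

```r
n <- 4        # one draw from each of the four suits
p <- 2 / 13   # 2 of the 13 cards in a suit are a queen or king

# pmf of X ~ Binomial(4, 2/13) over its support 0..4
k <- 0:n
pmf <- dbinom(k, size = n, prob = p)
names(pmf) <- k

# Mean and variance from the binomial formulas np and np(1 - p)
mean_x <- n * p
var_x <- n * p * (1 - p)

print(round(pmf, 4))
print(c(mean = mean_x, variance = var_x))
```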

(b)

Suppose you win the face value of the card when the card is 2, 3, ... , 10 (value equals the number on the card), you win nothing when the card is a face card (Jack, Queen, King), you win 20 if the card is a black Ace, and you win 30 if the card is a red Ace. Let winnings from one randomly chosen card (from a full deck).

(i)

Give the pmf of .

(ii)

Compute .
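
A sketch of the winnings pmf built by counting cards in a 52-card deck. The quantity to compute is missing from the text above; this assumes it is the expected winnings. Variable names are illustrative.

```r
# Possible winnings: 0 (face cards), 2..10 (number cards), 20 (black ace), 30 (red ace)
values <- c(0, 2:10, 20, 30)

# Number of cards producing each value:
# 12 face cards, 4 cards for each of 2..10, 2 black aces, 2 red aces
card_counts <- c(12, rep(4, 9), 2, 2)

pmf <- card_counts / 52
names(pmf) <- values

# Expected winnings: sum of value times probability
expected_winnings <- sum(values * pmf)

print(round(pmf, 4))
print(expected_winnings)
```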

Question 4

Consider the transit schedules for buses 99 and R4. Suppose the number of 99 buses per hour is normally distributed with mean 12 and variance 2, and the number of R4 buses per hour is normally distributed with mean 10 and variance 4. It is also known that the two buses are independent of each other. Suppose we observe the number of 99 and R4 buses during the peak morning rush of 8-9am.

For each part, define the appropriate random variable(s) and explicitly write out their distributions. Specify any assumptions used and where it is used in your work.

(a)

We observe the 1-hour block (8-9am) every weekday (M-F) of 1 week. What is the probability that we see at most 60 R4 buses?
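
A sketch for part (a), assuming the five daily R4 counts are independent and identically distributed N(10, 4). Under that assumption their sum is normal with mean 5 × 10 = 50 and variance 5 × 4 = 20. Variable names are illustrative.

```r
mu_r4 <- 10    # mean R4 buses per hour
var_r4 <- 4    # variance of R4 buses per hour
days <- 5      # Monday to Friday

# Sum of independent normals: means add and variances add.
total_mean <- days * mu_r4
total_sd <- sqrt(days * var_r4)

# P(total over the week <= 60)
p_at_most_60 <- pnorm(60, mean = total_mean, sd = total_sd)
print(p_at_most_60)
```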

(b)

For each hour we observe, we look at the difference between the two buses. What is the probability that for each hour the difference is within 3?
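
A sketch for part (b), reading "within 3" as the difference D = (99 count) − (R4 count) lying between −3 and 3 in a single hour. Because the two counts are independent normals, D is normal with mean 12 − 10 = 2 and variance 2 + 4 = 6. The name D is an assumption.

```r
diff_mean <- 12 - 10       # means subtract
diff_sd <- sqrt(2 + 4)     # variances add for independent variables

# P(-3 <= D <= 3)
p_within_3 <- pnorm(3, mean = diff_mean, sd = diff_sd) -
  pnorm(-3, mean = diff_mean, sd = diff_sd)
print(p_within_3)
```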

(c)

What is the expected number of days in 1 week (7 days) that we observe the difference is within 3 for a particular hour (say, 8-9am) in each day?
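
A sketch for part (c), assuming the seven days are independent and each has the same probability p of the difference being within 3. The number of such days is then Binomial(7, p), with expected value 7p. The single-hour probability is recomputed here so the block runs on its own.

```r
# Single-hour probability that the difference D ~ N(2, 6) lies in [-3, 3]
diff_mean <- 12 - 10
diff_sd <- sqrt(2 + 4)
p_within_3 <- pnorm(3, mean = diff_mean, sd = diff_sd) -
  pnorm(-3, mean = diff_mean, sd = diff_sd)

# Number of qualifying days in a week ~ Binomial(7, p); its mean is 7 * p.
days <- 7
expected_days <- days * p_within_3
print(expected_days)
```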

Question 5

A laboratory operates a machine that must be inspected every morning.

  • Each morning, if the machine experiences its first failure, it immediately stops running for the day.
  • If no failure occurs, it continues operating for the entire day.
  • The probability of failure on any given morning is 0.02, independent for each day.

Once a failure occurs, the machine is sent for repair. The repair time (in days) follows an Exponential distribution with rate parameter , independent of the failure process.
Define:

  • : The number of consecutive full days the machine operates normally before the first failure (counting the day of failure).
  • : The repair time in days
  • : The total time (in days) from today until the machine is fully operational again.

(a)

What is the distribution of ? What are the mean and variance of ?
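
The variable in part (a) is missing from the text above. This sketch assumes it refers to the day count up to and including the first failure, written $N$ here (an assumed name). With an independent failure probability of $p = 0.02$ each morning, and the failure day counted, $N$ is geometric on $\{1, 2, \ldots\}$:

```latex
\begin{align*}
P(N = k) &= (1-p)^{k-1}\,p = (0.98)^{k-1}(0.02), \quad k = 1, 2, \ldots \\
E[N] &= \frac{1}{p} = \frac{1}{0.02} = 50 \\
\mathrm{Var}(N) &= \frac{1-p}{p^2} = \frac{0.98}{0.0004} = 2450
\end{align*}
```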

(b)

Find the mean and variance of .
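
The variable in part (b) is also missing. This sketch assumes it is the total time $T = N + R$, where $N$ is the geometric day count with $p = 0.02$ and $R$ is the exponential repair time. The symbols $T$, $N$, $R$ are assumed names, and the rate $\lambda$ is left symbolic because its value does not appear in the text above. Independence of $N$ and $R$ is what allows the variances to add.

```latex
\begin{align*}
E[T] &= E[N] + E[R] = \frac{1}{p} + \frac{1}{\lambda} = 50 + \frac{1}{\lambda} \\
\mathrm{Var}(T) &= \mathrm{Var}(N) + \mathrm{Var}(R)
  = \frac{1-p}{p^2} + \frac{1}{\lambda^2}
  = 2450 + \frac{1}{\lambda^2}
\end{align*}
```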

(c)

Compute , given that .