Why do scatter plots of summary statistics




















When you look at a scatter plot, you want to notice the overall pattern and any deviations from that pattern. How closely the points follow the pattern indicates how strong the relationship is, although for a linear relationship there is an exception. The following scatter plot examples illustrate these concepts. In this chapter, we are interested in scatter plots that show a linear pattern.

Linear patterns are quite common. The linear relationship is strong if the points are close to a straight line, except in the case of a horizontal line where there is no relationship. If we think that the points show a linear relationship, we would like to draw a line on the scatter plot. This line can be calculated through a process called linear regression. However, we only calculate a regression line if one of the variables helps to explain or predict the other variable. If x is the independent variable and y the dependent variable, then we can use a regression line to predict y for a given value of x.
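As a minimal sketch of fitting a regression line and using it for prediction, the R code below uses a small made-up data set. The vectors x and y and the value 4.5 are hypothetical and only serve as an illustration.

```r
# Hypothetical example data: x is the independent variable, y the dependent one.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Scatter plot first, to check that a straight line is a reasonable description.
plot(x, y, main = "Scatter plot with fitted regression line")

# Fit the least-squares regression line y = a + b * x.
fit <- lm(y ~ x)
abline(fit, col = "red")

# Predict y for a given value of x (here x = 4.5, inside the observed range).
new_point <- data.frame(x = 4.5)
y_hat <- predict(fit, newdata = new_point)
print(coef(fit))
print(y_hat)
```

Predictions are most trustworthy for x values inside the range that was actually observed.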

Scatter plots are particularly helpful graphs when we want to see if there is a linear relationship among data points. They indicate both the direction of the relationship between the x variables and the y variables, and the strength of the relationship. We calculate the strength of the relationship between an independent variable and a dependent variable using linear regression.

Construct a scatter plot of the data. The following table shows the poverty rates and cell phone usage in the United States.

Does the higher cost of tuition translate into higher-paying jobs? The table lists the top ten colleges based on mid-career salary and the associated yearly tuition costs.

In both examples, a boxplot of the variables shows an outlier. First plot: the x-axis variable is in fact a constant, i.

Find out why the x variable is a constant. Second plot: we missed that both variables are in fact categorical, so neither a scatter plot nor a regression is the appropriate tool to study the relationship.

First plot: there are very clearly two different groups, with an obvious linear relationship in each. You will need to perform a regression for each group.
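One possible way to run a separate regression for each group is sketched below. The data frame d and its group labels "A" and "B" are hypothetical; in practice the grouping variable has to be identified from the data first.

```r
# Hypothetical data: two groups, each with its own linear relationship.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10, 20, 19, 18, 17, 16),
  group = rep(c("A", "B"), each = 5)
)

# Colour points by group so the two clouds are visible in the scatter plot.
plot(d$x, d$y, col = ifelse(d$group == "A", "blue", "red"), pch = 19)

# Split the data by group and fit one regression line per group.
fits <- lapply(split(d, d$group), function(g) lm(y ~ x, data = g))

# Extract intercept and slope for each group (one column per group).
coefs <- sapply(fits, coef)
print(coefs)
```

A single regression on the pooled data would average over two opposite trends here and describe neither group well.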

When you move from lower to higher values in X, there is systematically more and more variation. Although there is a linear trend in the first plot, a regression will run into trouble.
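The growing spread can be made visible with a residual plot. The sketch below simulates data whose noise increases with x; the simulated values, the seed, and the split at x = 50 are assumptions chosen only for illustration.

```r
set.seed(1)  # fixed seed so the simulated example is reproducible
x <- 1:100

# Noise whose standard deviation grows in proportion to x.
y <- 2 * x + rnorm(100, mean = 0, sd = 0.3 * x)

fit <- lm(y ~ x)
res <- resid(fit)

# Residuals against fitted values: a fan shape signals non-constant variance.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Compare the residual spread in the lower and upper half of the x range.
sd_low <- sd(res[x <= 50])
sd_high <- sd(res[x > 50])
print(c(low = sd_low, high = sd_high))
```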

There are a few outliers, and the main cloud is so small that you cannot see what the relationship might be. Do not be misled: the few observations that seem to indicate a positive linear trend do not necessarily reflect the general trend in the cloud of points, which is based on all observations.

It is also possible to obtain other quantiles; this is done by adding an argument containing the desired percentage cut points. To get the deciles, use the sequence function to build the cut points.
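A minimal sketch of the decile calculation follows. It assumes the Ozone column of R's built-in airquality data set as the example variable, and it passes na.rm = TRUE because that column contains missing values.

```r
# Ozone column of the built-in airquality data set; it contains missing values.
ozone <- airquality$Ozone

# Default call: minimum, quartiles and maximum.
# na.rm = TRUE drops the missing values, which quantile() otherwise rejects.
quartiles <- quantile(ozone, na.rm = TRUE)
print(quartiles)

# Deciles: seq() builds the cut points 0, 0.1, ..., 1 for the probs argument.
deciles <- quantile(ozone, probs = seq(0, 1, by = 0.1), na.rm = TRUE)
print(deciles)
```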

How would you use this method to get quintiles?

We can also get summary statistics for multiple columns at once, using the apply command. On data with missing values, the call stops with a message beginning "Error in FUN newX[, i],". We get an error because the data contain missing observations!

R will not skip missing values unless explicitly requested to do so. You can give the na. There is also a summary function that gives a number of summaries on a numeric variable, or even on the whole data frame!
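The sketch below assumes that the truncated argument name above refers to R's standard na.rm argument, and that the data are the built-in airquality data set. It shows the failing call, the same call with missing values skipped, and summary.

```r
# Without na.rm, quantile() stops as soon as a column contains an NA.
result <- try(apply(airquality, 2, quantile), silent = TRUE)
failed <- inherits(result, "try-error")
print(failed)

# Extra arguments to apply() are passed on to the function, so na.rm = TRUE
# makes quantile() skip the missing observations in every column.
col_quantiles <- apply(airquality, 2, quantile, na.rm = TRUE)
print(col_quantiles)

# summary() handles missing values itself and reports them as NA's.
print(summary(airquality))
```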

Notice that "Month" and "Day" are coded as numeric variables even though they are clearly categorical. This can be mended as follows, e.

Find the standard deviations (SDs) of all the numeric variables in the air quality data set, using the apply function.
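Returning to the coding of "Month" and "Day": one common way to mark such columns as categorical is to convert them to factors. This is a sketch of that approach and not necessarily the code originally shown; the copy named aq is a hypothetical helper.

```r
# Work on a copy so the built-in data set is left untouched.
aq <- airquality

# Recode the calendar columns as categorical variables (factors).
aq$Month <- factor(aq$Month)
aq$Day <- factor(aq$Day)

# summary() now reports counts per level instead of quartiles for these columns.
print(summary(aq$Month))
str(aq)
```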

The simplest display for the shape of a distribution of data is a histogram: a count of how many observations fall within specified divisions ("bins") of the x-axis. A sensible number of classes (bins) is usually chosen by R, but a recommendation can be given with the nclass (number of classes) or breaks argument. By choosing breaks as a vector rather than a number, you can have full control over the interval divisions. There are a lot of options to spruce this up. Here is code for a much nicer histogram.
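The original listing is not reproduced here, so the following is only a sketch of what such a histogram might look like. It assumes the Ozone column of the built-in airquality data; the colour, title, and break points are arbitrary choices.

```r
ozone <- airquality$Ozone

# Default histogram: R picks the bins (missing values are dropped).
hist(ozone)

# Suggest a number of classes; R treats this as a recommendation only.
hist(ozone, nclass = 20)

# Full control: a vector of break points, which must span the range of the data.
h <- hist(ozone,
          breaks = seq(0, 180, by = 20),
          col = "lightblue",
          main = "Ozone measurements",
          xlab = "Ozone (ppb)")
print(h$counts)
```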

If we want to fit a normal curve over the data, instead of the command density we can use dnorm and curve. If you type help(hist) into the command line, it shows all the possible parameters you can add to a standard histogram.
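A sketch of the normal-curve overlay, again assuming the Ozone column of airquality as the example variable. The normal curve uses the sample mean and standard deviation, and a kernel density estimate is added for comparison.

```r
# Keep only the observed (non-missing) Ozone values.
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# freq = FALSE puts the histogram on the density scale, so a density curve
# can be drawn on the same axes.
hist(ozone, freq = FALSE, main = "Ozone with fitted normal curve")

# Normal density with the sample mean and standard deviation.
m <- mean(ozone)
s <- sd(ozone)
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)

# For comparison: a kernel density estimate of the same data.
lines(density(ozone), col = "blue", lty = 2)
```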

There are a lot of options. To see whether data can be assumed normally distributed, it is often useful to create a qq-plot. In a qq-plot, we plot the kth smallest observation against the expected value of the kth smallest observation out of n in a standard normal distribution. We expect to obtain a straight line if the data come from a normal distribution with any mean and standard deviation. The observed empirical quantiles are drawn along the vertical axis, while the theoretical quantiles are along the horizontal axis.

With this convention, the data look normal if the points follow a straight diagonal line; curves towards the ends indicate a heavy tail. This will come in handy when we move on to linear regression. After the plot has been generated, use the function qqline to add a line going through the first and third quartiles. This line can be used to judge how well the points in the QQ-plot follow a straight line.

Use a histogram and qq-plot to determine whether the Ozone measurements in the air quality data can be considered normally distributed.
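A minimal sketch of the QQ-plot steps described above. To leave the Ozone exercise open, it uses the Wind column of the built-in airquality data, which is an arbitrary choice made for this example.

```r
wind <- airquality$Wind

# Sample quantiles on the vertical axis, theoretical normal quantiles on the
# horizontal axis. qqnorm() returns the plotted coordinates invisibly.
qq <- qqnorm(wind, main = "Normal QQ-plot of Wind")

# Reference line through the first and third quartiles.
qqline(wind, col = "red")

# Correlation between theoretical and sample quantiles as a rough numeric check.
print(cor(qq$x, qq$y))
```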

A "boxplot", or "box-and-whiskers plot", is a graphical summary of a distribution; the box in the middle indicates the "hinges" (close to the first and third quartiles) and the median. The lines ("whiskers") show the largest or smallest observation that falls within a distance of 1. If any observations fall farther away, the additional points are considered "extreme" values and are shown separately.

A boxplot can often give a good idea of the data distribution, and is often more useful for comparing distributions side by side, as it is more compact than a histogram. We will see an example soon. We can use the boxplot function to calculate quick summaries for all the variables in our data set; by default, R computes boxplots column by column. Notice that missing data cause no problems for the boxplot function, just as with summary.
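The two uses of boxplot mentioned above can be sketched as follows, assuming the built-in airquality data. The grouping of Ozone by Month in the second call is an illustrative choice.

```r
# One boxplot per column; missing values are dropped silently.
b <- boxplot(airquality, main = "All variables")

# The five-number summaries behind the plot, one column per variable.
print(b$stats)

# Side-by-side comparison of one variable across groups, via a formula.
boxplot(Ozone ~ Month, data = airquality,
        xlab = "Month", ylab = "Ozone (ppb)")
```

The first plot puts all variables on one axis, so it is only informative when the variables are on comparable scales.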

Figure 2. Figure b is not really meaningful as the variables may not be on comparable scales.


