**If you find any of this useful, please consider donating via PayPal to help keep this site going.**

**Email news@statisticool.com to sign up to receive news and updates**

# More on Correlation

**12/16/11**

A year ago I wrote an article discussing the Pearson correlation and some issues to keep in mind when using it. I thought I'd revisit the topic of correlation, with a twist.

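Recall that the sample correlation between two variables X and Y is calculated as

$$r = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} \, \sqrt{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_{i}$.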

With rank defined as $\operatorname{rank}(x_{i}) = \sum_{j=1}^{n} I(x_{j} \le x_{i})$, where I(.) is an indicator function that gives 1 if x_{j} <= x_{i} and 0 otherwise, replace x_{i} by rank(x_{i}) and y_{i} by rank(y_{i}) in the formulas above. What would you get? You'd actually get a very useful correlation coefficient called the Spearman correlation coefficient, which I will denote by r*.
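To make the recipe concrete, here is a minimal Python sketch (Python rather than SAS, purely for illustration) that computes r, then ranks the data with the indicator-sum definition, then computes r* as r on the ranks. The function names `pearson`, `rank`, and `spearman` and the data are made up for this example, and the sketch assumes there are no ties.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation r of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def rank(values):
    """rank(v_i) = number of v_j with v_j <= v_i (the indicator-sum definition)."""
    return [sum(1 for v_j in values if v_j <= v_i) for v_i in values]


def spearman(x, y):
    """Spearman r*: Pearson r computed on the ranks (assumes no ties)."""
    return pearson(rank(x), rank(y))


# Made-up data, not from the article.
x = [1.0, 2.0, 3.0, 4.0, 100.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(rank(x))         # [1, 2, 3, 4, 5]
print(rank(y))         # [2, 1, 4, 3, 5]
print(pearson(x, y))   # r on the raw values
print(spearman(x, y))  # r* = r on the ranks
```

Note that `rank` only looks at the ordering of the values, so the 100 in `x` is treated the same as any other largest value.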

The Spearman correlation coefficient is analogous to r and has the advantage of being less sensitive to outlying data. It also has some disadvantages. You need to think about how to deal with ties in the ranks (there are many ways, the midrank method being the most popular), and it is somewhat difficult to interpret in a population parameter sense. That is, r estimates the true population correlation ρ, but according to Gibbons and Chakraborti (2011, p. 416), r* does not estimate ρ, nor does it estimate some true population rank correlation ρ*. Note also that the SPEARMAN option in PROC CORR in SAS is not valid with the WEIGHT statement. There are many other pros and cons, but this should give a flavor of them.
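As an illustration of the midrank method for ties, here is a small Python sketch. The function name `midrank` and the data are hypothetical, and this is one straightforward way to write it, not necessarily how any particular package implements it. Tied values share the average of the ranks they would jointly occupy.

```python
def midrank(values):
    """Midranks: tied values share the average of the ranks they would occupy."""
    ranks = []
    for v_i in values:
        below = sum(1 for v_j in values if v_j < v_i)   # values strictly smaller
        equal = sum(1 for v_j in values if v_j == v_i)  # size of the tie group (>= 1)
        # The tie group occupies ranks below+1 .. below+equal; take their average.
        ranks.append(below + (equal + 1) / 2)
    return ranks


# Made-up data with one tie: the two 20s would occupy ranks 2 and 3.
print(midrank([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

By contrast, counting the values less than or equal to each value would give both 20s a rank of 3, which is why the choice of tie handling matters.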

In SAS the Spearman correlation coefficient can be computed by running PROC CORR with the SPEARMAN option, like so:

    proc corr data=mydataset spearman;
      var x y;
    run;

Here is an example of a real use of it. Staff had asked me which correlation to report: r based on x and y, r after they had transformed the data by taking the logs of x and y, or both r's. In *this case*, since the log function is monotonically increasing, the rankings are preserved, so I mentioned that r* based on x and y and r* based on log x and log y would be identical, whereas r based on x and y and r based on log x and log y could be (and in this case were) very different. In addition, their data had a few outliers in it, so r* was a decent choice for them to consider.
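The invariance under the log transform can be checked with a short Python sketch. The data below are made up (they are not the staff's data), the helper names are hypothetical, and the sketch assumes positive values with no ties.

```python
from math import log, sqrt


def pearson(x, y):
    """Sample Pearson correlation r of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def rank(values):
    """rank(v_i) = number of v_j with v_j <= v_i (assumes no ties)."""
    return [sum(1 for v_j in values if v_j <= v_i) for v_i in values]


def spearman(x, y):
    """Spearman r*: Pearson r computed on the ranks."""
    return pearson(rank(x), rank(y))


# Made-up positive data with one outlying x value.
x = [1.0, 2.0, 5.0, 10.0, 1000.0]
y = [3.0, 1.0, 8.0, 20.0, 50.0]
log_x = [log(v) for v in x]
log_y = [log(v) for v in y]

# The log is increasing, so the ranks are unchanged.
print(rank(x) == rank(log_x), rank(y) == rank(log_y))   # True True

print(pearson(x, y), pearson(log_x, log_y))    # r changes under the transform
print(spearman(x, y), spearman(log_x, log_y))  # r* is the same both ways
```

Since the ranks are identical before and after the transform, r* is computed from exactly the same numbers in both cases, while r is not.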

The moral of the story is to really think of which type of correlation you want to use and the assumptions that are involved.
