Correlations math

Discussion in 'Strategy Building' started by thecoder, Dec 2, 2021.

  1. thecoder

    I plan to calculate the correlation number for each pair of S&P 500 stocks using EOD data.
    This seems to be the formula for the number of time-series comparisons needed:
    500 * 499 / 2 = 124750
    IMO that is a lot of work for the computer.
    Is there a faster algorithm or method that needs fewer comparisons? Does anybody know?

    What about this method: compare each stock to the same fixed one. Wouldn't that give the same rank information? Rank information is entirely sufficient, IMO. This method would require only 499 comparisons (if the fixed one is among the 500).
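A quick way to test the fixed-reference idea is a toy example. The sketch below uses made-up four-point series (ref, s1, s2, s3 are hypothetical names, not real tickers) and shows that series with the same correlation to the reference can still be perfectly correlated or perfectly anti-correlated with each other, so the 499 numbers alone cannot recover the pairwise picture.

```r
# Hypothetical toy series, chosen so the arithmetic is easy to check by hand.
ref <- c(1, -1,  1, -1)   # the fixed reference series
s1  <- c(1,  1, -1, -1)
s2  <- c(1,  1, -1, -1)   # identical to s1
s3  <- c(-1, -1, 1,  1)   # exact opposite of s1

# All three have zero correlation with the reference ...
cor_with_ref <- c(s1 = cor(ref, s1), s2 = cor(ref, s2), s3 = cor(ref, s3))
print(cor_with_ref)

# ... yet their correlations with each other are as different as they can be.
print(cor(s1, s2))   # s1 and s2 move together
print(cor(s1, s3))   # s1 and s3 move in opposite directions
```

Against the reference, s1, s2 and s3 are indistinguishable, while the full pairwise comparison separates them cleanly.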
     
    Last edited: Dec 2, 2021
  2. I'm not sure what you mean by 'rank information' so the value of my answer depends on what you mean there, but you will, of course, lose information by only running the correlations against one fixed stock instead of all pairs. It depends on what type of information is important to you.

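    On my computer, I estimate an upper limit of ~18 sec total run time in R for all 124,750 pairs, assuming 20 years of EOD data for each pair (124,750 pairs x the ~143 microsecond mean per call in the benchmark below). (Obviously some pairs would have more data than others.)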

    x = rnorm(20*252)
    y = rnorm(20*252)

    microbenchmark::microbenchmark(
      cor(x, y)
    )

    Unit: microseconds
          expr  min   lq    mean median    uq    max neval
     cor(x, y) 61.6 61.9 143.092   62.2 62.65 8072.8   100

    If your computer is A LOT slower than mine, I would suggest just running it overnight and saving it for future reference.
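One possible shortcut, sketched here with simulated data rather than real prices: base R's cor() also accepts a matrix and returns the correlations between all of its columns in a single call, so no explicit loop over the 124,750 pairs is needed. The matrix name `returns` and the column labels are made up for illustration, and whether this beats a pair-by-pair loop on your machine is something to time yourself.

```r
set.seed(1)                         # simulated data, not real prices
n_days   <- 20 * 252                # roughly 20 years of trading days
n_stocks <- 500

# One column per (hypothetical) stock, one row per trading day.
returns <- matrix(rnorm(n_days * n_stocks), nrow = n_days, ncol = n_stocks)
colnames(returns) <- paste0("S", seq_len(n_stocks))

# All pairwise correlations in one call; system.time() reports how long it took.
timing  <- system.time(cor_mat <- cor(returns))
print(timing)

print(dim(cor_mat))                 # one row and one column per stock
n_pairs <- sum(upper.tri(cor_mat))  # number of distinct pairs above the diagonal
print(n_pairs)
```

The result is a symmetric 500 x 500 matrix with ones on the diagonal, so only the upper (or lower) triangle needs to be stored or inspected.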
     
    Last edited: Dec 2, 2021
  3. rb7

    IMO, your math is wrong.
    It would be more like (500 * 500) - 500.
    Even if this comes out to double what you calculated, it's a piece of cake in terms of computer work. It also depends on how far back you want to calculate the correlation, but a couple of years back shouldn't take very long, computation-wise.

    We're talking about EOD data here, not tick data.
     
    d08 and Statistical Trader like this.
  4. Edited my answer above. Thanks to rb7. Accidentally mistook microseconds for milliseconds and corrected my reply.

    As for the number of comparisons, it's irrelevant in terms of computation time.

    That said, for interest's sake, it's not (500 * 500) - 500, because half the correlations are identical: cor(x,y) = cor(y,x). So it's (500^2 - 500) / 2, which is equal to the 500*499/2 given by the OP.
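For anyone who wants to check the counting, here is a small sketch of the three numbers that have come up in this thread (the variable names are just for illustration):

```r
n <- 500

full_matrix  <- n^2             # every cell of the 500 x 500 correlation matrix
off_diagonal <- n^2 - n         # drop the diagonal, where cor(x, x) = 1
unique_pairs <- (n^2 - n) / 2   # keep one of cor(x, y) and cor(y, x)

print(c(full_matrix, off_diagonal, unique_pairs))

# The same count written as n * (n - 1) / 2 and as "500 choose 2".
stopifnot(unique_pairs == n * (n - 1) / 2)
stopifnot(unique_pairs == choose(n, 2))
```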
     
    thecoder and rb7 like this.
  5. thecoder

    Hmm, why that formula, and why not the one I posted? It looks so obvious to me.

    You both could be right about the runtime with EOD data.
    I have fast multicore machines here, so that won't be the problem. I was just looking for an optimized algorithm, if one exists.
     
  6. rb7

    StatTrader corrected the calculation error in my previous post.

    But like I wrote earlier, it doesn't come out to a very large number of calculations (for a computer!). An optimized algo will save you some machine time, but by the time you find and implement it, it's not worth it. Performance enhancement is, IMO, the last thing you should worry about.
     
    thecoder and Statistical Trader like this.
  7. thecoder

    I mean sorting the computed correlation numbers, each with its associated ticker, in descending order, and then using the resulting rank number 1...500.
     

  8. I think rb7 took the entire correlation matrix (500*500) and subtracted the diagonal of correlations equal to one (500). He/She just forgot to divide by 2 because cor(x,y) = cor(y,x)
     
    thecoder likes this.
  9. I agree with rb7 that performance enhancement isn't important here. The only exception might be with high frequency data, especially if you are high frequency trading based on real time correlations, but I don't think this is what you're doing.
     
  10. Okay, so in that case, no, you definitely lose information by only calculating the correlation against a single stock. IMO you'll have to think more about your definition because the order will change depending on which stock you hold constant.
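To illustrate how the order depends on the stock held constant, here is a toy sketch with four made-up series (A, B, C, D are hypothetical, and rank_against is a helper written just for this example). It ranks the other series by their correlation with a chosen reference, in descending order.

```r
# Hypothetical toy series, one column per "stock".
series <- cbind(
  A = c(1,  1, -1, -1),
  B = c(1, -1,  1, -1),
  C = c(3,  1, -1, -3),
  D = c(3, -1,  1, -3)
)
cm <- cor(series)   # full pairwise correlation matrix

# Rank all other series by correlation with `ref`, highest first.
rank_against <- function(cm, ref) {
  others <- setdiff(colnames(cm), ref)
  names(sort(cm[others, ref], decreasing = TRUE))
}

rank_vs_A <- rank_against(cm, "A")
rank_vs_B <- rank_against(cm, "B")
print(rank_vs_A)   # "C" "D" "B"
print(rank_vs_B)   # "D" "C" "A"
```

With A held constant, C ranks above D; with B held constant, the two swap places. The ranking is a property of the chosen reference, not of the stocks alone.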
     
    #10     Dec 2, 2021
    thecoder likes this.