Garbage data at Polygon.io?

Discussion in 'Data Sets and Feeds' started by BlackPhoenix, Feb 18, 2024.

  1. I started using Polygon.io for my historical data, pulling their 5-second aggregates. I hadn't paid much attention to the data quality until now, but I've noticed crazy high/low spikes that appear all over the place, which is obviously a problem for backtesting.

    Below is an example for TSLA (Feb 5th, 2024) from Polygon:
    [chart: Polygon 5-second bars for TSLA showing several high/low spikes]

    The left-most spike is at 11:44:55. Comparing with IBKR 5-second bars in TWS, I don't see such spikes:
    [chart: IBKR 5-second bars for the same session, with no spikes]

    Besides the spikes, the price data matches pretty closely between the two. I first thought it might be an issue in my code that parses the Polygon data, but when I examined the raw data from Polygon I saw this (for the left-most spike):
    Code:
    {"v":32874,"vw":177.9352,"o":177.8991,"c":177.925,"h":177.97,"l":177.8801,"t":1707151485000,"n":308},
    {"v":23869,"vw":177.9224,"o":177.91,"c":177.94,"h":177.94,"l":177.9,"t":1707151490000,"n":155},
    {"v":17460,"vw":178.0978,"o":177.9301,"c":177.9315,"h":184.14,"l":177.9194,"t":1707151495000,"n":174},
    {"v":16172,"vw":177.9604,"o":177.934,"c":177.9564,"h":177.98,"l":177.934,"t":1707151500000,"n":225},
    {"v":19644,"vw":177.9671,"o":177.9575,"c":177.96,"h":177.99,"l":177.95,"t":1707151505000,"n":222},
    
    As you can see, the high for timestamp 1707151495000 (184.14) is over six points above the highs of the surrounding bars, so the anomaly is in the Polygon data stream itself, not in my parsing.

    This is not an isolated incident: I see the same spikes in pretty much every stock, on every day I have checked. They are all over the place.

    So, is IBKR filtering the spikes somehow (meaning these are real prints), or is the Polygon.io data corrupted?
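
    In case anyone wants to reproduce this, here's a rough sketch of the check I use to flag suspect bars. It's plain Python over the JSON objects above; the 1% deviation threshold is an arbitrary choice of mine, not anything Polygon documents:
    Code:
    import json

    # Flag bars whose high or low strays too far from the bar's own VWAP.
    # max_dev of 1% is arbitrary -- tune it to the symbol's normal 5s range.
    def find_spike_bars(bars, max_dev=0.01):
        suspects = []
        for bar in bars:
            vw = bar["vw"]
            if vw <= 0:
                continue
            dev = max(bar["h"] - vw, vw - bar["l"]) / vw
            if dev > max_dev:
                suspects.append((bar["t"], bar["h"], bar["l"], dev))
        return suspects

    raw = '[{"v":17460,"vw":178.0978,"o":177.9301,"c":177.9315,"h":184.14,"l":177.9194,"t":1707151495000,"n":174},{"v":16172,"vw":177.9604,"o":177.934,"c":177.9564,"h":177.98,"l":177.934,"t":1707151500000,"n":225}]'
    for t, h, l, dev in find_spike_bars(json.loads(raw)):
        print(f"t={t} h={h} l={l} dev={dev:.2%}")  # flags only the 184.14 bar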
     
    EdgeHunter, rkr and Quanto like this.
  2. Craig66

    I can't speak for Polygon in particular, but I know these same kinds of spikes show up in IQFeed trade data, where they turn out to be a certain type of trade (I can't remember the specific type). Polygon may be rolling those trades into its aggregate data. Polygon does have a Slack channel where you can ask questions.
     
    EdgeHunter and BlackPhoenix like this.
  3. traider

    How much do you pay for the data?
     
  4. maxinger

    I receive garbage data every now and then even though I pay the data fees.
     
    EdgeHunter likes this.
  5. Did you try Databento?
     
    rkr likes this.
  6. AGREE! I confronted IQFeed's top data guy years ago about these BAD data spikes. His comment: "Gee, I didn't know that... Really..." I think his comment was total BS.

    There are SO many things about trading that make it SO difficult. But having to PAY for something of poor quality that affects your profession is exhaustingly disgusting.
     
    Quanto likes this.
  7. Nope. It could be good for daily incremental updates, but it seems pretty expensive for the initial download.
     
    DarkTemplar likes this.
  8. Mystery solved. It looks like these spikes are dark pool trades. It's insane that Polygon includes dark pool trades in their aggregate data with no way to exclude them (short of downloading tick data). This is not representative of the price data you get live, and it completely messes up backtesting.
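
    For anyone in the same boat, here's a minimal sketch of rebuilding 5-second bars from tick data yourself, skipping trades whose condition codes shouldn't shape the candle. Field names follow Polygon's v3 trades responses as I understand them, and which codes to exclude is an assumption on my part; resolve the real mapping via Polygon's conditions reference:
    Code:
    # Condition codes to drop when building bars. The specific codes here
    # are hypothetical placeholders -- look up which conditions should not
    # update high/low in Polygon's conditions reference before relying on this.
    EXCLUDED_CONDITIONS = {10, 22}

    def aggregate_5s(trades):
        """5-second OHLCV bars from trade dicts with keys 'price',
        'size', 'participant_timestamp' (ns), and 'conditions'."""
        bars = {}
        for tr in trades:
            if EXCLUDED_CONDITIONS & set(tr.get("conditions", [])):
                continue  # skip prints that shouldn't shape the candle
            bucket = tr["participant_timestamp"] // 5_000_000_000  # ns -> 5s
            p, v = tr["price"], tr["size"]
            b = bars.get(bucket)
            if b is None:
                bars[bucket] = {"o": p, "h": p, "l": p, "c": p, "v": v}
            else:
                b["h"] = max(b["h"], p)
                b["l"] = min(b["l"], p)
                b["c"] = p
                b["v"] += v
        return [{"t": k * 5000, **bars[k]} for k in sorted(bars)]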
     
  9. Polygon.io (Sponsor)

    Hey, Jack from Polygon here. I want to clarify that we do not incorporate ineligible dark pool trades into our aggregates. On investigation, there appears to be a bug related to how we handle specific exchange messages.

    In this particular case, we received two trade messages: one with a Prior Reference Price (PRP) and Trade Thru condition, and another with no conditions attached. It's important to note that we exclude PRP or Trade Thru trades from our aggregates. Therefore, it's the second message that has affected the candlestick data.

    I'm currently investigating why the second message was reported with no conditions. Apologies for the confusion.
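
    To make the mechanics concrete, here's a toy sketch of the failure mode (the numeric codes and message shapes are placeholders for illustration, not our real feed encoding):
    Code:
    # Two messages for the same off-market print. The first carries the
    # PRP / Trade Thru conditions and is correctly excluded; the second
    # arrived with no conditions, so it slipped into the aggregate.
    PRIOR_REFERENCE_PRICE = 61  # placeholder code
    TRADE_THRU = 62             # placeholder code
    EXCLUDE = {PRIOR_REFERENCE_PRICE, TRADE_THRU}

    msgs = [
        {"price": 184.14, "conditions": [61, 62]},  # excluded from bars
        {"price": 184.14, "conditions": []},        # wrongly eligible
    ]

    eligible = [m for m in msgs if not EXCLUDE & set(m["conditions"])]
    print(max(m["price"] for m in eligible))  # 184.14 -> spikes the bar high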
     
    BlackPhoenix likes this.
  10. Thanks for having a look at it. Once you find the bug, I'd like to know whether it affects your entire database and thus requires me to re-download all the data. Luckily my ISP doesn't have a data cap. Meanwhile I'll write some kind of data integrity check to estimate how much of my ~200GB downloaded JSON database is corrupted, as sketched below.
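
    The check is just a streaming scan over the stored files, counting bars whose high/low strays more than some threshold from the bar's VWAP. The file layout (one gzipped JSON array per symbol-day) and the 1% threshold are my own assumptions:
    Code:
    import gzip, json, pathlib

    def scan(root, max_dev=0.01):
        # Count suspect bars across every stored aggregate file.
        total = suspect = 0
        for path in pathlib.Path(root).rglob("*.json.gz"):
            with gzip.open(path, "rt") as f:
                for bar in json.load(f):
                    total += 1
                    vw = bar["vw"]
                    if vw > 0 and max(bar["h"] - vw, vw - bar["l"]) / vw > max_dev:
                        suspect += 1
        print(f"{suspect}/{total} bars suspect ({suspect / max(total, 1):.4%})")

    scan("polygon_aggs")  # hypothetical root directory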
     
    #10     Feb 20, 2024