Suggestions for options database structure?

Discussion in 'App Development' started by SusquehannaVanna, Jan 25, 2021.

  1. I've been scraping the CME futures options settlements for a while now and finally have the free time to compile the data, with the eventual goal of building some trading models. I was wondering if anyone here has suggestions on best practices with regard to the type and structure of an options-specific database, as my knowledge of DBMSs is pretty limited.

    I work mostly in Python and have been happy using SQLite for years (albeit only for linear instruments), but I'm hesitant to take the same SQL approach for options when so much of my data will be repeated (underlying, expiry, fwds, etc.).
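
    For what it's worth, the kind of layout I had been picturing to cut down on the repetition (table and column names are made up, and this is only a rough sketch) is something like:

    Code:
    import sqlite3

    con = sqlite3.connect("options_eod.db")
    con.executescript("""
        -- one row per contract: underlying/expiry/strike stored once, not once per quote
        CREATE TABLE IF NOT EXISTS contracts (
            contract_id INTEGER PRIMARY KEY,
            underlying  TEXT NOT NULL,
            expiry      TEXT NOT NULL,    -- ISO yyyy-mm-dd
            strike      REAL NOT NULL,
            put_call    TEXT NOT NULL,    -- 'P' or 'C'
            UNIQUE (underlying, expiry, strike, put_call)
        );

        -- one row per contract per settlement date
        CREATE TABLE IF NOT EXISTS settlements (
            contract_id   INTEGER NOT NULL REFERENCES contracts(contract_id),
            quote_date    TEXT NOT NULL,
            settle        REAL,
            fwd           REAL,           -- arguably belongs in its own (underlying, expiry, date) table
            open_interest INTEGER,
            PRIMARY KEY (contract_id, quote_date)
        );
    """)
    con.commit()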

    The strategies I have in mind are pretty low frequency, so I'm happy with daily data for now, and while I'll probably buy intraday data in the future, I'll never need to store tick data. I care more about query speed, since I plan on eventually trading many products across many strikes/expirations, and I also don't want the database size to explode.

    I've heard good things about Arctic/MongoDB and have also received the suggestion of just saving to local HDF5 files rather than using a db.

    All suggestions greatly appreciated.
     
  2. ensemble

    Assuming you aren't collecting quotes in real time, here is what I do for EOD with the CBOE daily files. I have a small Java/Spring app that polls the CBOE SFTP site. When a new file arrives, it is copied to HDFS. I have a temp table (schema below) that can query the CSV file directly using Impala. The program executes an INSERT INTO to copy (append) the data from the temp table into a Parquet table with the same schema for permanent storage. The Parquet table is partitioned by underlying_symbol and root. At last count, I have 867,175,033 rows in this table.

    I was briefly collecting real-time data scraped from the IB API and attempted to use Apache Kudu (on the Cloudera stack), which I would never recommend. HBase or Cassandra are still reliable choices, but there are plenty of other open-source time series databases now.

    Code:
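    -- external temp table: reads the raw CBOE daily CSV straight from HDFS; the header row is skipped via tblproperties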
    create external table trading.options_history_eod_tmp
    (
        underlying_symbol             STRING,
        quote_date                    STRING,
        root                          STRING,
        expiration                    STRING,
        strike                        STRING,
        option_type                   STRING,
        open                          DOUBLE,
        high                          DOUBLE,
        low                           DOUBLE,
        close                         DOUBLE,
        trade_volume                  INT,
        bid_size_1545                 INT,
        bid_1545                      DOUBLE,
        ask_size_1545                 INT,
        ask_1545                      DOUBLE,
        underlying_bid_1545           DOUBLE,
        underlying_ask_1545           DOUBLE,
        implied_underlying_price_1545 DOUBLE,
        active_underlying_price_1545  DOUBLE,
        implied_volatility_1545       DOUBLE,
        delta_1545                    DOUBLE,
        gamma_1545                    DOUBLE,
        theta_1545                    DOUBLE,
        vega_1545                     DOUBLE,
        rho_1545                      DOUBLE,
        bid_size_eod                  INT,
        bid_eod                       DOUBLE,
        ask_size_eod                  INT,
        ask_eod                       DOUBLE,
        underlying_bid_eod            DOUBLE,
        underlying_ask_eod            DOUBLE,
        vwap                          DOUBLE,
        open_interest                 INT,
        delivery_code                 STRING
    )
        row format delimited
            fields terminated by ','
            lines terminated by '\n'
        STORED AS TEXTFILE LOCATION '/user/impala/cboe/options_history_eod_tmp'
        tblproperties ("skip.header.line.count" = "1");
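
    The append step is roughly the following. This is only a sketch: just a handful of the columns are shown, the host name is made up, and it's written with Python/impyla here purely for illustration (my app does the same thing from Java). The one thing to note is that with Impala's dynamic partitioning the partition columns have to come last in the SELECT list.

    Code:
    from impala.dbapi import connect   # pip install impyla

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # Permanent Parquet table, partitioned by underlying_symbol and root
    # (the real table mirrors the full temp-table schema above)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS trading.options_history_eod (
            quote_date              STRING,
            expiration              STRING,
            strike                  STRING,
            option_type             STRING,
            trade_volume            INT,
            implied_volatility_1545 DOUBLE,
            open_interest           INT
        )
        PARTITIONED BY (underlying_symbol STRING, root STRING)
        STORED AS PARQUET
    """)

    # Append the newly arrived file from the CSV-backed temp table;
    # the partition columns (underlying_symbol, root) go last in the SELECT
    cur.execute("""
        INSERT INTO trading.options_history_eod PARTITION (underlying_symbol, root)
        SELECT quote_date, expiration, strike, option_type,
               trade_volume, implied_volatility_1545, open_interest,
               underlying_symbol, root
        FROM trading.options_history_eod_tmp
    """)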
     
  3. I'm also low frequency, but in futures rather than options, so my data will have less breadth than yours but possibly more history (I store up to 40 years of daily data, plus hourly data from the last 7 years or so since I started actively trading).

    If the only reason you're thinking of switching is the repeated data, is the issue that the database is getting too big (which seems unlikely given it's daily data), or are you concerned about the speed and efficiency of loading the data? You could consider doing some performance tuning in SQLite, which mostly amounts to deciding how to partition tables and files; it's the sort of thing Mongo does for you, but if you understand your data structure and how it is accessed you can probably do it competently yourself.
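
    For example, one way to do it (file and instrument names made up; daily data only, to keep the sketch short):

    Code:
    import sqlite3

    # Crude "partitioning": one file per instrument, so any given query
    # only ever has to touch a small database file
    con = sqlite3.connect("prices_EDOLLAR.db")
    con.execute("PRAGMA journal_mode=WAL")   # better read/write concurrency for a local file DB

    con.execute("""
        CREATE TABLE IF NOT EXISTS daily_prices (
            price_date TEXT PRIMARY KEY,     -- ISO yyyy-mm-dd; the PK doubles as the date index
            close      REAL
        )
    """)
    con.commit()

    # Cross-instrument queries are still possible by attaching another file
    con.execute("ATTACH DATABASE 'prices_CORN.db' AS corn")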

    I used SQLite for years with only minor issues and I now use Arctic/MongoDB. The advantage of Arctic (apart from speed) is that you can just save and retrieve pandas DataFrames without worrying about what's going on under the hood. With SQLite you do need to do the low-level thing of thinking about how the data is actually stored and writing some SQL. I've got in the habit of using Mongo for everything; you just write methods that convert your objects to and from dicts and then save them. It's very pythonic and very neat.
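
    A minimal sketch of the Arctic pattern (library and symbol names made up; assumes MongoDB is running locally):

    Code:
    import pandas as pd
    from arctic import Arctic

    store = Arctic("localhost")
    if "futures_daily" not in store.list_libraries():
        store.initialize_library("futures_daily")
    lib = store["futures_daily"]

    prices = pd.DataFrame(
        {"close": [101.2, 101.9]},
        index=pd.to_datetime(["2021-01-22", "2021-01-25"]),
    )

    # write and read back a DataFrame; Arctic handles the chunking/compression
    lib.write("CORN", prices, metadata={"source": "sketch"})
    item = lib.read("CORN")
    print(item.data.tail(), item.metadata)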

    The one thing I would say is that I've had a couple of bad experiences with corrupted Mongo data from which there is no easy way to recover, so I'd recommend not only backing up the Mongo files but also saving the data in another format (I use .csv, since that can never go wrong, can be manually hacked if it does, and something can always read it!).

    Alternatives are things like http://www.pytables.org/ (which I used to use to store my diagnostic data), just pickling the DataFrames, saving as JSON... it might be worth experimenting with a few options, since migrating data can be quite painful and you probably don't want to do it very often.
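
    e.g. the pandas/PyTables route is a one-liner each way (file and key names made up):

    Code:
    import pandas as pd

    prices = pd.DataFrame(
        {"close": [101.2, 101.9]},
        index=pd.to_datetime(["2021-01-22", "2021-01-25"]),
    )

    prices.to_hdf("diagnostics.h5", key="corn_daily", mode="a")   # HDF5 via PyTables (pip install tables)
    roundtrip = pd.read_hdf("diagnostics.h5", "corn_daily")

    prices.to_pickle("corn_daily.pkl")                            # or just pickle the DataFrame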

    GAT
     
  4. Thank you both for the thoughtful replies. I am new here and am still making my way through some of the older threads, but it seems like Mongo is pretty well liked. My db knowledge is very basic, so I think I'll take this opportunity to use Mongo and dive into the NoSQL world.

    From my brief check last night, I think I'll really like it. I'm an OTC derivs guy, so I think in smooth curves/surfaces. One of the roadblocks I've hit in the past was the time it took to query historical time series of interpolated data (probably just due to my poor db/programming skills, but the time has come to do it the right way). With Mongo, it seems I'll have no problem storing the calibrated parameters, and potentially the models themselves, alongside the price data, and in theory I should be able to quickly get what I'm looking for.
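
    Something like this is what I have in mind (collection and field names are placeholders, the model class is a stand-in, and it assumes a local MongoDB):

    Code:
    import pickle
    import pymongo
    from bson.binary import Binary

    db = pymongo.MongoClient("localhost", 27017)["derivs"]

    class FlatVolModel:
        """Stand-in for a real calibrated smile/surface model."""
        def __init__(self, vol):
            self.vol = vol
        def implied_vol(self, strike):
            return self.vol

    model = FlatVolModel(vol=0.25)

    # one document per (underlying, expiry, quote_date) calibration
    db.calibrations.insert_one({
        "underlying": "CL",
        "expiry": "2021-03-17",
        "quote_date": "2021-01-25",
        "method": "flat",                              # e.g. SABR, SVI, ...
        "fwd": 52.6,
        "params": {"vol": model.vol},
        "model_blob": Binary(pickle.dumps(model)),     # the calibrated object itself
    })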
     
  5. Here's a database structure for the options data you can purchase from historicaloptiondata.com:
    - https://github.com/aquarians/Public...ester/src/main/resources/scripts/database.sql

    You can also take a look at the whole open-sourced project (Java-based) to see some basic usage of that options data.

    Discussed on another thread too:
    https://www.elitetrader.com/et/threads/future-of-options.354316/page-5#post-5311584
     
  6. Thanks, that looks interesting; I'll have to take a closer look. I already migrated my data over to Arctic/MongoDB and have been extremely happy with it so far.

    For those interested, the migration was pretty simple. I also found it very convenient and clean to store my pickled model objects in the same database, alongside previously computed parameters, so I don't have to recalibrate each time I query. I can now quickly query things like a historical series of interpolated 1m ATM (or any strike) prices/vols/greeks across different modeling methods, which was really my goal.
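
    Pulling a historical ATM series back out then looks roughly like this (a sketch continuing the placeholder schema from my earlier post; the model class has to be importable when unpickling, and implied_vol is whatever interface the stored model exposes):

    Code:
    import pickle
    import pymongo

    db = pymongo.MongoClient("localhost", 27017)["derivs"]

    # historical series of interpolated ATM vols for one underlying and method,
    # evaluated by unpickling each day's calibrated model
    cursor = db.calibrations.find(
        {"underlying": "CL", "method": "flat"}
    ).sort("quote_date", pymongo.ASCENDING)

    atm_vols = {}
    for doc in cursor:
        model = pickle.loads(doc["model_blob"])
        atm_vols[doc["quote_date"]] = model.implied_vol(strike=doc["fwd"])   # ATM = strike at the forward

    print(atm_vols)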
     
  7. rkr

    It's hard to do this very wrong for options EOD; just do whatever you're most familiar with. I think the symbology and the collection process itself matter more than the database or file system.

    For options on futures, two obvious aspects of the symbology are the continuous contract representation and the % move representation.
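
    e.g. a % move key might look like this (function name and bucketing convention are made up; plenty of shops normalize by vol*sqrt(T) instead):

    Code:
    import math

    def pct_move_key(strike: float, forward: float, bucket: float = 0.005) -> float:
        """Express a strike as a log % move from the forward, rounded into buckets,
        so the same key lines up across expirations and across contract rolls."""
        pct = math.log(strike / forward)
        return round(pct / bucket) * bucket

    # a 105 strike with the forward at 100 maps to roughly the +5% move bucket
    print(pct_move_key(105.0, 100.0))   # ~0.05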

    As for EOD collection, the Cboe day-file approach is a decent way to denoise the data for modeling, especially in single-name equity options, though it's maybe less important here since there's no auction imbalance to deal with in futures. In any case, I recommend that you also snap the EOD at the time of PnL attribution for post-trade analysis.
     
  8. igr

    For folks collecting the data over the years, I'm curious how you are ensuring redundancy. I imagine the data is valuable, and it'd suck if a hard drive went out or something like that.

    I imagine you subscribe to a stream of quotes and prices and try to store them, so it's important to stay online during the day. I wonder if you do something interesting when rolling out new versions, too.
     
  9. Back up all data files to .csv in addition to the database. Daily backups of the .csv and database files go to two separate backup drives. I keep a backup machine and rotate the live and backup machines regularly; this means there is always another copy of the data, on another machine, that is never more than a couple of weeks old. Occasional backups to USB sticks and my laptop when I'm doing research.

    GAT
     
  10. eo1989

    Is there any reason why you chose to store only the data point at time 1545 and not at 1559 or 1600? Or is it something that CBOE does on their end? (Now that I mention it, I do recall their site stating something along these lines...)
     