Thanks guys for all the responses, great comments. To give a little more background: my testing involves running a stat analysis program against single/multiple products hundreds of thousands of times. The closest thing I could compare it to would be a brute-force optimization engine looking at up to 10+ variables in a run on a LOT of data. It takes days sometimes. It's definitely not realtime. The results are just summaries and stats, small files. The main bottlenecks are h/w, my code, and file I/O. H/w and code I can optimize, but I currently have all my data in ASCII files and am just trying to figure out the best route to go. I'm not looking for advice on how to set up a database, but rather a comparison of their efficiency against binary files in different applications. Please, don't think I'm looking for application design advice for free (geez). If I conclude that dbs are the way to go, I'll pay someone or do it myself... I promise I won't ask you for help, linuxtrader. Anyway, thanks again for the input... it has indeed helped a great deal. onelot
I'm currently coding my own backtest engine and, for the time being, am storing all my data in text files. However, I do anticipate problems when I start to work with intraday data and have therefore recently spent some time investigating the storage issues. I came across this page: http://www.cs.nyu.edu/shasha/papers/jagtalk.html which recommends an array-based storage approach. It seems the ideal database would be something like kdb (or FAME), but they're probably out of my price range. I've done some further digging and came across two free databases which may be suitable for storing price data:

Metakit: http://www.equi4.com/mkviews.html - a simple (perhaps too simple?) embedded db
MonetDB: http://monetdb.cwi.nl/ - seems like a pretty feature-rich OLAP engine

However, neither of them seems to have a mass following, and that makes me a bit too uncomfortable to commit. I wonder if the programming gurus here have any experience with them and can comment on their suitability as a DB backend for a backtest engine?

Curtis, I understand that you're using an embedded db engine for your Veritrader. Was that DB developed in-house or is it a 3rd-party app? If it's a 3rd-party app, would you mind telling us the name of the vendor?

Also, if anyone is using a fast, simple embedded db (preferably free and not Windows-only), I would like to know what you're using. Thanks!
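P.S. For what it's worth, the array-based idea from that NYU talk can be roughed out with plain flat files: store each field of the series as its own contiguous array on disk instead of record by record, so a scan only reads the columns it needs. A minimal Python/NumPy sketch of that layout (the file naming and the OHLC fields are just my own assumptions, not anything from kdb or MonetDB):

import numpy as np

# Column-oriented storage: one binary file per field instead of one record per bar.

def save_columns(prefix, timestamps, opens, highs, lows, closes):
    """Write each column as a separate .npy file, contiguous on disk."""
    np.save(prefix + "_time.npy", np.asarray(timestamps, dtype="int64"))
    for name, col in (("open", opens), ("high", highs),
                      ("low", lows), ("close", closes)):
        np.save(prefix + "_" + name + ".npy", np.asarray(col, dtype="float64"))

def load_column(prefix, name):
    """Memory-map a single column so only the data actually touched is read."""
    return np.load(prefix + "_" + name + ".npy", mmap_mode="r")

# A close-only statistic never pulls the other columns off disk:
# closes = load_column("ES_1min", "close"); print(closes.mean())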
Choppystride, wow, thanks for the great link... that's pretty much exactly what I was looking for. I'm going to read it in more depth at the close. You mentioned being interested in kdb... someone mentioned here that kdb has a free version for non-corporate users. When I searched their site I found some references to it in their user forums... apparently, it's a no-limit time trial of their older db product. Might be worth a look. Thanks again for the link. onelot
...Loading everything into memory can work - or not. More often than not the code and the design are the problem; that is, somewhere you have a limiting algorithm. Also, when you design your applications you need to put instrumentation code into them that allows for performance monitoring beyond what the OS provides ... few people think about this issue. It should be a simple thing to insert some monitoring code and run a few tests to see the easiest way to go with your application. Later you can add more comprehensive monitoring code if this app has any long-term value. Good luck .. and by the way, I am not looking for work, but there are plenty of people on this board who are .....
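P.S. To make the instrumentation point concrete, here is a minimal sketch of the kind of timing hooks I mean, in Python (the stage names and the decorator are purely illustrative, not anyone's actual engine):

import time
from collections import defaultdict
from functools import wraps

# Accumulate wall-clock time per labeled stage to see where a run actually goes.
_timings = defaultdict(float)
_calls = defaultdict(int)

def timed(stage):
    """Decorator that charges a function's elapsed time to a named stage."""
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                _timings[stage] += time.perf_counter() - start
                _calls[stage] += 1
        return inner
    return wrap

def report():
    """Print total seconds and call counts per stage, slowest first."""
    for stage, secs in sorted(_timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:20s} {secs:10.2f}s over {_calls[stage]} calls")

# Usage: decorate the suspects, e.g. @timed("load_data") on the file reader and
# @timed("stat_pass") on the analysis inner loop, then call report() after a run.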
...but kx also states: "Evaluation software is for non-commercial use only and provides limited workspace and messaging" - and that kinda scared me off.
Sure, every application is different, etc. But I think there is a right way to handle large amounts of time series data, and that is to keep it in a true transactional database and then cache as necessary for performance. Using a database has all sorts of advantages (transactions, ease of backup and restore, strong typing, powerful queries, flexibility). And a good database engine will perform well. If you end up having performance problems with your database when analyzing time series, then you should cache the data within your analysis program, either on disk or in memory depending on your requirements. My data capture, backtesting and automated trading use the same PostgreSQL database - often at the same time. I load data from the database into arrays through a disk cache and a memory cache. All written in Python. Martin
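P.S. A rough illustration of that pattern, assuming a PostgreSQL table of daily bars and using psycopg2 plus NumPy (the table name, columns and cache location below are placeholders, not my actual schema):

import os
import numpy as np
import psycopg2  # PostgreSQL driver; any DB-API driver would do

CACHE_DIR = "cache"  # assumed location of the on-disk cache

def load_closes(symbol, conn):
    """Return closing prices for a symbol as a NumPy array.

    Checks a simple .npy disk cache first; on a miss, queries the
    database and writes the cache for the next run.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, symbol + "_close.npy")
    if os.path.exists(cache_file):
        return np.load(cache_file)

    cur = conn.cursor()
    cur.execute(
        "SELECT close FROM daily_bars WHERE symbol = %s ORDER BY bar_date",
        (symbol,),
    )
    closes = np.array([row[0] for row in cur.fetchall()], dtype="float64")
    np.save(cache_file, closes)
    return closes

# conn = psycopg2.connect("dbname=markets")
# closes = load_closes("MSFT", conn)

Martin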
Sparohok> Sure, every application is different, etc. But I think there is a right way to handle large amounts of time series data, and that is to keep it in a true transactional database and then cache as necessary for performance.

I think this depends upon your definition of "time series data." If you are referring to one OHLC bar per day per instrument, then I could see the merits of your argument. If you mean intraday time-series data, perhaps down to the resolution of each tick in the Time and Sales data, then I'd disagree in most cases.

------------------------------------------

onelot> My testing involves running a stat analysis program against single/multiple products hundreds of thousands of times. The closest thing I could compare it to would be a brute-force optimization engine looking at up to 10+ variables in a run on a LOT of data. It takes days sometimes. It's definitely not realtime. The results are just summaries and stats, small files. The main bottlenecks are h/w, my code, and file I/O. H/w and code I can optimize, but I currently have all data in ASCII and am just trying to figure out the best route to go.

This sounds like the kind of workload on which institutionals can spend considerable resources. Assuming you're trying to do this on a private trader's budget, you probably won't get it as fast as they do, but you may get close enough to tolerate.

For some context, consider that a strict interpretation of modern portfolio theory would require that beta for each stock in the S&P 500 be calculated by correlating against all 499 of the other stocks, which works out to roughly 250,000 time-series correlations. The common shortcut is to calculate beta as the correlation between each stock's time series and just the index itself. Very few institutionals will invest what is required for the strict calculation. If the institutionals take such shortcuts, then might you adapt this notion to your own needs? (A short NumPy sketch of the shortcut follows the list at the end of this post.)

__________________________________________________________

linuxtrader> The question here is one of proper application design. .... The toughest part of any application is figuring out the requirements and the most cost-effective solution.

With what little detail is available, it sounds like each input parameter is being varied across a wide range in order to find the values that achieve the optimal end result for the whole system. If so, then perhaps this problem is a candidate for linear programming methods, as an alternative to brute-force numerical methods. On a related tack, if a way can be found to express the problem algebraically, then it might be possible to calculate the partial derivatives of the result with respect to each input parameter, giving the rate of change of the net result as each parameter changes. Knowing this might allow a binary search within the range of values for each parameter, so as to navigate the solution space in a more precise and direct manner.

__________________________________________________________

Otherwise, for optimization on a budget, perhaps consider some of the following:

1) Hawk whatever you must to get money for a motherboard allowing more RAM, and stuff it with several GBs.
1a) The most economical way to do this is probably with an AMD Opteron or FX-51/53.
1b) More memory is more important than a fast CPU or fast disks.
2) If you can't afford fancy disk controllers, then at least buy multiple cheap disks and cheap controllers, with input files on different disks than your output files.
3) If you can't spread files out on a one-per-disk basis, then make partitions as small as possible while still keeping all files within one partition on any given disk, in order to minimize seeks across the whole disk.
4) ASCII-encoded numerical data is not only bulkier on disk and in memory, it also takes time to convert to numerical equivalents. Therefore, have a preprocessor which converts your persistent-store ASCII versions into binary for use at run time (see the struct-based sketch after this list).
5) If you will not be doing purely sequential access through each file, but rather will be seeking to random positions within it, then you may need an indexing scheme, whether you write it or you buy it. (If someone searches sequentially to find a row matching desired contents, then that's CS blasphemy, aka PEBCAK.)
6) Consider scrounging up multiple old retired computers with at least P4 1 GHz CPUs, so that each subset of the total job can run on a dedicated machine and not have to compete for disk access.
7) Use an O/S with less overhead burden upon the h/w, i.e. Unix or Linux.
8) If you must use Windoze, then turn off the superfluous services you don't need.
9) If in a prop trading shop, consider negotiating to be allowed to run parallel distributed jobs across all workstations in the office, kind of like the SETI project, but only out of hours.
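To illustrate the beta shortcut mentioned above: instead of working through the full 500 x 500 correlation matrix, each stock's returns are compared only against the index. A minimal NumPy sketch (the return arrays are assumed inputs; nothing here comes from a real data feed):

import numpy as np

def beta_vs_index(stock_returns, index_returns):
    """Single-index beta: cov(stock, index) / var(index).

    Both arguments are 1-D arrays of periodic returns of equal length.
    """
    stock = np.asarray(stock_returns, dtype="float64")
    index = np.asarray(index_returns, dtype="float64")
    cov = np.cov(stock, index)      # 2x2 covariance matrix
    return cov[0, 1] / cov[1, 1]

# One pass over the S&P 500 costs 500 of these calls, versus the
# ~250,000 pairwise calculations of the "strict" approach.
# betas = {sym: beta_vs_index(returns[sym], returns["SPX"]) for sym in symbols}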
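And for point 4, here is the kind of one-time preprocessor I mean: convert the ASCII price files into packed fixed-width binary records once, then seek or mmap straight into them at run time. Python with the struct module; the date,open,high,low,close,volume CSV layout is just an assumption about onelot's files:

import csv
import struct

# Fixed-width record: date as int (YYYYMMDD), four doubles, one 64-bit volume.
RECORD = struct.Struct("<iddddq")   # 44 bytes per bar, little-endian, no padding

def ascii_to_binary(csv_path, bin_path):
    """One-time conversion of a date,open,high,low,close,volume CSV to binary."""
    with open(csv_path, newline="") as src, open(bin_path, "wb") as dst:
        for row in csv.reader(src):
            date = int(row[0].replace("-", ""))     # e.g. 2004-06-15 -> 20040615
            o, h, l, c = (float(x) for x in row[1:5])
            vol = int(row[5])
            dst.write(RECORD.pack(date, o, h, l, c, vol))

def read_bar(bin_path, n):
    """Random access: seek straight to the n-th bar without parsing earlier rows."""
    with open(bin_path, "rb") as f:
        f.seek(n * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))

# ascii_to_binary("ES_daily.csv", "ES_daily.bin")
# print(read_bar("ES_daily.bin", 100))   # (date, open, high, low, close, volume)

A run that reads the binary files skips the parse step and moves far fewer bytes, which is often where the file I/O time goes in a job like this.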
So, what is the best way to store and retrieve high-frequency minute or tick data for backtesting and analytics? Through a flat file with C++, or through a db? Which way is faster?
kdb's evaluation version has a 100MB storage limit and a 1,000-message limit. If you want to buy it, prepare to spend about 100k for a base install. They do, however, make versions available to universities. If you want to go cheaper, try Oracle and its Time Series analysis package, which shows about the same performance.