Why use a database?

Discussion in 'Data Sets and Feeds' started by onelot, Oct 9, 2004.

  1. onelot

    onelot

    There's been some discussion on databases lately and I'd like to pose the question of why even use a database? From the research I've done it looks like a lot of traders seem to prefer binary data files for storing price/quote information. Apparently binary data is much faster to read and write to.

    For instance, if my purpose was to have an efficient means of storing gigs of tickdata and this data was the backbone to my bactesting engine, whereby strategies were run off of different timeperiods etc., why would I need a database? Why not call and hold the pertinent data in memory? From my research, dbs excel at query based searches... unless your backtesting analysis needs this type of functionality, again why would use a database?

    Background: I'm currently in development of a backtesting/trading engine and have been looking at the different ways large amounts of data can be stored, I have no experience with either method of storage. Here's a great discussion by experienced mechanical traders on the merits of binary data and the pitfalls of dbs. It's where I based most of my questions from and it's definitely worth a read:

    http://www.turtletradingsoftware.com/forum/viewtopic.php?t=980

    http://www.turtletradingsoftware.com/forum/viewtopic.php?t=791

    Would love to hear your guys' thoughts on this. It would be great to hear examples... I've been reading a lot of the theory and it's a bit difficult to base decisions upon. Concrete examples of how one has used different methods and their comparisions, always seems to triumph over theory.

    Good trading.
    onelot
     
  2. Depends somewhat on your specific application and needs, but in general I'd have to say that using a DB to store price data is not inherently performance detrimental.

    One instance I can think of off the top of my head where stream binary files could be better is if your implementation requires routinely having to reload large amounts of data (i.e., tens of thousands of records) for a specific security at a time for some kind of processing, then move on and do the same thing for the next security, etc. - but there are so many analytic applications that don't involve that, that not using a DB is probably more the exception than the rule.

    In my experience, the typical reason for performance problems when using a DB (and this extends to most applications, not just trading/analysis) is poor database design (in some cases the logical design sucks, in other times the physical design is poor) and/or poor design of the access/processing.

    Considering that most datafeeds store their large datasets in some form of DB and that the realtime systems at the major exchanges all use DBs (I'm specifically familiar with NYSE, NYMEX, and CME - but would guess the other exchanges do similarly) and I'm also very familiar with numerous other realtime/high volume systems that all use integral DBs - it can be generalized that well designed and implemented systems can use DBs without performance impact.
     
  3. linuxtrader

    linuxtrader Guest

    Database systems are optimized for set based operations: that is you need to get a set of data according to some criteria and then re-arrange it, or perform some simple calculations on it. They can handle numerous fairly complicated operations on large sets of data with very good performance - provided that the database and queries are properly designed.

    Also, modern database systems are desiged to allow for ensuring data integrity: that is they allow the application of sets of rules to help ensure that the data is not modified in any undesirable way.

    Also, moden database systems have mechanisms to allow for backup and most importantly recovery to a point in time. Thatis you can roll back changes in a database to a particular point in time.

    So, if you need many people (or even just a few people ) asking set based questions on larger amounts of data, and the integrity of the data is most important, then something like mySQL , SQl server, Oracle, Sybase, or DB2 would be desirable.

    You really just need to compute the costs to decide if a flat file or a relational database system is of value to your project: my guess is that it is.
     
  4. kc11415

    kc11415

    I've run many gigs of ticker data within big multi-cpu Unix servers with many gigs of RAM and hundreds of gigs of fiber-channel disks organized as RAID arrays. I have also used awk & Perl on a laptop against gigs of tick data stored in simple text files.

    Depending upon the question to be answered, either one could deliver a result faster. Faster sometimes means run-time performance of the h/w & s/w, and sometimes it means the amount of time it takes the user to code up a query.

    When most people say "database s/w" they usually mean "relational database." Relational databases are optimized for OLTP (OnLine Transaction Processing), with a lot of small read-write activity. Only in more recent years has read-mostly data-mining, data-warehousing & OLAP functions been grafted onto a foundation which is still inherently biased towards OLTP. And, these issues are separate from which platform is easiest for expressing the question you want answered.

    For working on individual series of tick data, plain files read by a script or custom program are probably fastest. For studying relations between a small number data series, something like SAS might be useful. For cross-tabulations between dozens or hundreds of data series, a relational database may be easier to write up a query.

    And these generalizations may not apply to your specific circumstances.
     
  5. linuxtrader

    linuxtrader Guest

    Exactly. If you need to ask then you need to hire someone to help you ...... unless you have no money to pay for help in which case you have a lot of work ahead of you .....
     
  6. I've got an eclectic background. Started programming in high school over 20 years ago writing futures trading systems. Had a bit of fame in my early twenties as a trader and then left for 15 years to start a few software companies.

    One of them sold an embedded database which was the number one product on the Macintosh. I worked on the internals, disk access, etc. as well as the query optimization.

    There is a huge difference in read time between a database and a binary file unless the database has been specifically optimized for large binary data storage (known as BLOBs in the business).

    The reason is simple, even in a database with an efficient caching mechanism large data sets generally involve multiple reads from the disk because the data is split up into chuncks. Every separate read will take a while because on average it will require 1/2 of a rotation of the disk before the data comes under the read heads so the read can start.

    Unlike almost every other aspect of computing, disk speeds have not followed Moore's Law. Disks are maybe 30 to 100 times faster than they were 20 years ago while computers are 10,000 times faster.

    Even a 10,000 RPM disk takes 6 milliseconds to rotate. So you only get 167 rotations per second. That's a lot of time when computers are doing billions of instructions per second.

    For tick data analysis the speed of reading the data is the determining factor for the speed of testing unless you have very inefficient code or are doing esoteric analysis.

    So I suggest storing information about your data in a database but storing the physical data on the disk in raw binary files.

    You can get acceptable performance from a database if you know what you are doing, however, you will always pay a performance penalty.

    - Curtis
     
  7. linuxtrader

    linuxtrader Guest

    Like I said, if you need to ask then you need help.

    The question here is one of proper application design. These little snippets of posting advice wont help .... The toughest part of any application is figuring out the requirements and the most cost effective solution. Millisecond response time may be of no consequence to this application - or it could be everything.... and the flow of data is unknown. No way to know here and nobody is going to devote too much time here to build an application design for free: my advice is to hire someone to help if this is a real project in need of solution.
     
  8. kewl thread...

    keep in mind google doesn't use disks for web queries...just banks of RAM...

    and they run python...not even C or assembly...

    kinda puts things in perspective...
     
  9. linuxtrader

    linuxtrader Guest

    For some applications I hold everything in an in-memory structure. For others I am happy holding most everything in a DB with some work offloaded and some data held in memory - but not all.

    For other applications it is best to keep everything in a relational database. It just depends upon what the usage priorities are and what your cost target is ......

    ..and we use python too for soome applications with a lot of data held in memory.

    Its up to you - or a qialified professional to make these engineering decisions (both technical and financial)
     
  10. kc11415

    kc11415

    inflector > Even a 10,000 RPM disk takes 6 milliseconds to rotate. So you only get 167 rotations per second. That's a lot of time when computers are doing billions of instructions per second.

    This is very big factor in random access to a block here and a block there, such as in OLTP. While never irrelevant, it is less of a factor for long sequential reads, particularly when using a RAID stripe + mirror set in which the read could be satisfied from any of a number of mirrors. In such cases, the bottleneck becomes the PCI bus, which operates substantially slower than the memory bus. Then, a system with multiple PCI busses and disk controllers spread across them can get you running a bit faster.

    If this sounds expensive and also sounds like a lot of work, it is. With memory as cheap as it is these days, use a s/w driver to emulate a filesystem in memory. Then you only have to worry about persistence of any data changes.

    But, are you always going to keep your files in RAM? What about daily updates and data cleansing? Are you going to copy those updates elsewhere also? What happens when you get a reboot? How long will it take to reestablish all your data in memory from your persistent copies on disk? Will this time delay be relevant in the middle of a trading day?
     
    #10     Oct 13, 2004