Appropriate design pattern for recording instant tick data in memory

Discussion in 'Automated Trading' started by rohan2008, Dec 11, 2014.

  1. rohan2008

    Hi everyone,

    I am trying to design a component that ingests futures tick data from the Rithmic feed/RApi [C++11, Linux]. I want to capture tick data for approximately 40 contracts live (of course, half of them are low-liquidity forward contracts anyway). I have a Rithmic callback routine that returns tick data, and now I am trying to figure out the best data structure I can use to quickly retrieve and fill in the appropriate contract object. I would like this process to be as optimized as possible.

    One simple way is to use std::map<contract name, object> or std::unordered_map... but I always tend to be a little suspicious about using STL containers in critical paths. The other idea is to hand-roll a balanced binary/red-black tree... which can be done. Before I pick a path, I was wondering how any of you have addressed this issue and what your experiences/insights are.

    Although my strategies don't need low latency... it's good to design this so that I don't have to revisit this area in the future in case I develop low-latency strategies (an absolute NO NO to HFT strategies!)... and I might even co-locate this system. Any advice is highly appreciated.

    thanks
     
    Last edited: Dec 11, 2014
  2. vicirek

    The STL was created with performance guarantees and is carefully crafted for industrial-strength applications. It is difficult to beat STL implementations. I do not know where your suspicion of the STL comes from. I do not know of any other library in any language that would be better than the STL. Please note that implementations may vary between vendors, but they have to meet the minimum guarantees set by the standard.

    std::map is usually implemented as a binary tree, but it is worth checking your STL implementation's documentation, since some vendors do it differently and it may provide more information.

    There are some situations where doing something by hand instead of using the STL is beneficial, but in this case it is a waste of time.

    It is difficult to advise on the choice of container without knowing how the data will be processed and searched. Tick data is best stored in a dynamic container like std::vector, because ticks are usually processed sequentially; you can have a map of vectors of tick objects keyed by contract name.

    I am not sure you really need a map in this case. You know your contracts, so you can directly index into a vector containing another vector with the ticks and process them sequentially.
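
    A minimal sketch of this direct-indexing idea, assuming contract IDs are handed out once at startup (Tick, TickStore, and the reserve size are illustrative placeholders, not anything from RApi):

    Code:
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>
    
    struct Tick {
        double price;
        int    size;
        long   exchTimeNs;
    };
    
    class TickStore {
    public:
        // Register a contract once at startup; returns its dense index.
        std::size_t addContract(const std::string& name) {
            const std::size_t id = ticksByContract_.size();
            idByName_[name] = id;
            ticksByContract_.emplace_back();
            ticksByContract_.back().reserve(1 << 20); // pre-size: no reallocation in the hot path
            return id;
        }
    
        // Hot path: append by dense index, no lookup, no (amortized) allocation.
        void append(std::size_t id, const Tick& t) {
            ticksByContract_[id].push_back(t);
        }
    
        // Cold path: name lookup for setup/reporting code only.
        std::size_t idOf(const std::string& name) const {
            return idByName_.at(name);
        }
    
    private:
        std::unordered_map<std::string, std::size_t> idByName_;
        std::vector<std::vector<Tick>>               ticksByContract_;
    };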

    Problems with STL performance usually come from a bad choice of STL container for a task where searching and deletion/insertion operations are required.
     
  3. hft_boy

    The main thing is to keep the keys small and easy to compare. If you can fit them in 8 bytes each, you can store an array of key/pointer-to-value pairs at 16 bytes each, under 1 KB total. If it's hot, it will usually be in the cache, and you can search it at a rate of at least one compare per cycle, so in the worst case you'll find the key in about 40 compares, or under 10 ns on a typical Xeon core (because of pipelining and branch prediction I suspect it will actually be much less, but I am being conservative). Then you have to dereference the pointer, costing another hit to main memory (usually under 100 ns), or to L2 cache if you keep your data structure small. Is that fast enough for you?

    The main bottleneck here is the pointer dereference, not the key search, assuming you keep your keys small and easy to compare (the number of instructions matters!). If you use a hash table and there are any collisions, that may cost you. I am guessing, though, that it will be roughly the same speed.
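
    Just to make the layout concrete, a sketch of that flat table (packKey, ContractObject, and the names are illustrative, not from any particular API):

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    
    struct ContractObject;  // opaque; defined elsewhere
    
    struct Entry {
        std::uint64_t   key;    // contract symbol packed into 8 bytes
        ContractObject* value;
    };  // 16 bytes per entry; 40 entries = 640 bytes, well under 1 KB
    
    // Pack up to 8 chars of a symbol into one integer so a key compare
    // is a single 64-bit compare instead of a strcmp loop.
    inline std::uint64_t packKey(const char* symbol) {
        std::uint64_t k = 0;
        std::strncpy(reinterpret_cast<char*>(&k), symbol, sizeof(k));
        return k;
    }
    
    // Linear scan: one integer compare per entry, branch-predictor
    // friendly, and the whole table fits in a few cache lines.
    inline ContractObject* find(const Entry* table, std::size_t n,
                                std::uint64_t key) {
        for (std::size_t i = 0; i < n; ++i)
            if (table[i].key == key)
                return table[i].value;
        return nullptr;
    }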
     
  4. rohan2008

    No, don't get me wrong; the STL is a great library, no question about that. I personally use STL libraries (Poco, actually) and other C++11 features pretty extensively in the trading system that I have developed. Having said that, here is what I have personally observed over the years:

    The architecture of any high-performance system can be divided into critical paths (the 5% of the code that executes 95% of the time) and non-critical paths (the remaining 95% of the code, where execution happens 5% of the time). As I see it, the STL is awesome for the non-critical paths, and I use it extensively there. However, I have seen direct performance degradation when the STL is used in the critical code path. I can point you to real-life examples where this happened here in Silicon Valley. Quite often, companies developing user-space device drivers in C++ hire Linux kernel device driver developers, ask them to learn C++, and have them write code in the critical paths. Everyone who learns C++ gets fascinated with containers at first [including myself :)], uses containers in the critical code, finds out that the code needs to run faster... experiments with replacing the containers, and then discovers that traditional C data structures dramatically improve performance.

    To give you some numbers: I have used g++ 4.7 with the standard C library on Linux 3.11 (I guess), and I found that new takes 3 times more clock cycles than malloc(); we confirmed this by running perf tools over tcmalloc. This has a direct impact on the way we structure our code in the critical path. Too many object allocations bring down the performance. Another example I can give is std::stringstream (stay away from it in critical paths). Vectors are slow compared to linked lists; I have had first-hand experience with this. Variadic templates as function arguments are fast, but when used in class constructors they literally drag down the performance. Class instantiations take time as well. One example I can quote from my professional life: we once had to design a file system checker/fixer, and a loop laden with C++ container work took 64 seconds to iterate 750 x 7 million times in an -O3 build. Over a few weeks, we replaced all the containers and class instantiations in that while loop with hand-optimized code and were able to reduce this to 12 seconds! I haven't taken this issue up with Bjarne, because finishing the trading system and my professional assignments are a higher priority for me at this stage in my life. I know... to prove my assertions I would have to spend quite a bit of time gathering numbers; unfortunately, I can't afford to spend time on all this...

    Either way, I always advocate hand-optimizing all the code that falls in the critical path. Since, as implied in my earlier post, all of the US futures market transactions I capture go through this component that I am designing, this is definitely one of those critical-path areas, and so I wanted to know how others approach it.
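
    To illustrate the kind of hand optimization I mean: pre-allocate all the tick objects once and recycle them through a free list, so the hot path never calls new or malloc(). This is only a sketch; Tick, TickPool, and the sizes are placeholders:

    Code:
    #include <cstddef>
    #include <vector>
    
    struct Tick { double price; int size; Tick* next; };
    
    class TickPool {
    public:
        explicit TickPool(std::size_t n) : slab_(n) {  // n must be >= 1
            for (std::size_t i = 0; i + 1 < n; ++i)
                slab_[i].next = &slab_[i + 1];         // thread the free list
            slab_[n - 1].next = nullptr;
            head_ = &slab_[0];
        }
        Tick* acquire() {              // O(1), no allocation on the hot path
            Tick* t = head_;
            if (t) head_ = t->next;
            return t;                  // nullptr when the pool is exhausted
        }
        void release(Tick* t) {        // O(1), no deallocation
            t->next = head_;
            head_ = t;
        }
    private:
        std::vector<Tick> slab_;       // one up-front allocation
        Tick*             head_;       // top of the free list
    };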
     
    Last edited: Dec 13, 2014
    eusdaiki and Occam like this.
  5. rohan2008

    This is interesting; I totally got blindsided here. Two questions:

    1. Since each cache line is 64 bytes, I was wondering if padding the table with some empty bytes would ensure that the array starts at the beginning of a cache line and ends in a cache line that holds no other global variables... as I see it, updating any stray global variable that falls within the same line as the ContractTable (below) can invalidate a cache line in which part of the table sits... and this can generate a cache miss.

    Code:
    typedef struct entry {
        char contract[8];          /* symbol packed into 8 bytes */
        void *contractObject;
    } ContractEntry;
    
    typedef struct table {
        char padFront[64];         /* keep preceding globals off our lines */
        ContractEntry entries[40];
        char padBack[64];          /* keep following globals off our lines */
    } ContractTable;
    2. If I want to set up a spin lock on the ContractTable, I can't just put a bool in the ContractTable structure and use GCC intrinsics, since they can invalidate the cache line...? How would you recommend doing spin locks here, if I were to need one?

    Any suggestions are highly appreciated...
    thanks,
     
    Last edited: Dec 13, 2014
  6. hft_boy

    The question I have for you is: why are you worrying about which container to use when you are using threads? Thread-to-thread communication latency is going to be at least an order of magnitude slower than this particular lookup. My suggestion is to just not use threads; then you won't have to worry about invalidating the line you're working on, and can worry more about how hot the line is going to be (i.e., not thrashing the cache)... but that is just my opinion.
     
    eusdaiki likes this.
  7. onelot

    "I haven’t taken this issue up with Bjarne..."

    your arrogance is the only thing stopping you from truly learning.
     
  8. rohan2008

    :) ... re-reading my sentence... yes, it sounded a little arrogant... sincere apologies; "Bjarne" was a poor choice of words. What I meant is that I really haven't made the effort to contribute what I (along with my colleagues at my company) observed back to the open source community, to whom I personally owe a lot.
     
    Last edited: Dec 13, 2014
  9. rohan2008

    Yeah, you are right; threading libraries have a lot of overhead. Concurrency can also be achieved by having two processes running on two different cores share the same memory through mmap... similar to Oracle's latch design. I had this design in mind when I typed the question... Even in that case, cache thrashing seems to be an issue for test-and-set type instructions (ref below)... yes, I guess I should stay away from concurrency after all to exploit the cache optimally... Thanks for the advice...

    Fyi: https://andreynikolaev.wordpress.com/2009/10/04/general-spin-lock/
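
    For reference, a minimal sketch of that mmap idea: a region created with MAP_SHARED | MAP_ANONYMOUS is visible to both the parent and a forked child, so two processes pinned to different cores can exchange data through it (Shared and the values are placeholders; error handling omitted):

    Code:
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    
    struct Shared {
        volatile long seq;     // publish counter
        double        price;   // last tick published
    };
    
    int main() {
        void* mem = mmap(nullptr, sizeof(Shared), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        Shared* s = static_cast<Shared*>(mem);
        s->seq = 0;
    
        if (fork() == 0) {              // child: consumer
            while (s->seq == 0) { }     // spin until the parent publishes
            std::printf("child saw price %.2f\n", s->price);
            return 0;
        }
        s->price = 101.25;              // parent: producer
        __sync_synchronize();           // full barrier: data before flag
        s->seq = 1;
        wait(nullptr);                  // reap the child
        return 0;
    }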
     
  10. hft_boy

    Yep, no matter which OS/runtime abstraction you use (threads, processes, etc.), you can't escape the fact that you are ultimately using hardware synchronization primitives, which do all sorts of nasty things to your cache. I think if you use spinlocks to bypass the scheduler *and you are extremely careful* you can get hundreds of nanos instead of micros for inter-core communication. But it is more difficult to reason about the determinism, and it will take 10x longer to write.
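
    If you do go down that road, the usual shape is something like this (a sketch only, x86/GCC assumed; the lock word gets its own cache line so contention on it never invalidates lines holding real data):

    Code:
    #include <immintrin.h>   // _mm_pause (x86)
    
    struct alignas(64) SpinLock {      // own cache line: no false sharing
        volatile int locked = 0;
    
        void lock() {
            // Test-and-test-and-set: spin on a plain read so the line
            // stays shared; only attempt the atomic op when it looks free.
            while (__sync_lock_test_and_set(&locked, 1)) {
                while (locked)
                    _mm_pause();       // be polite to the sibling hyperthread
            }
        }
    
        void unlock() {
            __sync_lock_release(&locked);  // release barrier + store 0
        }
    };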

    Honestly, the whole thing smells. The more interesting question is: what latencies and throughput do you actually need? Architect around that. IMO there is not really any point in saying "let's get this as fast as it can go because it needs to be really fast in the future"; rather, say "we need XYZ performance now, how do we get there?" Each level of performance needs a special design anyway, so you are going to save yourself a lot of time and headaches by not solving problems that don't need to be solved.

    Sorry to go off on a rant and second guess your decision for you, just offering my opinion.
     
    #10     Dec 14, 2014
    rohan2008 likes this.