Introducing STARK: Our In-House Solution
To address these critical issues, we embarked on building our in-house risk management system, STARK. Our design principles focused on overcoming the vendor's shortcomings:
Data Consistency through Snapshots: Our new design prioritized data consistency as a foundational architectural principle.
Instead of generating individual measures, STARK generates atomic snapshots. For every portfolio event (account balance update, futures trade, etc.), STARK calculates all the immediate and cascading risk measures that result from that event, bundles them into a single, internally consistent output message, and publishes it. Every snapshot message also includes references to its input data, ensuring complete traceability: we know exactly which combination of inputs contributed to every risk measure output.
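As a rough illustration of the snapshot idea, the Python sketch below applies one event, recomputes every measure derived from it, and returns them as one immutable message that also records which input offsets it was built from. The class names, the three measures, the `MARGIN_RATE` constant, and the offset-based input references are all hypothetical; they are assumptions for illustration and are not taken from STARK itself.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

MARGIN_RATE = 0.25  # hypothetical margin requirement per unit of exposure


@dataclass(frozen=True)
class PortfolioEvent:
    offset: int    # position of the event in the input stream
    account: str
    kind: str      # "balance_update" or "futures_trade"
    amount: float  # new balance, or signed change in exposure


@dataclass(frozen=True)
class RiskSnapshot:
    account: str
    measures: Dict[str, float]      # all measures computed from the same state
    input_offsets: Tuple[int, ...]  # inputs that produced these measures


class SnapshotEngine:
    def __init__(self) -> None:
        self._balance: Dict[str, float] = {}
        self._exposure: Dict[str, float] = {}
        self._latest_input: Dict[Tuple[str, str], int] = {}

    def on_event(self, event: PortfolioEvent) -> RiskSnapshot:
        # 1. Apply the event to the in-memory state.
        if event.kind == "balance_update":
            self._balance[event.account] = event.amount
        elif event.kind == "futures_trade":
            current = self._exposure.get(event.account, 0.0)
            self._exposure[event.account] = current + event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
        self._latest_input[(event.account, event.kind)] = event.offset

        # 2. Recompute the immediate and derived measures together,
        #    so they all reflect the same state.
        balance = self._balance.get(event.account, 0.0)
        exposure = self._exposure.get(event.account, 0.0)
        measures = {
            "balance": balance,
            "exposure": exposure,
            "margin_headroom": balance - MARGIN_RATE * abs(exposure),
        }

        # 3. Attach references to the inputs behind this snapshot.
        offsets = tuple(sorted(
            offset for (account, _), offset in self._latest_input.items()
            if account == event.account
        ))
        return RiskSnapshot(event.account, measures, offsets)
```

A consumer of such a message never has to join separately published measures, because everything it receives for an account was computed from the same inputs.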
Low Latency and High Throughput: By hosting all infrastructure on-premises, we minimized network latency.
To achieve minimal latency and high throughput, we made a conscious decision to trade distribution for speed. A distributed architecture offers greater scalability, but it also introduces communication latency between components. We chose to consolidate most of the computationally heavy processing into a single, powerful risk calculation engine.
This centralized design was not a guess. Before implementation, we ran a proof-of-concept load test to verify that a centralized design could support our current and projected future load, which allowed us to move forward with confidence. The design itself is proven to scale horizontally.
The Result: Average data staleness plummeted from more than 10 seconds to 10 milliseconds.
Rapid and Deterministic Recoverability: In a 24x7 trading environment, recovery time is a measure of operational resilience.
Because all input and output data flows through a Kafka event stream, STARK processes events in a deterministic sequence. When a crash occurs, the system's state can be restored quickly and reliably by loading the most recent periodic state snapshot (leveraging our Prime Brokerage's framework) and then replaying only the events that occurred since that snapshot.
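The recovery pattern described here, restoring a saved state and replaying only the later events, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a plain list stands in for the ordered event stream, offsets are contiguous and start at 0, and the `Engine`, `Checkpoint`, and `recover` names are hypothetical. It does not use the Kafka client API or the Prime Brokerage framework mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Event = Tuple[int, str, float]  # (offset, account, balance_delta)


@dataclass
class Checkpoint:
    next_offset: int            # first offset NOT yet reflected in the state
    balances: Dict[str, float]  # copy of the engine state at that point


class Engine:
    def __init__(self, checkpoint: Optional[Checkpoint] = None) -> None:
        self.balances = dict(checkpoint.balances) if checkpoint else {}
        self.next_offset = checkpoint.next_offset if checkpoint else 0

    def apply(self, event: Event) -> None:
        offset, account, delta = event
        # Refuse gaps or reordering: determinism depends on strict sequence.
        if offset != self.next_offset:
            raise ValueError(f"expected offset {self.next_offset}, got {offset}")
        self.balances[account] = self.balances.get(account, 0.0) + delta
        self.next_offset = offset + 1

    def checkpoint(self) -> Checkpoint:
        return Checkpoint(self.next_offset, dict(self.balances))


def recover(checkpoint: Optional[Checkpoint], log: List[Event]) -> Engine:
    """Rebuild an engine from a checkpoint plus the events recorded after it."""
    engine = Engine(checkpoint)
    for event in log[engine.next_offset:]:  # replay only the tail of the log
        engine.apply(event)
    return engine
```

Because every event is applied in the same order, recovering from a checkpoint and recovering from an empty state by replaying the whole log arrive at the same balances. The checkpoint only shortens the replay.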
Furthermore, the service runs an active/passive redundancy model, ensuring at least one passive instance maintains the exact same state as the active one at all times. If the active instance unexpectedly fails, the passive one takes over instantaneously.
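One way to picture the active/passive model is a pair of replicas that both apply every event, with only the active one publishing results. The Python sketch below is an assumption-laden illustration of that idea rather than STARK's actual failover mechanism: `Replica`, `ActivePassivePair`, and the in-memory `published` list are hypothetical, and failure detection is left out entirely.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, float]  # (account, balance_delta)


class Replica:
    def __init__(self, name: str, is_active: bool) -> None:
        self.name = name
        self.is_active = is_active
        self.state: Dict[str, float] = {}
        self.applied = 0  # number of events applied so far
        self.published: List[Tuple[int, str, float]] = []

    def apply(self, event: Event) -> None:
        account, delta = event
        self.state[account] = self.state.get(account, 0.0) + delta
        self.applied += 1
        # Both replicas compute; only the active one publishes output.
        if self.is_active:
            self.published.append((self.applied, account, self.state[account]))


class ActivePassivePair:
    def __init__(self) -> None:
        self.active = Replica("primary", is_active=True)
        self.standby: Optional[Replica] = Replica("standby", is_active=False)

    def dispatch(self, event: Event) -> None:
        # Every live replica sees the same events in the same order.
        for replica in (self.active, self.standby):
            if replica is not None:
                replica.apply(event)

    def fail_over(self) -> None:
        # Promote the standby; its state already matches, so nothing is rebuilt.
        if self.standby is None:
            raise RuntimeError("no standby replica available")
        self.standby.is_active = True
        self.active, self.standby = self.standby, None
```

In this sketch, promotion is just a flag flip because the standby has already applied every event, which is why takeover does not need a replay step.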
The Result: The recovery process now takes a matter of seconds, not hours.