Monday, May 31, 2021

Why are enterprise systems so terrible?

Large enterprise IT systems have a bad reputation for being complex, costly, slow, opaque and inflexible. In my 15-year career in investment banks, I have yet to come across anyone who is happy with their own company’s IT infrastructure. The problem is not confined to banking; it seems to exist across all industries. Why is that?

First, we need to understand what an enterprise system is. Not all software used in an organization qualifies as an enterprise system; Excel, for example, is not one. An enterprise system usually shares (most of) the following characteristics:

  • Distributed, designed to handle large volumes of data and/or computation.

  • Mission critical, with low tolerance for service disruptions.

  • Solves enterprise-wide problems, serving multiple teams and departments.

  • Changes are made constantly in response to evolving business requirements.

  • Often regulated, must follow certain industry and government standards.

A bank’s trading and risk management system is a good example of an enterprise system: it features all of the attributes above. These characteristics dictate that a good enterprise system should be scalable, reliable, transparent and modifiable.

Let me point out an important fact that is often taken for granted: the runtime of any general-purpose programming language today is essentially a stack-driven state machine (SDSM). What do I mean by that? The source code of every modern programming language is structured as a collection of functions. The call sequence of these functions at run time is determined by call stacks, usually one per running thread. Call stacks store the essential information for determining the correct call sequence of functions at run time. When a function is called, its arguments, return address and local variables are pushed onto, or allocated at, the top of the stack; when the function exits, its state is popped off the top of the stack and control returns to the calling function, whose state becomes the new stack top. The shared state of the program is kept in the heap, which can be accessed and modified from any function as long as it holds a reference (i.e., a pointer) to the right address in the heap.
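
To make these mechanics concrete, here is a minimal Python sketch (the function names and values are purely illustrative) showing that each active call gets its own stack frame, while heap state reachable through a shared reference survives after those frames are popped:

    import inspect

    shared_state = {}          # lives on the heap; reachable from any function holding a reference

    def inner(x):
        # Snapshot the call stack: one frame per active call, innermost first.
        frames = [f.function for f in inspect.stack()]
        shared_state["last_input"] = x     # mutate heap state through a reference
        return x * 2, frames

    def outer(x):
        return inner(x + 1)    # outer's frame sits below inner's until inner returns

    result, frames = outer(10)
    print(result)              # 22
    print(frames)              # ['inner', 'outer', '<module>']
    print(shared_state)        # {'last_input': 11} -- still here after both frames are gone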

The stack-driven state machine is behind the execution of almost every general-purpose programming language because the stack is the most efficient data structure for managing the call sequence of functions. However, the SDSM runtime also has some notable limitations:

  • The call stacks and heap are created and updated dynamically at run time; their state is highly path-dependent and unpredictable, since different execution paths may be taken depending on the input data. The SDSM itself does not maintain any data structure to help predict a program’s future runtime behavior.

  • The SDSM runtime does not keep track of data dependency or lineage, nor are intermediate results persisted. Once data are popped from the stack or de-allocated from the heap, they are no longer accessible. Any caching, lineage tracking or reporting of intermediate results can only be implemented as part of the application logic.

  • The stacks and heap of an SDSM must live in the same memory address space, which means the SDSM only has information about a single running process and offers no support for multi-process execution and communication. As a result, any inter-process communication and coordination can only be implemented at the application level, as the sketch after this list illustrates.
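
As a rough illustration of that last point, here is a minimal Python sketch (using the standard multiprocessing module; the worker and queue names are arbitrary) in which all inter-process communication has to be spelled out by the application itself, because the runtime of either process knows nothing about the other:

    from multiprocessing import Process, Queue

    def worker(inbox: Queue, outbox: Queue):
        task = inbox.get()         # explicit, application-level communication
        outbox.put(task * task)

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        inbox.put(7)               # nothing links this to the worker except our own code
        print(outbox.get())        # 49
        p.join()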

A direct consequence of these limitations is that the SDSM cannot support predictive optimizations. By predictive optimization, I mean an optimization that only pays off under certain specific runtime outcomes, not under all of them; some knowledge or prediction of future runtime behavior is therefore required to determine whether it should be applied. The following are some examples of predictive optimizations:

  1. The result of a function call is cached for certain input parameters, knowing that the same function will be called again with the same inputs (see the sketch after this list).

  2. When an input changes, only the affected parts of the calculation are recomputed instead of rerunning the whole thing, knowing the full dependency on that input.

  3. Memory is pre-allocated for better performance, knowing the program’s future memory consumption.
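
A minimal Python sketch of the first two examples follows; the pricing function, the dependency map and the recompute hook are hypothetical placeholders. The point is that both the cache and the dependency graph have to be built and maintained by the developer, since the runtime records neither:

    from functools import lru_cache

    # 1. Caching: the developer decides up front that this function will be called
    #    again with the same inputs often enough to justify keeping the results.
    @lru_cache(maxsize=None)
    def price(instrument_id: str, curve_version: int) -> float:
        # ... expensive pricing logic (placeholder) ...
        return 42.0

    # 2. Incremental recompute: the dependency graph is hand-maintained; the runtime
    #    keeps no lineage, so the application declares which outputs depend on which
    #    inputs and reruns only the affected ones when an input changes.
    DEPENDENTS = {"usd_curve": ["swap_book", "bond_book"]}    # hypothetical mapping

    def on_input_change(input_name, recompute):
        for output in DEPENDENTS.get(input_name, []):
            recompute(output)      # rerun only what depends on the changed input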

None of these optimizations are possible within the SDSM runtime, as it does not track any data that could predict future runtime behavior. As a result, predictive optimizations can only be implemented manually at the application level by developers.

When writing single-process applications, developers can often reason about and predict runtime behavior to a limited extent based on experience, and then implement the corresponding predictive optimizations, such as caching or memory optimizations. However, as an application grows in complexity, its overall runtime behavior often becomes too difficult to predict from experience alone. Instead, developers have to rely on tools such as profilers and benchmarks to gather runtime data on the overall application to make better predictions and optimizations.
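
For a single process this usually means something as simple as profiling a representative run; here is a minimal sketch with Python’s built-in cProfile (the workload function is just a stand-in):

    import cProfile
    import pstats

    def hot_path():
        return sum(i * i for i in range(1_000_000))

    # Gather actual runtime data instead of guessing, then rank functions by
    # cumulative time to see where a predictive optimization might pay off.
    cProfile.run("hot_path()", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)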

Predictive optimization is critical for the success of any enterprise system. Compared to single-process applications, the stakes are much higher in an enterprise system, and there are many more design and implementation choices to make, such as system topology, communication protocols, caching, distribution, load balancing and redundancy strategies. Each of these choices could lead to very different trade-offs in a system’s cost, reliability, performance and scalability. It is therefore extremely important to carefully optimize the enterprise system for its expected runtime needs. A bad design or implementation choice can easily make an enterprise system unfit for its purpose, and such mistakes can be extremely expensive to fix once the system is built.

Predictive optimization of an enterprise system is orders of magnitude more difficult than that of a single-process application, because:

  • The runtime behavior of an enterprise application is much more difficult to predict, as it often involves many processes running on many computers. Benchmarking and profiling tools become unreliable because there are too many uncontrollable variables that can affect runtime behavior, such as synchronization, network latency, workload, database response times and third-party services.

  • Changing business needs have to be considered: the enterprise system has to be optimized not only for a company’s current needs, but also for its expected future needs. We all know how difficult it is to predict the future.

  • A firm usually runs multiple distributed applications, each of which could be developed and supported by a different team, department or vendor, with different priorities and objectives. These distributed applications are often built using components and services shared across the firm, such as databases, workflows and computation engines.

Predictive optimization for enterprise systems is therefore an extremely complex, high-dimensional optimization over many interdependent components, services and applications, with multiple objectives and constraints, and ambiguous future business needs. The objectives and constraints can be either technical or business-related; the latter may include resources, budget and project timelines.

Such a complex predictive optimization problem is challenging enough in itself; what makes it truly terrible is that it has to be done manually at the application level, without reliable runtime information, due to the limitations imposed by the SDSM runtime. In practice, manual predictive optimization of distributed applications manifests as various work streams and activities among teams of people in an organization, such as architecture and system design, implementation, project planning and management, performance monitoring and tuning, as well as support and maintenance. These seemingly unrelated work streams are often just different aspects of the manual predictive optimization of the enterprise systems.

Despite the monumental efforts involved, manual predictive optimization rarely results in an optimal enterprise system; oftentimes it doesn’t even produce a reliable one, mainly due to the inability to accurately predict the runtime behavior and future needs of a distributed enterprise system. After the system is built, making material changes is also difficult and time-consuming, as the manual predictive optimization has to be repeated for those changes. This explains why most enterprise systems today are complex, expensive, opaque, difficult to modify, and poor in performance, scalability and reliability.

By now, it might surprise you that the unavoidable logical conclusion is that today’s programming languages are responsible, at least in part, for our terrible enterprise systems. To recap the entire chain of logic: the SDSM runtime behind today’s languages keeps no information for predicting future runtime behavior; predictive optimization therefore has to be done manually at the application level; for distributed enterprise systems this manual optimization is extremely difficult, expensive and error-prone; hence the complex, costly, opaque and inflexible systems we live with.