Mission-Critical Resiliency and Scalability
Resiliency is more than surviving a limited set of common failures. True closed-loop automation requires a system that can survive repeated hardware, system, and software failures, and that also handles disruptions to its input data robustly, including data loss, data corruption, and cyberattacks. A resilient system must not assume that any service or operation succeeded; it must verify each transaction. When a failure is detected, the system must be able to recover from the unexpected or corrupt state that results.
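As an illustration of the verify-each-transaction principle, the following is a minimal sketch and not a description of any production system. It assumes a storage service exposed through two hypothetical callables, write and read_back, and uses a SHA-256 checksum to confirm that what was stored matches what was sent. The names write_verified, VerificationError, and FlakyStore are invented for this example.

```python
import hashlib
import time
from typing import Callable


class VerificationError(Exception):
    """Raised when a write cannot be verified after all attempts."""


def checksum(payload: bytes) -> str:
    """Return the SHA-256 hex digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def write_verified(
    write: Callable[[bytes], None],
    read_back: Callable[[], bytes],
    payload: bytes,
    attempts: int = 3,
    backoff_s: float = 0.0,
) -> int:
    """Write the payload, read it back, and retry until the checksums match.

    Returns the number of attempts used. The write and read_back callables
    are assumptions standing in for a real storage or service client.
    """
    expected = checksum(payload)
    for attempt in range(1, attempts + 1):
        try:
            write(payload)
            # Never assume success: confirm the stored bytes are intact.
            if checksum(read_back()) == expected:
                return attempt
        except OSError:
            pass  # treat an I/O error the same as a failed verification
        time.sleep(backoff_s * attempt)  # simple linear backoff before retrying
    raise VerificationError(f"write unverified after {attempts} attempts")


class FlakyStore:
    """Hypothetical in-memory store that corrupts its first write."""

    def __init__(self) -> None:
        self.data = b""
        self.calls = 0

    def write(self, payload: bytes) -> None:
        self.calls += 1
        # Simulate corruption by dropping the last byte on the first call.
        self.data = payload[:-1] if self.calls == 1 else payload

    def read(self) -> bytes:
        return self.data


store = FlakyStore()
used = write_verified(store.write, store.read, b"sensor-reading-42")
print(f"verified after {used} attempt(s)")
```

Here the first write is silently corrupted, the read-back check detects the mismatch, and the second attempt restores a correct state. A real system would likely add bounded retry budgets, alerting, and fallback to a redundant service once the attempts are exhausted.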
Industrial scalability likewise requires a system that remains both operational and performant even as the load, the resource inventory, or the volume of ingested environmental data grows significantly.
AI is experienced in delivering highly reliable services in mission-critical domains. We have built systems capable of surviving failed compute, network, and storage services as well as failures of peer application and middleware services. Our tools have had to operate in highly dynamic, failure-prone environments while managing system loads and servicing continuous end-user and administrator flows. Despite these challenges, consistent use of autonomics, redundancy, microservices, defensive programming, and other design principles has allowed non-stop operation of mission-critical platforms.