From cloud to OCP? Be ready to wrangle firmware


Warren Buffett’s insurance giant GEICO has cut costs by “50% per compute core and more than 60% per gigabyte of storage” by moving to Open Compute Project (OCP) open hardware and open-source software.

That’s according to the hardware team at the automotive insurer – which made waves last year by revealing it was repatriating the majority of cloud workloads after cloud bills hit over $300 million a year.

The news comes after owner Berkshire Hathaway praised GEICO’s “significantly improved operating results” in an annual report this year.

It has not been an easy journey though, surfacing “several operational gaps and challenges that required deep investment in tooling, processes, and internal expertise,” GEICO said today. 

Among the biggest: The shift to Redfish-based fleet management and the need to “build firmware lifecycle automation from the ground up” in a way that accommodated “differences in BMC features, IPMI behavior, and firmware stability” across hardware.

Among other moves, GEICO partnered with Taiwan's Wiwynn to develop modular, extensible base platforms for storage and compute that fit in its OCP-spec ORv3 racks.

1,000 OCP servers now live

The non-profit OCP publishes specifications for those building IT systems on open, commodity hardware, with a focus on efficiency, impact, openness, scalability, and sustainability.

GEICO is working to repatriate “at least” 50% of its cloud workloads by 2029 with the aim of consolidating “compute and storage on a common hardware platform” that can power a flexible, homegrown, open-source [OCP] private cloud.

Cloud and OEM: Challenges abound

Whilst recognising the benefits of the public cloud, GEICO said that it had also encountered “capacity constraints, service deprecations, and regional or quota limits that required significant unplanned work to meet business demand…”

Simply shifting to OEM hardware to underpin a private cloud also comes with issues, its team said, pointing to “proprietary control planes and lifecycle managers” that cramp operations.

Further:

“When infrastructure spans multiple OEMs, operations fragment across separate tools and release cadences, producing duplicated processes, inconsistent interfaces, parallel automation pipelines, and divergent inventory and telemetry models.” - GEICO

It is shifting to modular, open-source hardware and software as a result – saying it made the decision after rigorous TCO analysis.

“Side-by-side quotes from OEM offerings, public cloud services, and OCP platforms sourced through ODMs were normalized to unit economics and modeled over a multiyear horizon. The analysis included server acquisition, rack power and cooling, colocation space, lifecycle operations, support, software licensing impacts, and depreciation. Sensitivity tests were applied for utilization, energy price, and component refresh...”
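To make “normalized to unit economics” concrete, the sketch below shows one way such a comparison could be structured: total cost over a multiyear horizon divided by the core-years actually used, then re-run across a range of utilization values as a simple sensitivity test. The `Quote` fields, function names, and every number are illustrative assumptions, not GEICO's model or figures.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Quote:
    """One vendor or cloud quote. All fields are hypothetical placeholders."""
    name: str
    capex: float        # up-front acquisition cost over the horizon
    annual_opex: float  # power, cooling, colo space, support, licensing per year
    cores: int          # compute cores delivered by the quote


def cost_per_core_year(q: Quote, years: int, utilization: float) -> float:
    """Total cost over the horizon divided by utilized core-years."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total = q.capex + q.annual_opex * years
    return total / (q.cores * utilization * years)


def sensitivity(q: Quote, years: int, utilizations: Iterable[float]) -> Dict[float, float]:
    """Recompute the unit cost for each assumed utilization level."""
    return {u: round(cost_per_core_year(q, years, u), 2) for u in utilizations}


if __name__ == "__main__":
    # Placeholder inputs purely for illustration.
    example = Quote("example-platform", capex=1000.0, annual_opex=100.0, cores=10)
    for u, cost in sensitivity(example, years=5, utilizations=[0.5, 0.75, 1.0]).items():
        print(f"utilization={u:.2f} -> cost per core-year={cost}")
```

The same pattern extends to energy price or refresh cadence by varying `annual_opex` or `capex` instead of utilization.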

In a new whitepaper presented at the OCP Global Summit in San Jose this week, GEICO’s team shared lessons from building “two new colocation facilities with more than 1,000 OCP servers” based on OCP Open Rack version 3 (ORv3) architecture.

The 29-page whitepaper, authored by GEICO’s Jason Holpuch (senior hardware engineer), John Hilt IV (head of data centre and operations), Ryan Chow (firmware engineer), and Sahid Jaffa (senior director of engineering), details their efforts to date.

It hasn’t all been plain sailing… 

Finding power workarounds… 

The whitepaper reveals that GEICO had to create its own hybrid ORv3 power system “so traditional AC-powered networking gear can coexist with OCP-native systems” – after finding that “many traditional devices such as top-of-rack and management switches were not designed for ORv3 racks and required IEC 60320 C13/C14 outlets, with C19/C20 where higher power is needed…”

Fleet and firmware management also “emerged as critical gaps requiring new processes and expertise,” they said. For example, GEICO uses OpenStack for bare metal provisioning, but found that integration with “ODM” hardware that it sourced for its co-los was a challenge, “particularly when BMC firmware lacks full support for Redfish or exhibits inconsistent behavior…”
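As an illustration of what coping with inconsistent Redfish behaviour can involve, the sketch below flattens Redfish `SoftwareInventory` resources (the entries under a BMC's firmware inventory) into one common record and flags components whose BMC did not report a version. The property names `Id`, `Name`, `Version`, and `Updateable` come from the Redfish schema; the function names and sample payloads are hypothetical and are not taken from GEICO's tooling.

```python
from typing import Any, Dict, List, Optional


def normalize_firmware_item(resource: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Flatten one SoftwareInventory resource; absent properties become None."""
    return {
        "id": resource.get("Id"),
        "name": resource.get("Name"),
        "version": resource.get("Version"),
        "updateable": resource.get("Updateable"),
    }


def components_missing_version(records: List[Dict[str, Optional[Any]]]) -> List[Any]:
    """Return the ids of components for which no firmware version was reported."""
    return [r["id"] for r in records if not r["version"]]


if __name__ == "__main__":
    # Hypothetical payloads: one complete, one from a BMC that omits Version.
    payloads = [
        {"Id": "BMC", "Name": "BMC Firmware", "Version": "1.2.3", "Updateable": True},
        {"Id": "BIOS", "Name": "System BIOS"},
    ]
    records = [normalize_firmware_item(p) for p in payloads]
    print(components_missing_version(records))
```

Treating missing properties as data rather than errors lets a single pipeline report gaps per platform instead of failing on the first non-conformant BMC.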

Six key challenges for the OCP

Their efforts have revealed some of the major challenges for those seeking to follow suit. And the OCP community needs to address six key issues if it wants to continue to “lower barriers to entry and make hyperscale-grade efficiency, flexibility, and cost outcomes attainable for the everyday enterprise,” GEICO said. 

Many of them involve the provision of better guidance and “starter kits”. Those six challenges for the OCP community are:

  1. Supply and SKUs: Create predictable pricing and lead times for orders of 50 to 500 hardware units. “Pre-cabled first-rack pilot kits” would be nice too…
  2. ORv3 power: Standardize hybrid ORv3 power to support AC switches (C13/C14) during transition.
  3. Firmware: Ship enterprise-ready OpenBMC with LTS, signed images, an SBOM, and security SLAs. Redfish conformance for automated BIOS and BMC rollout/rollback.
  4. New Product Introduction (NPI): More guidance, better starter kits, with thermal and reliability guidance for dense rack configurations, including networking components.
  5. Fleet operations: Auto-discovery plugins, reference jobs for secure firmware operations, image signing, staged rollouts, UEFI Secure Boot, and end-to-end chain-of-trust patterns suitable for regulated industries are needed, said GEICO.
  6. Support ecosystem alignment: “The OCP community should standardize cross-vendor validation and badge enterprise-ready, pre-integrated and validated solutions from OEM/ODM suppliers,” its team also urged. 
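
The “staged rollouts” in that list can be pictured with a minimal sketch: a fleet is split into cumulative waves (here 1%, 10%, then everything) so that a bad firmware image is caught on a handful of machines before it reaches the rest. The function name, the percentages, and the host names are assumptions for illustration; the whitepaper's actual rollout policy is not described in this article.

```python
from typing import List, Sequence


def plan_waves(hosts: Sequence[str], percents: Sequence[int] = (1, 10, 100)) -> List[List[str]]:
    """Split hosts into rollout waves at cumulative percentage thresholds."""
    waves: List[List[str]] = []
    done = 0
    n = len(hosts)
    for p in percents:
        # Integer ceiling of n * p / 100, avoiding floating-point rounding.
        target = min(n, -(-n * p // 100))
        if target > done:  # skip thresholds that add no new hosts
            waves.append(list(hosts[done:target]))
            done = target
    return waves


if __name__ == "__main__":
    fleet = [f"node{i}" for i in range(1000)]  # hypothetical host names
    for i, wave in enumerate(plan_waves(fleet), start=1):
        print(f"wave {i}: {len(wave)} hosts")
```

In practice each wave would be gated on health checks, with rollback to the previous signed image if a wave fails.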

Want to follow suit? Start hiring...

Organisations wanting to follow suit and take back control of their stack will need to “plan for additional roles that are not required in a classic OEM relationship,” GEICO’s team warned.

That includes “low-level hardware engineers familiar with hardware description language, firmware engineers for BIOS and BMC integration”, staff to handle Engineering Validation Testing (EVT), Design Validation Testing (DVT), and Production Validation Testing (PVT), as well as test automation engineers to build “repeatable pipelines for inventory, telemetry, and updates.”

Conclusion?

Shifting to Open Compute Project systems is a compelling opportunity to “achieve hyperscale-grade cost efficiency, platform flexibility, and infrastructure transparency”.

But, GEICO’s team cautioned, “adopting open hardware is more than a technical decision. It is an operational and cultural transformation that demands maturity across procurement, planning, legal, and engineering… Enterprises must be prepared to navigate contract governance, supply chain coordination, inspection…processes, and compliance requirements.” 
