‘Every company is a software company’ must be one of the most overused phrases in the B2B world, but the truth of it has probably never been more stark, especially when it comes to supply chain.
For there is no aspect of supply chain that is untouched by digitalisation, as organisations seek to create supply chains that can withstand the seemingly never-ending litany of disruptive events – starting with the pandemic, followed by the Suez blockage, labour problems, Russia’s invasion of Ukraine, rampant inflation, looming recession, extreme weather events, a US-China trade war, and now conflict in the Middle East that threatens to suck in other players in the region.
Whether it is a company reliant on supply chain software to determine inventory spikes at select locations, or an enterprise seeking an app to geo-fence its fleet of drivers to help cut costs and improve efficiency, the need for software is never-ending.
For any software solution to work – whether in supply chain or another business function – the key is extracting meaning from data, which means being able to query that data as quickly and affordably as possible. Fail to do this, and a company is simply collecting data for its own sake and paying steep storage costs for the privilege of doing so.
Enter site reliability engineering (SRE), the field of IT that helps businesses maintain resilient systems that can respond quickly to the changing demands of customers.
“Without such systems businesses run the risk of losing out on potential revenue and customers due to downtime or slow response times,” says Alok Uniyal, VP & Head of IT Process Consulting Practice at Infosys. “SRE has emerged as an effective solution for building resilient systems by leveraging best practices from software development, operations, and system administration.”
In essence, SRE is a set of practices that focus on optimising the reliability of services and systems by applying software engineering principles to infrastructure and operations problems.
Whether it is a logistics company running a delivery management platform, an ecommerce company using micro-fulfilment software or an organisation whose procurement team relies on a P2P solution, SRE provides a framework for keeping digital systems stable, even under high levels of usage and peak demand.
“This typically involves monitoring system performance, proactively preventing faults, automating the toil, responding quickly to issues when they arise, and regularly assessing potential weaknesses in existing systems,” explains Uniyal.
He adds: “SRE is also cost-effective, because by automating processes and improving their reliability, businesses can avoid the costly downtime associated with system failures. It cuts down on people-hours and allows organisations to reallocate resources towards higher-value activities, such as product development.”
Yet by its nature SRE requires a great deal of technical knowledge and sophisticated tools, neither of which will be available in every organisation. Plus, many businesses struggle to set up not only the processes needed for effective SRE but also the culture required to integrate SRE into existing systems and operations.
“This is why change management is a critical factor in the success of any SRE transformation,” says Uniyal. “At Infosys we suggest certain best practices to ensure SRE delivers maximum benefits to an organisation.”
He continues: “It’s important to establish KPIs that are aligned to business goals. SRE teams should set clear service-level objectives that define the targets for service availability and performance, and then monitor these closely.
“They can be tracked using service-level indicators, which provide visibility into system performance in near real-time. Teams must also prioritise key performance indicators that align with business goals.
“These metrics should be reviewed regularly to ensure they remain relevant and effective.”
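To make the relationship between service-level objectives and indicators concrete, here is a minimal Python sketch of how a team might compare a measured availability indicator against its objective and see how much of its "error budget" is left. The service name, the 99.9% target and the request counts are illustrative assumptions, not figures from Infosys.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A service-level objective: the target share of good events."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_events: int, total_events: int) -> float:
    """Service-level indicator: the measured fraction of good events."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_failures = (1.0 - slo.target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical figures for one reporting window.
slo = SLO(name="order-tracking API availability", target=0.999)
good, total = 999_500, 1_000_000

sli = availability_sli(good, total)
budget = error_budget_remaining(slo, good, total)

print(f"{slo.name}: SLI={sli:.4%}, target={slo.target:.1%}")
print(f"SLO met: {sli >= slo.target}, error budget remaining: {budget:.0%}")
```

In this sketch the service is inside its objective but has used roughly half of the failures it is allowed for the window, which is the kind of signal a team could review regularly alongside its business KPIs.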
Uniyal adds that implementing “automated rollbacks” can mitigate the damage caused by a failed deployment. “Decoupling systems and services ensures that a system failure does not cascade down to dependent systems,” he says, adding: “Teams can also implement chaos engineering techniques to test the resilience of their systems.
“By introducing controlled failures into the system and assessing how it reacts, teams can proactively identify weaknesses and improve resilience.”
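As a rough illustration of the idea, the Python sketch below injects controlled failures into a dependency and checks whether the calling code keeps serving answers from a fallback rather than letting the failure cascade. Every name in it (`fetch_live_eta`, `get_eta`, the shipment IDs, the 30% failure rate) is hypothetical, and a real chaos experiment would target running infrastructure rather than an in-process function.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency outage."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that a controlled fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_live_eta(shipment_id):
    """Hypothetical dependency: a live delivery-time lookup."""
    return {"shipment": shipment_id, "eta_hours": 12, "source": "live"}


def get_eta(shipment_id, eta_lookup):
    """Code under test: falls back to a cached estimate if the lookup fails."""
    try:
        return eta_lookup(shipment_id)
    except Exception:
        # Decoupling in miniature: the dependency's failure is contained here.
        return {"shipment": shipment_id, "eta_hours": 24, "source": "cached"}


def run_experiment(calls=1000, failure_rate=0.3, seed=42):
    """Run the experiment and report how many calls were served or degraded."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_lookup = with_chaos(fetch_live_eta, failure_rate, rng)
    results = [get_eta(f"SHP-{i}", flaky_lookup) for i in range(calls)]
    served = sum(1 for r in results if r["eta_hours"] is not None)
    degraded = sum(1 for r in results if r["source"] == "cached")
    return {"calls": calls, "served": served, "degraded": degraded}


report = run_experiment()
print(report)
```

If `served` ever falls below `calls`, the experiment has exposed a path where a dependency failure reaches the user, which is exactly the kind of weakness such tests are meant to surface before a real outage does.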
Other strategies Uniyal suggests to ensure a successful SRE roll-out include:
Preempting potential problems: “This is achieved through continuous observation of systems and applications, proactive testing and reducing manual effort by using automation tools. SRE teams also work closely with development teams to identify potential issues in the development phase and eliminate them before they become actual problems.”
Leveraging development operations: “DevOps plays a critical role in enabling SRE by incorporating continuous testing practices and driving automation in the software development process. DevOps teams work collaboratively across departments, simplifying the software development process, and reducing the time it takes to deliver features.”
Incident response: “Companies must develop incident response playbooks and processes prescribing remedial steps to address incidents when they occur. SRE teams should be trained in these processes, and regular drills should be conducted to ensure they are well-prepared to handle any incident that may arise.
“They should conduct blameless post-incident reviews to identify root causes, develop corrective action plans, and ultimately improve resiliency. Incident review provides valuable insights into system weaknesses, which teams can use to continually improve the system.”