Site Reliability Engineering – The New Ruler of the Software Management

The path to software delivery is laden with challenges and roadblocks, but once delivery to production is complete, another game starts. This is a digital age, driven by the Industry 4.0 revolution, and every business is a digital business. If its applications are down, then for practical purposes its business is down.

If we go back 15 to 20 years, Google was the pioneer in this area. During that same period Amazon matured rapidly, which is how AWS was born as a new business. Had Google capitalized on this area earlier, it could have been the market leader in cloud platforms.

Moving on from cloud platforms, let's come back to our topic of discussion: any hosted application has to factor security, availability, and scalability into its plans. Why have these factors recently become more significant? Site Reliability Engineering can address all of these factors.
 
Why Site Reliability Engineering?

Site reliability engineering helps in estimating, preventing, and managing uncertainty and the risk of failure. It cannot eliminate all failures; what it does is evaluate the inherent dependability of an application (or process), spot outliers, and recommend actions to mitigate the impact of failures.

Delivering software applications is a complex endeavor, but ensuring that they function as intended in the production environment is harder still. Incorporating a handful of features into an application does not guarantee its success. Success depends on the ability of the production operations team to deliver the security, availability, and scalability that the application promises. Even companies like Walmart that deal in physical goods are heavily dependent on software. As mentioned above, software applications are no longer just support systems for businesses. They are mission-critical, and hence reliability is the area of focus.

What does Site Reliability mean?

Site Reliability Engineering, to a large extent, augments the capabilities of DevOps. I consider it one of the categories of DevOps. Users of applications from companies like Google, Amazon, and Netflix always expect security, availability, and scalability. If any of these is compromised, it is a lost business opportunity.

Security and Privacy:
• Users are concerned about both security and privacy.
• Cloud platforms (for example, AWS or Azure) bring certain good practices and frameworks. Private data centers managed by companies have their own challenges.
• Breaches can occur at four levels: data center, system, application, and data. A breach at any level can also impact availability.

Availability:
• The entire value chain of the application has to be up and running.
• Proactive implementation: monitoring, along with logging, plays a critical role in detecting issues before users report them.
• Reactive implementation: an issue management system like Jira Service Desk should be in place so that users can report problems.
• Apart from the above, at the infrastructure level, backup, disaster recovery, and change management processes (blue-green deployments, rollback) are critical.
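
The point that the entire value chain has to be up can be made concrete with a little arithmetic. If a request must pass through several components in series, and failures are assumed to be independent, the end-to-end availability is the product of the individual availabilities. The sketch below illustrates this; the component names and availability figures are hypothetical and chosen only for illustration.

```python
from math import prod


def chain_availability(availabilities):
    """End-to-end availability of components in series.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical value chain; names and figures are illustrative only.
components = {
    "load_balancer": 0.9999,
    "web_app": 0.999,
    "database": 0.9995,
    "payment_gateway": 0.999,
}

overall = chain_availability(components.values())
downtime_hours_per_year = (1 - overall) * 365 * 24

print(f"End-to-end availability: {overall:.4%}")
print(f"Expected downtime: about {downtime_hours_per_year:.1f} hours per year")
```

The chain is always less available than its weakest component, which is why monitoring and recovery processes need to cover every link and not only the application itself.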

Scalability:  
• For B2C applications, the difference between peak and off-peak load is quite high. Too few resources cause performance issues, and too many are wasteful.
• Technologies like clusters (nodes), containers, and microservices are important, along with scale-up and scale-down functionality.
 
This is where the cloud can be utilized at its best. The practices and tools used to manage these aspects are what Site Reliability Engineering is. Adopting cloud platforms like AWS and Azure makes this easier for any company.
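
The scale-up and scale-down idea from the Scalability points can be sketched as a simple sizing rule: run enough instances to serve the current load plus some headroom, within a fixed minimum and maximum. This is an illustrative sketch only, not how any particular cloud autoscaler works; the function name, parameters, and numbers are all assumptions.

```python
import math


def desired_instances(current_load, capacity_per_instance,
                      min_instances=1, max_instances=10, headroom_percent=20):
    """Return how many instances to run for the given load.

    current_load and capacity_per_instance use the same unit
    (for example, requests per second). headroom_percent adds spare
    capacity so that a small spike does not cause performance issues.
    """
    needed = math.ceil(
        current_load * (100 + headroom_percent) / (capacity_per_instance * 100)
    )
    # Clamp between the floor (availability) and the ceiling (cost control).
    return max(min_instances, min(max_instances, needed))


# Hypothetical B2C traffic pattern: quiet night, normal day, sale-day peak.
for load in (50, 450, 1000):
    print(load, "req/s ->", desired_instances(load, capacity_per_instance=100))
```

The minimum guards against too few resources, and the maximum guards against waste; tuning those two bounds and the headroom is the trade-off described above.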

Is SRE Applicable for every company? 

Across every industry and company size, each company falls into one of the categories below.

1. Software product companies hosting applications for customers
2. IT service providers that host applications for customers
3. Any company hosting applications for internal users
4. Any company that doesn't host applications
  • Service providers like marketing consulting agencies
  • Software / hardware product companies (OEMs)

For categories 1 and 2, it is critical to implement SRE with the highest maturity. For category 3, SRE is definitely needed but is not as critical as for 1 and 2. For category 4, SRE is not applicable.
 

Important Metrics to track Site Reliability Engineering  

When embracing Site Reliability Engineering, it is important to constantly monitor, track, and measure the application across various metrics to evaluate its reliability. Important metrics include uptime, mean time to failure (MTTF), mean time between failures (MTBF), mean time to repair (MTTR), rate of failure occurrence, and probability of failure, among others. These metrics help teams determine the level of software quality as well as the volume and variety of potential failures, so they can take steps to resolve issues as quickly as possible.
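
Several of these metrics can be derived from a plain list of incidents. The sketch below uses common textbook definitions: availability is uptime divided by total time, MTTR is total downtime divided by the number of incidents, MTBF is total uptime divided by the number of incidents, and the failure rate is the number of incidents per hour of uptime. The `Incident` type, the function name, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_hour: float  # hours since the start of the observation window
    end_hour: float    # hour at which service was restored


def reliability_metrics(incidents, window_hours):
    """Compute basic reliability metrics over an observation window."""
    downtime = sum(i.end_hour - i.start_hour for i in incidents)
    uptime = window_hours - downtime
    count = len(incidents)
    return {
        "availability": uptime / window_hours,
        "mttr_hours": downtime / count if count else 0.0,
        "mtbf_hours": uptime / count if count else float("inf"),
        "failure_rate_per_hour": count / uptime if uptime else float("inf"),
    }


# Hypothetical 30-day window (720 hours) with two outages.
incidents = [Incident(10, 12), Incident(100, 101)]
for name, value in reliability_metrics(incidents, window_hours=720).items():
    print(f"{name}: {value:.4f}")
```

Teams sometimes define MTBF from failure start to failure start instead, so it is worth agreeing on the definitions before comparing numbers across teams or tools.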

What does site reliability engineering focus on?

Assessing the inherent reliability of a software application and suggesting appropriate actions to mitigate issues requires teams to embrace certain concepts and practices.

Focus by the Operations Team:
• Measuring the metrics
• Security implementation
• APM (application performance monitoring) implementation
• ITIL or JSD (Jira Service Desk) implementation
• Automation
• Cloud migration

Focus by the Development or Engineering Teams:
• Logging framework
• Scalable architecture

Delivering high-quality applications does not mean just high performance; it also requires teams to ensure that applications are reliable. Engineering teams need to design and develop for reliability in addition to the functional, technical, and regulatory requirements. You can read more on this topic in this great e-book on how Google manages SRE.

Please comment below and let us know your thoughts on Site Reliability Engineering and software reliability, as we plan to explore more on these topics.
 
                                         Uday Kumar
About:
Specialist in software delivery and IT operations. A generalist in business operations; an intrapreneur (proactive, adaptable, and balanced) who has built products and solutions and sold them, apart from building solid, scalable teams. 17+ years of experience overall. Worked at GE (~8 years) and working at Addteq for the last 8 years. Started as a first employee and currently working as BU Head (owning P&L). Experience in various functions: product engineering, project/program management (products, services [outsourced, delivered]), consulting, presales, product management, sales, marketing, strategy, and service portfolio. A few of my traits:
 
* Always believe in learning. A lifelong student.
* Simplify complex tasks (with a basics / fundamentals approach).
* Good at operationalizing (0 to 1), optimizing, and scaling.
* Very candid in discussions.
* Enable team members (and sometimes customers as well) to think.
* Believe in systems. There is a method to my work.
* Naturally see inefficiencies, errors, and problems, which helps improve quality.
* Strong in applying theory once learnt (e.g., operations management theory to team productivity).
* Have a very different perspective:
> Every team/function is like a manufacturing unit.
> Process is like friction. It is an enabler (rather than an overhead) if used appropriately.
> There is no such thing as an Agile / DevOps culture.
> The Agile Manifesto is not meant for products; the Scrum and SAFe frameworks are not meant for services.
> There is no single DevOps product.
> Scientific measurement of team productivity is not yet established. Without a baseline, all ROI claims for improvements are incorrect.
