How do you calculate Cloud ROI?

There is often confusion between TCO and ROI. TCO, the Total Cost of Ownership, is a component of ROI. When IT companies came out with the concept of managed services, businesses started outsourcing their IT department's functions to IT service providers, which helped them focus on their core competencies and reduced the load on the IT department. They retained minimal in-house IT resources and began reducing the total cost of ownership. COTS (commercial off-the-shelf) products displaced in-house development and reduced TCO further. However, the real TCO reduction started when the cloud came into the picture: it largely removed capex costs and moved the infrastructure from the premises to a third party's premises. ROI, on the other hand, focuses not just on removing capex but on the overall return the customer earns on the investment. Let us look at the parameters that impact ROI on the cloud.

1. Downtime: The single most important parameter that affects the return on investment is the downtime that happens across IT systems. It affects customers, internal stakeholders and so on. Drawing on extensive operational experience, public cloud service providers have drastically reduced downtime caused by hardware infrastructure. The second biggest source of incidents is the application infrastructure. Cloud service providers abstracted not only the hardware but also the application infrastructure: web servers, application servers, cache management, container management and application deployment. Deployments made without a rollback plan used to be a frequent cause of outages; all the major public cloud providers now offer managed services for application infrastructure alongside the raw hardware. Effectively, downtime hours have been reduced by the cloud service providers on both the hardware and the application infrastructure fronts.

2. Agility: Agility is the ability to move quickly and easily. IT departments have been slow to deploy the changes required by the business, because the primary change is not only in the development code but in the workflow that must change with it. Cloud service providers offer flexible deployment workflows by bringing DevOps infrastructure into the cloud: code repositories, automated builds, automated deployments and rollbacks, followed by quick monitoring.

Autoscaling is another important contributor to agility: quick provisioning of compute engines, creating security policies and attaching them to the compute engines, the ability to make a change in one place that is reflected everywhere, and so on. While agility cannot be measured exactly in hours, workflow disruption is the clearest symptom of its absence.

3. Pay for what you use: In a traditional data centre model the resources are there even if you do not use them. A one-time, largely irreversible budget was created; that irreversibility is the biggest issue with capex buying. There were many idle resources, and capacity prediction has always been a hard problem. Cloud service providers address both under-utilised and over-provisioned resources: resource hygiene (shutting down idle resources, right-sizing) can be automated, and usage is metered so you pay only for what you actually consume.

4. Technology deprecation: Every advancement in technology leads to cost reduction or capacity enhancement at a lower price. The cost of upgrading technology in a data centre environment is enormous, and the cycle of POC, pilot and production is a drawn-out process. Public cloud service providers make such upgrades far easier, since new hardware and services become available without a procurement cycle.

The final equation for ROI is always in terms of money. Converting the tangible and intangible parameters above into time lost or gained is essential to calculating the return on investment. The ROI equation also has an investment component, which includes the traditional investment as well. So, a simple way to calculate cloud ROI is as follows,
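As an illustrative sketch only (every figure and variable name below is a hypothetical assumption, not a benchmark), the four parameters above can be folded into an annual ROI figure:

```python
# Illustrative cloud ROI sketch; all figures are hypothetical assumptions.
# The four gain terms map to the four parameters discussed above.

def cloud_roi(downtime_hours_saved, cost_per_downtime_hour,
              engineer_hours_saved, cost_per_engineer_hour,
              capex_avoided, tech_refresh_avoided,
              annual_cloud_spend, migration_cost):
    """Return ROI as a fraction: (gains - investment) / investment."""
    gains = (downtime_hours_saved * cost_per_downtime_hour   # 1. downtime
             + engineer_hours_saved * cost_per_engineer_hour # 2. agility
             + capex_avoided                                 # 3. pay for use
             + tech_refresh_avoided)                         # 4. deprecation
    investment = annual_cloud_spend + migration_cost
    return (gains - investment) / investment

roi = cloud_roi(downtime_hours_saved=40, cost_per_downtime_hour=5_000,
                engineer_hours_saved=2_000, cost_per_engineer_hour=60,
                capex_avoided=250_000, tech_refresh_avoided=50_000,
                annual_cloud_spend=300_000, migration_cost=100_000)
print(f"ROI: {roi:.0%}")
```

The gains side converts downtime, agility, pay-per-use and technology-deprecation savings into money; the investment side carries the recurring cloud spend plus the one-time migration cost.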

Note: Step (c) is the most difficult parameter to arrive at when calculating ROI.

The Cloud Operating Model – Optimize Costs, Enhance Availability, Drive Performance

The cloud operating model is the way workflows are defined on the cloud to achieve IT operational goals, and it is quite different from what has been happening in the data centre. Cloud paves the way for moving the infrastructure out of the data centre. An abstraction layer is created on top of the hardware with a hypervisor, which hosts the guest operating systems; these hypervisors are increasingly being replaced by container images to ensure better portability of applications. The operating model of the data centre has many layers to ensure availability, whereas the cloud operating model has fewer, more fine-grained layers to do so. The cloud operating model grew out of the data centre operating model, and it presents challenges as well as opportunities to improve performance. Some advantages of the cloud operating model include:

SLAs: In a typical data centre operating model, SLAs are a function of all the layers of infrastructure, from the cables and power supply units all the way up to where the applications are hosted. In the cloud model, hardware availability SLAs are taken off the CIO's hands and are usually 99.99% by default. If we use the cloud service provider's platform services, the underlying application infrastructure SLAs are also taken care of by the provider, and application availability is the only SLA that still needs to be ensured. Managing load balancers, firewalls and caches has been taken off the IT services plate. A good example of improved SLAs is the usage of the Gmail and Microsoft O365 platforms: mail delivery has become more reliable, and people focus more on new features. Improved availability of software and hardware has made life much easier for IT people.
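The SLA arithmetic here can be made concrete: the availability of serially dependent layers multiplies, so fewer layers means a higher composite SLA. A small sketch with illustrative (assumed) availability figures:

```python
# Composite availability of serially dependent layers is the product of
# each layer's availability. All figures below are illustrative assumptions.

def composite_availability(*layers):
    result = 1.0
    for availability in layers:
        result *= availability
    return result

# Data-centre style stack: many layers, each eating into the SLA
# (power, network, application infrastructure, application).
dc = composite_availability(0.999, 0.999, 0.995, 0.999)

# Cloud style stack: hardware/platform at 99.99%, only the app SLA to manage.
cloud = composite_availability(0.9999, 0.999)

print(f"data centre: {dc:.4%}, cloud: {cloud:.4%}")
```

This is why removing layers from the CIO's plate directly improves the end-to-end SLA, not just the operational workload.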

Scalability: Hardware scalability constraints are out of the door, along with the need to run RFPs to procure various types of hardware. However, as every solution brings a new problem, scalability has brought in the problem of cost: indiscriminate usage of auto-scaling on the cloud has led to increased cost, making operations unviable. So, the cloud operating model requires very robust auto-scaling policies.
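A robust policy typically scales proportionally to load but clamps capacity to a budgeted range, so runaway auto-scaling cannot blow the budget. A minimal sketch in the style of target-tracking scaling (the thresholds and caps are assumptions):

```python
import math

def desired_capacity(current, metric_value, target_value, min_cap=2, max_cap=20):
    """Target-tracking style scaling: grow capacity proportionally to load,
    clamped to a range so auto-scaling stays within budget."""
    desired = math.ceil(current * metric_value / target_value)
    return max(min_cap, min(max_cap, desired))

# 10 instances running at 90% CPU against a 60% target -> scale out to 15.
print(desired_capacity(current=10, metric_value=90, target_value=60))
```

The clamp is the "robust policy" part: without `max_cap`, a traffic spike or a metrics bug scales cost without bound.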

Cost: A typical data centre has a budget forecast at the start of the financial year, and cost is managed against that budget. In the cloud, cost is an engineering problem: it can be continuously leveraged, and we have an opportunity to spend less by using cloud resources more diligently. Cost management is not a one-time exercise but a continuous one.

Services Options: The number of services available from the cloud is mind-boggling, and no two services from two different vendors are equal. AWS has more than 150 services, Azure has hundreds, and GCP has around 90. Every vendor has its own approach to abstracting the hardware and software away, so understanding the service options is paramount to a good operating model. Unlike in a data centre, options to improve service are continuously available from a cloud services provider. A managed services option is provided by almost all cloud service providers; its focus is not on our IT applications but on the hardware by default and, increasingly, on the application infrastructure, namely databases, authentication services, caching services and so on.

Security: The cloud operating models provide security at the infrastructure and platform software levels by default. Most of the cloud service providers take care of network security as well. However, in the cloud operating model, security becomes a shared responsibility, and it is not the responsibility of the provider or IT department alone. Cloud Service providers still ensure a robust security mechanism to protect the data, devices, and application infrastructure.

Support: The support model for cloud service providers is different, and it depends on the choice of subscription. It is important to go through the entire subscription model and choose the tier that matches the business needs. Business continuity, both at the geography (region) level and at the zone level, is available by default. Disaster recovery and data replication are features that make the cloud operating model a better model than data centre operating models.

To conclude, it is important to unlearn some of the data centre operations metrics and learn new metrics for the cloud operating model in order to provide better service to our customers. Cloud operating models are different, and with every level of abstraction the cloud service providers add on top of the underlying layers, the operating model changes further.

Ten DevOps Metrics IT Departments Should Track

The evolution of DevOps engineering has led to the practice of tracking useful metrics that bring cost efficiency to the IT department of an enterprise. Good DevOps metrics share a few properties: they are measurable, relevant, incorruptible, actionable and traceable. Complying with these properties helps achieve high uptime for IT as well. This applies to any set of metrics tracked in an IT environment, not just DevOps metrics. Let us list some of the important DevOps metrics.

  1. Deployment Frequency: It is important to understand how frequently deployments are made, as frequency can reflect bug fixes, change velocity or quick feature changes driven by business requirements. High frequency is a double-edged sword: it indicates agility, but it also raises questions about code stability, testing rigour and so on.
  2. Deployment Volume: The deployment volume indicates the number of new features or bug fixes shipped, as it is difficult to trawl through Bugzilla-like software to check what is going on in the application. Frequency and volume together are inputs to the more meticulous monitoring required for the applications and the infrastructure.
  3. Lead Time to Deployment: This helps in planning a deployment, from the time work on it starts to the time it reaches production. If one does not follow a blue/green deployment strategy, the lead time directly affects deployment planning: a high lead time leads to more downtime for the applications and therefore for the business.
  4. Deployment Failures: Deployment failures indicate inadequate testing or unstable code; frequent deployment failures warrant reviews to ensure that the stability and reliability of the applications are restored.
  5. Ticket Volume: The number of tickets raised against an application indicates its stability from both the code perspective and the usability perspective. Depending on the type of tickets, high ticket volumes mean the underlying application issues need to be addressed. Ticket volume is a good indicator of stability and tells a tale of staleness in the application.
  6. Volume of Production Bugs: While many bugs are caught during QA, some escape into production. Production bugs are the ones that impact the customer and therefore result in loss of business; they carry both urgency and importance and must be tracked.
  7. Mean Time between Failures: The mean time between failures of the application is another indicator of its stability. The ability to trace and track the reasons for failure is essential to keeping the mean time between failures high.
  8. Mean Time to Recover: The mean time to recover from failure will help identify the resiliency of the application. Failures can happen in production or during deployment but the ability to recover or repair after failure ensures business continuity of the applications.
  9. Mean Time to Detect: Issue resolution in IT can be divided into two parts: the time to detect an issue and the time to repair it. Detection takes us to the place where the issue occurred, and repair starts from there. Usually, detection takes longer than repair. Tracking the time to detect helps reduce the time to recover an application and therefore minimises the impact on the business.
  10. Ratio of Actionable Alerts to Total Alerts: Most IT environments have monitoring systems that raise alerts when IT system resources deviate from normal behaviour. The goal of the monitoring system should be to give only actionable alerts and suppress noise. It is important to track this ratio; the industry norm puts the ratio of actionable to total alerts at around 2%. Effectively, out of 100 alerts only 2 are actionable.
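Several of the metrics above (7, 8, 9 and 10) reduce to simple arithmetic over incident records. A minimal sketch with hypothetical data (field layouts and figures are assumptions):

```python
from statistics import mean

# Each incident: (minutes_to_detect, minutes_to_repair) from failure start.
incidents = [(30, 10), (45, 15), (20, 5)]
failure_times_hr = [0, 120, 300]   # failure start times, hours since epoch

mttd = mean(d for d, _ in incidents)       # 9. mean time to detect (minutes)
mttr = mean(d + r for d, r in incidents)   # 8. recovery includes detection time
mtbf = mean(b - a for a, b in zip(failure_times_hr, failure_times_hr[1:]))  # 7.

actionable, total = 4, 200                 # 10. alert quality
ratio = actionable / total

print(f"MTTD {mttd:.1f} min, MTTR {mttr:.1f} min, "
      f"MTBF {mtbf:.0f} h, actionable ratio {ratio:.1%}")
```

Note the design choice that time-to-recover includes time-to-detect, which is exactly why metric 9 argues that reducing detection time shrinks recovery time.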

With the DevOps culture picking up in most organizations, it has become important to track DevOps metrics to make the IT operations of the enterprise more efficient. DevOps dashboards should pave the way for better productivity.

Multi Cloud – The new Paradigm shift in Cloud Transformation

Gartner’s 2020 Cloud Service Providers Magic Quadrant report says that the leading players in the public cloud space are AWS, Azure and GCP. By 2026, the public cloud provider business is estimated to reach around 488 billion dollars annually. Cloud service providers provide the infrastructure, tools and software needed to run the business. While applications are becoming environment-independent through container technology, which allows users to migrate an application from one environment to another seamlessly, cloud service providers do position some tools that in turn lead to a vendor lock-in situation. Apart from this, there are multiple reasons why enterprises want to work with multiple cloud service providers. The idea of this blog is to look at the reasons why customers want to use multiple cloud service providers.

Reasons for Multi Cloud Options

  1. Customers want to avoid vendor lock-in. Continuous usage of a specific tool or piece of infrastructure locks the customer in with a vendor, and customers want to be vendor-agnostic on both infrastructure and applications.
  2. Certain services are done best by certain vendors, and customers want the best of breed from different vendors. As an example, AI infrastructure is delivered best by Google, as Google is a leading AI player; some of the best developer tools are available from Azure; and AWS provides mature auto-scaling features.
  3. Cost is another factor driving customers into multiple cloud environments. As an example, GCP introduced per-second billing on usage while AWS historically billed by the hour. New entrants offer better pricing, and customers look at workload migration to take advantage of it. Cloud offers continuous cost leverage, unlike the data centre, which follows a once-a-year IT budgeting model.
  4. Niche services are provided by certain vendors. Many customers are adopting the SaaS model, since none of them wants to worry about managing the cost of infrastructure and platform. As an example, Salesforce leads the CRM space with its SaaS offering, and ServiceNow leads the ITSM space. However, adopting a SaaS model paves the way for vendor lock-in, as it is difficult to migrate applications once a customer is on SaaS. Most enterprises therefore run a combination of SaaS and public cloud service providers.
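The billing-granularity point in reason 3 is easy to quantify. A sketch with a hypothetical $0.10/hour instance (rates and granularities below are assumptions, not current vendor pricing):

```python
import math

def cost(runtime_seconds, hourly_rate, granularity_seconds):
    """Bill in whole units of the provider's billing granularity."""
    units = math.ceil(runtime_seconds / granularity_seconds)
    return units * granularity_seconds / 3600 * hourly_rate

runtime = 5 * 60 + 10   # a 5-minute-10-second batch job
per_second = cost(runtime, hourly_rate=0.10, granularity_seconds=1)
per_hour = cost(runtime, hourly_rate=0.10, granularity_seconds=3600)
print(f"per-second billing: ${per_second:.4f}, hourly billing: ${per_hour:.2f}")
```

For short-lived workloads the coarser granularity charges a full hour for a few minutes of use, which is why billing granularity matters in vendor selection.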

Managing Multi Cloud Environments

Managing multiple clouds is not easy, as the skills required differ per cloud, making it a costly affair from a resource perspective. Each cloud service provider has a different set of products, and no two products are designed to be equal; finding an equivalent product is an exercise in itself. Here are a few tips for managing multi cloud environments.

  1. Abstract the service: Define your services independent of the cloud service provider and try and see what all products can fit into your abstraction. Assume that you need a compute engine. Define the compute engine configuration and choose the equivalents from multiple cloud service providers. Service Abstraction is an important part of multi cloud management.
  2. Cost Management: Cloud cost management is an engineering problem, not a finance problem. In a multi cloud environment, it is important to choose a tool that can help identify spend issues across platforms. As an example, CloudHealth by VMware gives a perspective on resource utilization across various clouds.
  3. Reporting: Managing Multi cloud environments leads to receiving multiple reports and it is important to configure relevant reports of interest.
  4. Define the Dependencies on the Cloud Service Provider: It is important to define your dependencies on each cloud service provider. As an example, if you depend on AWS CloudWatch for your alerts rather than a third-party tool, it is better to document that dependency explicitly.
  5. DevOps Management: All cloud service providers make the DevOps toolchain much easier than the private cloud or data centre toolchain. A code pipeline is easy to set up with a cloud service provider unless a specific on-prem tool is mandated; however, this creates a good amount of dependency on the cloud service provider.
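Tip 1 (abstract the service) can be sketched as a provider-agnostic interface. The class names and returned instance shapes below are illustrative, not real SDK calls:

```python
from abc import ABC, abstractmethod

class ComputeEngine(ABC):
    """Provider-agnostic abstraction; concrete classes map to vendor products."""
    @abstractmethod
    def launch(self, vcpus: int, memory_gb: int) -> str: ...

class EC2Engine(ComputeEngine):          # hypothetical AWS equivalent
    def launch(self, vcpus, memory_gb):
        return f"ec2 instance: {vcpus} vCPU / {memory_gb} GB"

class GCEEngine(ComputeEngine):          # hypothetical GCP equivalent
    def launch(self, vcpus, memory_gb):
        return f"gce instance: {vcpus} vCPU / {memory_gb} GB"

def provision(engine: ComputeEngine):
    # Application code depends only on the abstraction, which eases
    # migration of workloads between providers.
    return engine.launch(vcpus=4, memory_gb=16)

print(provision(EC2Engine()))
print(provision(GCEEngine()))
```

Because the application talks only to `ComputeEngine`, swapping providers means swapping the concrete class, not rewriting the application.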

To conclude, multi cloud is a great option, but it should be evaluated carefully before implementation to make sure the IT infrastructure and applications remain portable.

Cloud AI Infrastructure – Redeem Better ROI from AI

The exponential growth of AI in the last decade has happened for two specific reasons: growth in processing power and growth in storage capacity. Until 1999 the growth of processing power had been slow, but with the arrival of GPUs and TPUs, processing power grew exponentially and AI truly came out of the 'AI winter'. Kryder's law states that disk drive density tends to double every 13 months, and this has steadily reduced the price of storage. With 5G technology on the anvil, the need for storage will reduce further: low-latency networks will be the order of the day, and the need to hold data at multiple places for processing will shrink.
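The Kryder's law figure above implies a simple growth model: if density doubles every 13 months, then after m months it has grown by a factor of 2^(m/13). A quick check of what that means over a decade:

```python
# Growth factor under Kryder's law as stated above: doubling every 13 months.
def density_growth(months, doubling_period=13):
    return 2 ** (months / doubling_period)

# Over a decade (120 months) density grows roughly 600-fold,
# which is why the per-gigabyte price of storage collapsed.
print(f"{density_growth(120):.0f}x")
```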

There are four powers necessary for AI's success: processing power, the power of algorithms, domain knowledge and data. Many enterprises want to take advantage of AI. However, transformation using AI has become a costly affair on capex systems: the high-powered GPUs used by AI programs are normally not required for day-to-day transactions, so a capex investment makes it hard for many businesses to deliver ROI from AI technology. Cloud technology has come as a saving grace for enterprises seeking ROI on AI.

To start with, the IBM Watson platform has been delivered primarily on the cloud; an on-premises version of Watson is not typically available. Many enterprises have taken advantage of Watson's AI capabilities, specifically for prediction. Watson offers infrastructure as well as machine learning models to deliver outcomes. Other cloud companies have not been far behind in delivering AI services from the cloud. AWS SageMaker arrived as a platform on which large machine learning models can be built, trained and deployed, while Amazon Comprehend, Amazon CodeGuru and similar services helped AWS offer a range of machine learning services on its hardware.

Microsoft Azure offers a comprehensive list of AI services. This includes speech, vision, and text analytics services. Apart from this, Microsoft offers a comprehensive set of tools from a machine learning studio to enable DevOps for AI code deployment. Microsoft Azure also offers Bot Services to build a bot, deploy a bot and train a bot.

The widest range of services comes from Google Cloud, which separates its AI offering into four categories. Google offers scalable AI infrastructure services that enable compute engines to be attached to a GPU by default and allow scaling to a TPU; Google Compute Engine instances run with Nvidia processors. Google also offers horizontal APIs for text search, computer vision using Vision AI, and a set of speech AIs. Google, among the largest contributors to open source, has released BERT, making these services more robust; Google has been an AI-first company as much as a search company. GCP also offers wide-ranging, vertical-specific, out-of-the-box AI solutions, including healthcare and life sciences, data cleansing, data labelling and data analytics. GCP focuses on the complete AI lifecycle: from data collection, data cleaning and data labelling to training models on scalable infrastructure and deploying machine learning code using Kubeflow.

Cloud service providers continue to evolve with new services. The ability of the cloud service providers to provide high-end computing engines on-demand gives enterprises an opportunity to transform their business using AI. They offer a comprehensive set of AI tools and frameworks, creating an ecosystem that adds value while delivering the service.
