What is a Cloud SLA (Cloud Service-Level Agreement)?
Similar to a contract signed between a customer and a cloud service provider (CSP), the Cloud SLA forms the most crucial and fundamental component of how security and operations will be undertaken.
The Cloud SLA should also capture requirements related to compliance, best practices, and general operational activities to satisfy each of these.
Within a Cloud SLA, the following contents and topics should be covered at a minimum:
- Availability (for example, 99.99 percent of services and data)
- Performance (for example, expected response times versus maximum response times)
- Security and privacy of the data (for example, encrypting all stored and transmitted data)
- Logging and reporting (for example, audit trails of all access and the ability to report on key requirements and indicators)
- Disaster recovery (DR) expectations (for example, worst-case recovery commitment, recovery time objectives [RTOs], the maximum period of tolerable disruption [MPTD])
- Location of the data (for example, ability to meet the requirements of, and remain consistent with, local legislation)
- Data format and structure (for example, data retrievable from a provider in a readable and intelligible format)
- Portability of the data (for example, ability to move data to a different provider or multiple providers)
- Problem identification and resolution (for example, help desk/service desk, call center, or ticketing system)
- Change-management process (for example, updates or new services)
- Dispute-mediation process (for example, escalation process and consequences)
- Exit strategy with expectations on the provider to ensure a smooth transition
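The availability item in the list above is easier to weigh once the percentage is converted into a concrete downtime budget. The following is a minimal sketch of that arithmetic; the function name `downtime_budget` and the choice of a 365-day year and 30-day month are assumptions made for illustration, and a real SLA defines its own measurement period.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime permitted in `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


# Assumed measurement periods; a contract may define these differently.
year = timedelta(days=365)
month = timedelta(days=30)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget(target, year)} per year, "
          f"{downtime_budget(target, month)} per 30-day month")
```

Under these assumptions, 99.99 percent availability allows roughly 52.6 minutes of downtime in a year, while 99.9 percent allows about 8.8 hours.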
What are the Cloud SLA Components?
Although Cloud SLAs tend to vary significantly depending on the provider, more often than not they are structured in favor of the provider to ultimately expose them to the least amount of risk.
Note the examples of how elements of the Cloud SLA can be weighed against the customer’s requirements.
- Uptime Guarantees
- Service levels regarding performance and uptime are usually featured in outsourcing contracts but not in software contracts, despite the significant business-criticality of certain cloud applications.
- Numerous contracts have no uptime or performance service-level guarantees, or provide them only as changeable URL links.
- Cloud SLAs, if they are defined in the contract at all, are rarely guaranteed to stay the same, or not to diminish significantly, upon renewal.
- A material diminishment of the Cloud SLA upon a renewal term may necessitate a rapid switch to another provider at significant cost and business risk.
- SLA Penalties
- For Cloud SLAs to be used to steer the behavior of a cloud services provider, they need to be accompanied by financial penalties.
- Contract penalties provide an economic incentive for providers to meet stated Cloud SLAs.
- This is an important risk-mitigation mechanism, but such penalties rarely, if ever, provide adequate compensation to a customer for related business losses.
- Penalty clauses are not a form of risk transfer.
- Penalties, if they are offered, usually take the form of credits rather than refunds. But who wants an extension of a service that does not meet requirements for quality?
Some contracts let the provider earn back penalties if it consistently exceeds the SLA for the remainder of the contract period.
- SLA Penalty Exclusions
- Limitation on when downtime calculations start: Some CSPs require that the application is down for some time (for example, 5 to 15 minutes) before any counting toward Cloud SLA penalty will start.
- Scheduled downtime: Several CSPs claim that if they give you a warning, an interruption in service does not count as unplanned downtime but rather as scheduled downtime and, therefore, is not counted when calculating penalties.
In some cases, the warning can be as little as eight hours.
- Suspension of Service
- Some cloud contracts state that if payment is more than 30 days overdue (including any disputed payments), the provider can suspend the service. This gives the CSP considerable negotiation leverage in the event of any payment dispute.
- Provider Liability
- Most cloud contracts restrict liability, apart from infringement claims relating to intellectual property, to a maximum of the value of the fees over the past 12 months. Some contracts cap it at as little as six months of fees.
- If the CSP were to lose the customer’s data, for example, the financial exposure would likely be much greater than 12 months of fees.
- Data-Protection Requirements
- Most cloud contracts make the customer ultimately responsible for security, data protection, and compliance with local laws.
- If the CSP is complying with privacy regulations for personal data on your behalf, you need to be explicit about what the provider is doing and understand any gaps.
- Cloud contracts rarely contain provisions about DR or provide financially backed RTOs.
- Some IaaS providers do not even take responsibility for backing up customer data.
- Security Recommendations
- Gartner recommends negotiating SLAs for security, especially for security breaches, and has seen some CSPs agree to this.
- Immediate notification of any security or privacy breach as soon as the provider is aware is highly recommended.
- Because the customer is ultimately responsible for the organization’s data and for alerting its own customers, partners, or employees of any breach, it is particularly critical for companies to determine what mechanisms are in place to alert them if a security breach does occur and to establish SLAs determining the time frame the CSP has to alert them of any breach.
- The time frames you have to respond within will vary by jurisdiction but may be as little as 48 hours.
Be aware that if law enforcement becomes involved in a provider security incident, it may supersede any contractual requirement to notify you or to keep you informed.
These examples highlight the dangers of not applying sufficient focus and due diligence to the SLA when engaging with a CSP.
Because these controls list a general sample of potential pitfalls related to the SLA, the following documents can serve as useful reference points when ensuring that SLAs are in line with business requirements.
They can also balance risks that may previously have been unforeseen.
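Returning to the SLA penalty exclusions described earlier in this section, a short sketch can show how much they shrink the downtime that actually counts toward a penalty. Everything here is an assumption for illustration: the `Outage` record, the `countable_downtime` function, the 15-minute grace period, the 8-hour notice threshold, and the reading that only time beyond the grace period is counted. Actual contracts word these exclusions in different ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Outage:
    start: datetime
    end: datetime
    announced_at: Optional[datetime] = None  # when the CSP gave warning, if it did


def countable_downtime(outages, grace=timedelta(minutes=15),
                       min_notice=timedelta(hours=8)):
    """Sum the downtime that counts toward penalties under the assumed exclusions."""
    total = timedelta()
    for outage in outages:
        # Scheduled-downtime exclusion: enough warning means the outage is ignored.
        if (outage.announced_at is not None
                and outage.start - outage.announced_at >= min_notice):
            continue
        # Grace-period exclusion: counting starts only after `grace` has elapsed.
        duration = outage.end - outage.start
        total += max(duration - grace, timedelta())
    return total


# Hypothetical outage log.
outages = [
    Outage(datetime(2024, 1, 1, 9, 20), datetime(2024, 1, 1, 10, 50)),  # 90 min, no warning
    Outage(datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 10)),  # 10 min, no warning
    Outage(datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 4, 0),
           announced_at=datetime(2024, 1, 1, 18, 0)),                   # 2 h, 8 h warning
]

raw = sum((o.end - o.start for o in outages), timedelta())
counted = countable_downtime(outages)
print(f"Raw downtime: {raw}, countable toward penalties: {counted}")
```

In this hypothetical log the customer experiences 3 hours 40 minutes of interruption, but only 1 hour 15 minutes counts toward a penalty.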
What are the Key Cloud SLA Elements?
The following key elements should be assessed when reviewing and agreeing to the SLA:
- Assessment of risk environment: What types of risks does the organization face?
- Risk profile: What is the number of risks, and what are their potential effects?
- Risk appetite: What level of risk is acceptable?
- Responsibilities: Who will do what?
- Regulatory requirements: Will these be met under the SLA?
- Risk mitigation: Which mitigation techniques and controls can reduce risks?
- Risk frameworks: What frameworks are to be used to assess the ongoing effectiveness? How will the provider manage risks?
Ensuring Quality of Service
Several key indicators form the basis for determining the success or failure of a cloud offering.
The following should form a key component for metrics and appropriate monitoring requirements:
Availability: This measures the uptime (availability) of the relevant services over a specified period as an overall percentage, for example, 99.99 percent.
Outage Duration: This captures and measures the loss of service time for each instance of an outage, such as 1/1/201X—09:20 start—10:50 restored—1 hour 30 minutes loss of service.
MTBF (mean time between failures): This captures the indicative or expected time between consecutive or recurring service failures, for example, 1.25 hours per day of 365 days.
Capacity metric: This measures and reports on capacity capabilities and the ability to meet requirements.
Performance metrics: These identify the areas, factors, and reasons behind bottlenecks or degradation of performance. Typically, performance is measured and expressed as requests or connections per minute.
Reliability Percentage metric: This lists the success rate for responses and is based on agreed criteria, for example, a 99 percent success rate in transactions completed to the database.
Storage Device Capacity metric: This lists metrics and characteristics related to storage device capacity; it is typically provided in gigabytes.
Server Capacity metric: This lists the characteristics of server capacity, based on and influenced by the number of central processing units (CPUs), CPU frequency in GHz, random access memory (RAM), virtual storage, and other storage volumes.
Instance Startup Time metric: This indicates or reports on the length of time required to initialize a new instance, calculated from the time of request by user or resource, and typically measured in seconds and minutes.
Response Time metric: This reports on the time required to perform the requested operation or tasks, typically measured based on the number of requests and response times in milliseconds.
Completion Time metric: This provides the time required to complete the initiated or requested task, typically measured by the total number of requests as averaged in seconds.
Mean-Time to Switchover metric: This provides the expected time to switch over from a service failure to a replicated failover instance. This is typically measured in minutes and captured from commencement to completion.
Mean-Time System Recovery metric: This highlights the expected time for a complete recovery to a resilient system in the event of or following a service failure or outage. This is typically measured in minutes, hours, and days.
Scalability Component metrics: This is typically used to analyze customer use, behavior, and patterns that can allow for the auto-scaling and auto-shrinking of servers.
Storage Scalability metric: This indicates the storage device capacity available if increased workloads and storage requirements are necessary.
Server Scalability metric: This indicates the available server capacity that can be utilized when workloads increase.
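Several of the metrics above (availability, outage duration, MTBF, and reliability percentage) can be derived from a simple log of outages and request counts. The sketch below assumes a hypothetical log of `(start, restored)` pairs over a 30-day reporting period, with an arbitrary year chosen for the dates, and computes MTBF as total uptime divided by the number of failures. The function names are illustrative, not part of any provider's tooling.

```python
from datetime import datetime, timedelta


def total_downtime(outages):
    """Sum the outage durations; each outage is a (start, restored) pair."""
    return sum((restored - start for start, restored in outages), timedelta())


def availability_percent(outages, period_start, period_end):
    """Uptime over the reporting period as a percentage."""
    period = period_end - period_start
    return 100 * (1 - total_downtime(outages) / period)


def mean_time_between_failures(outages, period_start, period_end):
    """Total uptime divided by the number of failures, or None if there were none."""
    if not outages:
        return None
    uptime = (period_end - period_start) - total_downtime(outages)
    return uptime / len(outages)


def reliability_percent(successful, attempted):
    """Share of attempted operations that completed successfully."""
    return 100 * successful / attempted


# Hypothetical 30-day reporting period and outage log.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)
outages = [
    (datetime(2024, 1, 1, 9, 20), datetime(2024, 1, 1, 10, 50)),  # 1 h 30 min loss of service
    (datetime(2024, 1, 15, 2, 0), datetime(2024, 1, 15, 2, 30)),  # 30 min loss of service
]

availability = availability_percent(outages, period_start, period_end)
mtbf = mean_time_between_failures(outages, period_start, period_end)
reliability = reliability_percent(successful=9_900, attempted=10_000)

print(f"Availability: {availability:.3f}%")
print(f"MTBF: {mtbf}")
print(f"Reliability: {reliability:.1f}%")
```

Two hours of downtime in this 30-day period already puts availability near 99.72 percent, well short of a 99.99 percent commitment, which is why the outage log and the headline percentage should be monitored together.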
What is a Risk Profile?
The risk profile is determined by an organization’s willingness to take risks as well as the threats to which it is exposed.
The risk profile should identify the level of risk to be accepted, the way risks are taken, and the way risk-based decision-making is performed.
Additionally, the risk profile should take into account potential costs and disruptions should one or more risks be exploited.
To this end, it is imperative that an organization fully engages in a risk-based assessment and review against cloud-computing services, service providers, and the overall effects on the organization should it utilize cloud-based services.
What is Risk Appetite?
Swift decision-making can lead to significant advantages for the organization, but when assessing and measuring the relevant risks in cloud-service offerings, it’s best to have a systematic, measurable, and pragmatic approach.
Undertaking these steps effectively enables the business to balance the risks and offset any excessive risk components, all while satisfying listed requirements and objectives for security and growth.
Emerging or rapid-growth companies will be more likely to take significant risks when utilizing cloud-computing services so they can be first to market.
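One way to make the comparison between risk profile and risk appetite systematic and measurable is to score each risk and flag those above an agreed threshold. The sketch below uses the common simplification of exposure as likelihood multiplied by impact; that formula, the `Risk` class, the sample risks, and all figures are assumptions for illustration rather than a method prescribed here.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: float  # estimated probability of occurring in a year, 0.0 to 1.0
    impact: float      # estimated cost if it occurs, in a single currency

    @property
    def exposure(self) -> float:
        """Expected annual loss under the likelihood-times-impact simplification."""
        return self.likelihood * self.impact


def risks_exceeding_appetite(risks, appetite):
    """Return the risks whose exposure is above the appetite, largest first."""
    flagged = [risk for risk in risks if risk.exposure > appetite]
    return sorted(flagged, key=lambda risk: risk.exposure, reverse=True)


# Hypothetical risk register for a cloud engagement.
register = [
    Risk("Provider outage beyond SLA", likelihood=0.30, impact=200_000),
    Risk("Loss of customer data", likelihood=0.02, impact=5_000_000),
    Risk("Service suspension during payment dispute", likelihood=0.05, impact=100_000),
]

flagged = risks_exceeding_appetite(register, appetite=50_000)
for risk in flagged:
    print(f"{risk.name}: exposure {risk.exposure:,.0f} exceeds appetite")
```

An organization with a larger appetite, such as a rapid-growth company racing to market, would raise the threshold and accept more of these risks without additional mitigation.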
Difference Between the Data Owner and Controller and the Data Custodian and Processor
Treating information as an asset requires several roles and distinctions to be identified and defined.
The following are key roles associated with data management:
- The data subject is an individual who is the focus of personal data.
- A data controller is a person who, either alone or jointly with other persons, determines the purposes for which, and the manner in which, any personal data is processed.
- A data processor, in relation to personal data, is any person other than an employee of the data controller who processes the data on behalf of the data controller.
- Data stewards are commonly responsible for data content, context, and associated business rules.
- Data custodians are responsible for the safe custody, transport, data storage, and implementation of business rules.
- Data owners hold legal rights and complete control over a single piece or set of data elements.
- Data owners also possess the ability to define the distribution and associated policies.