Cloud platform engineering isn’t easy, but when it’s done right the business benefits make it worthwhile. In this post, author Ashley Petherick shares insights into the lessons learnt from building a cloud platform for a FTSE 100 financial services business.
About the Cloud Platform
The cloud platform we’re focusing on here supports application developers and testers. It underpins a multi-stage software delivery lifecycle (SDLC) for development and testing in production and non-production environments.
Our goal was to enable developer teams to deploy an enterprise application through iterative, collaborative, and secure use of CI/CD pipelines. Steps were also taken to encourage a lean, agile approach to delivery. Ultimately, we enabled the business to modernise an important but ageing application as it transitioned from on-premises to cloud. Using DevOps and automation practices, we empowered developers to work more effectively. Outcomes ranged from higher productivity to improved quality and velocity as well as increased uptime and better business agility.
Lessons Learnt Building an Enterprise Cloud Platform
We’re sharing eight key learnings from the programme to help other organisations embarking on SDLC platform creation.
1. Aim for Self-Service
Giving developers direct access to platform features and enabling application code deployment through automation pipelines reduces constraints and boosts productivity. Access can be closely managed with robust security controls such as role-based access and granular service connections. Securing both platforms and pipelines in this way widens developer access while keeping security risk under control.
For instance, a foundational feature of self-service is the provision of secure access to Key Vaults containing secrets, certificates, and keys. At scale, the need to upload, renew, or double-check these artefacts can lead to an unmanageable volume of developer requests. This results in extended lead times which impair developer productivity, while platform engineers become bogged down with low-value manual work. Giving developers access to the Key Vaults for the appropriate SDLC stage increases their autonomy and independence, while maintaining security.
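One way to picture stage-scoped Key Vault access is as a small policy table that maps each SDLC stage and role to the actions it may perform. The sketch below is illustrative only: the stage names, role names, and action names are assumptions rather than the configuration used on the programme, and a real platform would express this through the cloud provider’s own role-based access controls.

```python
# Hypothetical policy table: SDLC stage -> role -> permitted vault actions.
# All stage, role, and action names here are illustrative assumptions.
VAULT_POLICY = {
    "dev": {"developer": {"get", "list", "set"}, "tester": {"get", "list"}},
    "test": {"developer": {"get", "list"}, "tester": {"get", "list"}},
    "prod": {"developer": set(), "tester": set()},
}


def allowed_actions(stage: str, role: str) -> set:
    """Return a copy of the actions a role may perform in a stage's vault."""
    return set(VAULT_POLICY.get(stage, {}).get(role, set()))


def can_perform(stage: str, role: str, action: str) -> bool:
    """Deny by default: unknown stages, roles, or actions are refused."""
    return action in allowed_actions(stage, role)
```

Under these assumptions a developer can set a secret in the development vault without raising a request, while any request against production, or against an unknown stage, is refused by default.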
A self-service approach enables developers to be more independent and efficient, deploying code (and generating business value) faster. This approach also benefits platform teams, removing toil and releasing time for more rewarding, high-value work. It achieves high leverage efficiency ratios too: the effort required to provide automated self-service features to ten developers can scale to one hundred developers or more.
2. Determine Pipeline Architecture at the Outset
An enterprise-scale SDLC platform requires an infrastructure and application architecture for the environments created. Platform engineers should also determine the pipeline architecture early, as the volume, complexity, and interdependence of pipelines can quickly become difficult to manage.
Consistent use of common pipeline configurations (such as staging, templating, and ‘don’t repeat yourself’ (DRY) code), together with secure use of variables and secrets, ensures pipelines are stable and secure. This allows platform team capacity to scale with reduced risk while increasing the success of self-service uptake.
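The DRY idea can be sketched as a single stage template that every environment reuses, with only the differences supplied per stage. This is a simplified illustration rather than the pipeline code from the programme: `BASE_STAGE`, `render_stage`, `render_pipeline`, and the step, variable, and secret names are all hypothetical. Secrets appear only as references, on the assumption that the pipeline tooling resolves them at run time.

```python
from copy import deepcopy

# Hypothetical shared template; step, variable, and secret names are assumptions.
BASE_STAGE = {
    "steps": ["checkout", "build", "deploy"],
    "variables": {"log_level": "info"},
    "secret_refs": ["db-password"],  # names only, never secret values
}


def render_stage(name, overrides=None):
    """Build one stage from the shared template plus stage-specific overrides."""
    stage = deepcopy(BASE_STAGE)  # copy so the template itself is never mutated
    stage["name"] = name
    stage["variables"].update(overrides or {})
    # Scope each secret reference to the stage, e.g. "dev/db-password".
    stage["secret_refs"] = [f"{name}/{ref}" for ref in BASE_STAGE["secret_refs"]]
    return stage


def render_pipeline(stage_names, overrides_by_stage=None):
    """Render every stage of a pipeline from the same template."""
    overrides_by_stage = overrides_by_stage or {}
    return [render_stage(n, overrides_by_stage.get(n)) for n in stage_names]
```

Because every stage comes from one definition, a change to the shared steps is made once and reaches all environments, which is the property that keeps a growing set of pipelines manageable.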
3. Apply Production Practices to Everything
The platform we’re focusing on provides environments for development, user acceptance testing, system integration testing, performance testing, pre-production, and production. It’s important to consider who will use these environments, along with their expectations and requirements.
As far as possible, production practices should be applied to all non-production environments. Developers are unlikely to tolerate a development environment that regularly suffers downtime and disruption that harms productivity and hinders their ability to meet sprint goals.
Informing development or test team customers of infrastructure deployments is a given. We also learnt the importance of providing appropriate levels of observability and monitoring in non-production platforms. It’s best to be proactive and enable visibility of performance, reliability, and availability. A production-like dashboard view of compute resource use (e.g. disk, processor, and memory) across AKS clusters and the data platform encourages frictionless and frequent analysis. This observability encourages proactive avoidance of platform disruption which might otherwise impact developer productivity.
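As a rough illustration of the kind of check that sits behind such a dashboard, the sketch below flags any node whose disk, processor, or memory use exceeds a threshold. The node names, metric names, and threshold values are assumptions for the example; in practice the readings would come from the platform’s monitoring tooling rather than a hard-coded dictionary.

```python
def find_hotspots(samples, thresholds):
    """Return sorted (node, metric, value) tuples for readings above threshold.

    samples:    {node_name: {metric_name: utilisation between 0 and 1}}
    thresholds: {metric_name: maximum acceptable utilisation}
    Metrics without a configured threshold are ignored.
    """
    hotspots = []
    for node, metrics in samples.items():
        for metric, value in metrics.items():
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                hotspots.append((node, metric, value))
    return sorted(hotspots)


# Hypothetical readings for two cluster nodes.
samples = {
    "aks-node-1": {"cpu": 0.92, "memory": 0.40},
    "aks-node-2": {"cpu": 0.30, "disk": 0.95},
}
thresholds = {"cpu": 0.85, "memory": 0.90, "disk": 0.90}
print(find_hotspots(samples, thresholds))
```

With these sample values, the processor on the first node and the disk on the second are reported, which is the sort of early signal that lets the team act before developers are disrupted.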
It’s well-known that development environments for the creation of application code must be available, stable, and reliable. But what about the environment where platform features are developed? A platform team sandbox allows infrastructure or platform code to be written – and tested – before deployment to the development platform. Sandboxing should ideally encompass the full suite of applications deployed and include automated testing. This ‘shift-left’ approach ensures problems are detected early, reducing their impact and the effort required to resolve them. It also provides the platform team with an area to experiment, without risking downtime in the developer environment.
4. Monitor Development Platforms like Production Platforms
Monitoring is not usually implemented on the development stages of the SDLC due to the high risk of false positives. Development environments can be ephemeral, frequently destroyed and recreated, potentially triggering a high volume of unnecessary alerts. So, the cost of wasted effort is an important consideration, as well as the baseline cost of infrastructure monitoring.
Nevertheless, when development environments are unavailable, developer teams are inconvenienced and unable to deploy application code. A balance needs to be struck, with production-like dashboards and platform monitoring configured against development and test environments as appropriate.
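One possible way to strike that balance is to suppress alerts while an environment is being deliberately rebuilt, and to alert only once an unplanned outage has outlasted a short grace period. The sketch below assumes a hypothetical `planned_rebuild` flag and a 15-minute grace period; both are illustrative choices, not values from the programme.

```python
from datetime import datetime, timedelta


def should_alert(down_since, now, planned_rebuild=False,
                 grace=timedelta(minutes=15)):
    """Decide whether an unavailable environment warrants an alert.

    Planned teardown and recreation never alerts; anything else alerts
    only after the outage has lasted at least the grace period.
    """
    if planned_rebuild:
        return False
    return now - down_since >= grace
```

A short blip during a routine recreate stays quiet, while a development environment that has been down for longer than the grace period is treated as a genuine problem.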
Build agents are a case in point. This software is critical to modern tool chains such as Azure DevOps, and the ability to deploy code should be afforded the same level of priority as production. When developers and platform engineers can’t run pipelines, the system is significantly impaired, the application can’t be developed, and it’s impossible to deliver new features or address issues through hotfixes or patches.
5. Use Automated Testing on the Platform, not just the Application
Automated testing is critical to rapid, secure, high-quality deployments. This should extend beyond the application itself to encompass focused unit and smoke testing on all platform features of the SDLC.
Continuous automated testing ensures deployed features do what they should, while bringing confidence that the platform is robust. Quality is maintained, platform stability is upheld in the face of continual change, and platform team delivery times are reduced.
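A platform smoke test can be as simple as a named set of quick checks run after every deployment. The sketch below is a minimal illustration under that assumption: the check names are hypothetical, and real checks would probe actual platform features rather than return fixed values.

```python
def run_smoke_tests(checks):
    """Run each named check and record pass or fail.

    checks: {name: zero-argument callable returning a truthy value on success}
    A check that raises an exception is recorded as a failure rather than
    aborting the remaining checks.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def all_passed(results):
    """True only when at least one check ran and every check passed."""
    return bool(results) and all(results.values())
```

Running every check even after a failure gives the platform team a complete picture of what a change has broken, rather than only the first problem encountered.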
6. Ensure Team Productivity is Observable
The dashboards of infrastructure monitoring tools provide valuable insight into trends surrounding the use of compute resource. Agile delivery practices also advocate graphs to display team productivity, as well as workload flow and dynamics. For example, velocity charts can be used to convey the volume or complexity of deployments per sprint.
These powerful visual tools can quickly summarise trends, stimulating action to address any issues that are emerging. Graphs can be far more effective than text-based descriptions. A dashboard of relevant metrics enhances awareness of any hotspots, encouraging the team to be proactive about solving constraints.
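The figures behind a simple velocity chart can be derived from a list of deployments tagged by sprint. The sketch below counts deployments per sprint and smooths them with a rolling average so that the trend is easier to see; the sprint labels and the window size are assumptions for illustration.

```python
from collections import Counter


def deployments_per_sprint(sprint_labels):
    """Count deployments per sprint from one label per deployment."""
    return dict(sorted(Counter(sprint_labels).items()))


def rolling_average(values, window=3):
    """Average each value with up to window - 1 preceding values."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages
```

Plotting the rolling average alongside the raw counts helps separate a genuine change in throughput from the normal variation between sprints.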
7. Keep Technical Debt Visible
In a fast-paced programme using agile and lean practices, technical debt is inevitable and needs to be managed carefully. However, it’s not always obvious how much is building up, how complex it is, what risks are being created, and what the impacts might be.
It’s good practice to keep a technical debt log, but text-based logs alone are insufficient. Sometimes a quick overview is needed to influence stakeholders, keep the team informed, or enable effective prioritisation. As with team productivity, using graphs to display up-to-date information about technical debt can encourage regular review, prioritisation, and pay down. Technical debt should be on the agenda during sprint planning and backlog refinement, not ignored until the project end.
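A text-based debt log can feed a graph with very little effort. The sketch below groups open items by risk level and totals the estimated effort, producing figures that could be charted each sprint. The field names (`risk`, `effort_days`, `status`) are assumptions about how such a log might be structured.

```python
def summarise_debt(items):
    """Summarise open technical debt items by risk level.

    items: list of dicts with hypothetical keys "risk", "effort_days",
    and "status". Only items whose status is "open" are counted.
    """
    summary = {}
    for item in items:
        if item["status"] != "open":
            continue
        bucket = summary.setdefault(item["risk"], {"count": 0, "effort_days": 0})
        bucket["count"] += 1
        bucket["effort_days"] += item["effort_days"]
    return summary
```

Tracking these totals over successive sprints shows at a glance whether debt is being paid down or is quietly accumulating.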
8. Develop an Onboarding Process for New Team Members
Building an enterprise-scale SDLC is a long-running, iterative process. So, it’s highly likely that the team you start out with will not be the team you finish with.
As new members join the team, it’s important that they get up to speed fast, becoming productive as quickly as possible. Introducing people to the context and codebase of a new SDLC platform can be difficult. But a well-defined, frequently rehearsed, up-to-date onboarding process makes it more straightforward.
Onboarding should cover fundamentals such as groups, coding, architectural practices and standards, and the way features are implemented. It’s not necessary to reproduce what’s already available from hyperscale cloud training providers, but it’s important to convey how features are implemented in this environment. Review the process regularly, ensure feedback from new team members is looped in, and update the content as new technology is deployed to the platform. It’s much better to build and improve the onboarding process during the programme than to leave it until the end.
Benefits will be felt by the build team and the operational team alike. New engineers joining the build team will become productive more quickly, and handover to operations will also be smoother. Off-boarding – or a staff exit process – should also be included to maintain platform security.
Unlocking the Benefits of Cloud Platform Engineering
With their bespoke nature, platforms for SDLC are highly complex and difficult to build. But when it comes to enterprise-scale cloud computing, the benefits outweigh the risks, especially when the build is handled by an experienced, multi-disciplined team.
DevOps-enabled cloud platform engineering is a powerful approach that removes the traditional infrastructure constraints felt by development teams. It also provides a tailored base for CI/CD application development and deployment as well as enhancing developer productivity and agility. Together, DevOps and automation practices improve application quality and uptime while enabling more rapid and responsive delivery of business value.
As clouds mature and provide more features, practices also evolve. Cloud engineers are always learning. The eight lessons listed here are just some of the learnings we took from a recent engagement. We hope this post helps you develop your own cloud practices, stimulates curiosity, or validates some of your personal experiences. Maybe it raises new questions too. We’d be happy to hear them.
Ashley Petherick is a Solutions Enablement Consultant and has been implementing and managing internet-based platforms for over 20 years. Over the last 6 years with Sourced Group, he has co-led DevOps engineering teams to deliver cloud-based platforms via code and pipelines, providing secure, scalable and stable SDLC structures.