Recently we have been fortunate to work with one of our enterprise clients on the development and launch of a mission-critical system running on AWS. The system carries the highest criticality rating for the enterprise, requiring complete deployment automation using their existing AWS deployment pipeline, whilst maintaining the highest levels of availability and reliability.
One of the more challenging requirements of the engagement was to deploy a highly available database running Oracle 12c. In this post we will be discussing how we achieved this by deploying Oracle RDBMS on EC2, using a combination of Oracle Data Guard and HashiCorp Consul, while outlining some of the challenges we encountered during this project.
Why not AWS RDS?
In most cases, the platform selection for Oracle database on AWS would lead towards the use of the RDS Platform as a Service with its:
- High performance – fast and predictable, with push-button scaling;
- High availability and reliability – simple, automated backup and recovery capabilities, including Multi-AZ deployments and automated snapshots; and
- Automation capabilities – automated patching and database event notification without human interaction.
Unfortunately, this particular application required a number of Oracle RDBMS features that were not available on AWS RDS. Specifically, the application stack required two features – Oracle Spatial and Java stored procedures – that RDS did not support when we commenced the project. Additionally, the software vendor had not certified their product for use with RDS, leaving us with the challenge of deploying and managing our own highly available Oracle capability.
Note: at the time of writing this article, Oracle Spatial is now available and supported on RDS.
Challenges with running Oracle on EC2
Although there are a number of documents and white papers detailing advanced architectures for running Oracle RDBMS on AWS, the biggest challenge we faced was the deployment automation and lifecycle management of a tightly coupled, highly stateful database application, whilst observing the client's desire to leverage auto-healing and a blue/green release methodology. To achieve the desired outcome, we needed to introduce additional technologies to manage the lifecycle of an Oracle database running on EC2.
Introduction to HashiCorp Consul
HashiCorp Consul is a highly available, distributed service discovery and key/value (KV) store.
In this particular environment, a Consul server cluster is deployed as a three node cluster and although this component of the solution is out of scope for this article, you can find more information about how to do this yourself on HashiCorp’s Bootstrapping a Datacenter page.
Each client connecting to the Consul server cluster requires the Consul agent to be installed. Fortunately, HashiCorp has simplified this process by providing a single binary that:
- Provides both server & client capabilities;
- Runs as a service; and
- Is available for wide range of platforms and operating systems.
Consul uses an agent-based model in which each node in the cluster runs a Consul client. This client maintains a local cache that is regularly updated from the cluster servers, and can also respond to queries from other nodes in the cluster; this allows Consul to enforce connections at the edge without always communicating with the central servers.
Within Consul itself, configuration data is stored in a hierarchical key/value store. Based on events that occur on the clients, Consul is able to distribute changes to other client nodes in the cluster, enabling the rapid distribution of configuration updates.
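To illustrate the shape of this data, Consul's HTTP KV API returns entries as a JSON array with base64-encoded values. The sketch below decodes such a response into a plain dictionary; the key names are hypothetical examples, not the client's actual KV layout:

```python
import base64
import json

def decode_kv_response(body: str) -> dict:
    """Decode a response from Consul's GET /v1/kv/<prefix>?recurse endpoint.

    Consul returns a JSON array of entries whose 'Value' field is
    base64-encoded; this returns a plain {key: value} dict, skipping
    directory entries (which have a null value).
    """
    entries = json.loads(body)
    return {
        e["Key"]: base64.b64decode(e["Value"]).decode("utf-8")
        for e in entries
        if e.get("Value") is not None
    }

# A sample payload in the shape Consul's KV API returns
# (keys and values here are illustrative only):
sample = json.dumps([
    {"Key": "oracle/primary/host", "Value": base64.b64encode(b"db-node-1").decode()},
    {"Key": "oracle/standby/host", "Value": base64.b64encode(b"db-node-2").decode()},
])

print(decode_kv_response(sample))
# {'oracle/primary/host': 'db-node-1', 'oracle/standby/host': 'db-node-2'}
```

In practice the Consul agent and tools such as consul-template handle this decoding for you; the sketch simply shows what is stored and distributed under the hood.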
To enable these changes to occur on the nodes, and within the applications that are running on them, the following Consul capabilities are used in this solution:
- Consul Templates – Consul Templates provide a convenient way to populate values from Consul into configuration files on the node; and
- Consul Watches – Consul Watches use queries to monitor the state of values within Consul for any configuration or health status updates and then invoke user specified scripts to handle the changes and reconfigure the nodes or hosted applications accordingly.
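As a concrete illustration of a watch, the agent configuration fragment below triggers a handler script whenever the primary's address changes in the KV store. The key path and handler script name are hypothetical examples, not the client's actual configuration:

```json
{
  "watches": [
    {
      "type": "key",
      "key": "oracle/primary/host",
      "handler": "/opt/oracle/bin/reconfigure-listeners.sh"
    }
  ]
}
```

When the value under that key changes, Consul invokes the handler and passes the updated entry on stdin, allowing the node to regenerate its configuration; consul-template achieves a similar effect by re-rendering a template file and optionally running a reload command.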
Using these two components of Consul, we are able to easily build reactive infrastructure that dynamically reconfigures itself to ensure the health of a running application.
In the next section, we will demonstrate how this has been used to provide a highly available Oracle capability on EC2.
Using HashiCorp Consul to achieve high-availability for Oracle Database
To achieve a highly-available solution for Oracle Database, we required the following:
- Oracle replication configured and enabled between two nodes, leveraging a fast network interconnect to ensure there is no replication lag that could lead to data loss in the event of a node failure;
- The Oracle Data Guard observer needs to be running to ensure that the primary node's database is in sync with the standby node. It functions by connecting to the standby database node and verifying that replication and heartbeats are occurring from the primary node. The observer is configured externally to the database nodes; in this particular environment we have configured this role to run on the Consul leader; and
- A monitoring and orchestration capability that can execute commands on the Oracle nodes in the event of a failure to ensure that reconfiguration and promotion tasks are performed, such as when the primary node becomes unavailable.
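As a simplified sketch of this orchestration piece, a Consul watch handler could inspect the health-check payload it receives and decide whether a Data Guard failover is required. The check name, return values, and the dgmgrl command mentioned in the comment are illustrative assumptions, not the client's actual scripts:

```python
import json

def needs_failover(checks: list) -> bool:
    """Return True when the primary's health check is reported critical.

    'checks' is the JSON payload a Consul watch of type "checks" passes
    to its handler; the check ID 'oracle-primary' is a hypothetical
    example for this sketch.
    """
    return any(
        c.get("CheckID") == "oracle-primary" and c.get("Status") == "critical"
        for c in checks
    )

def handle(payload: str) -> str:
    """Decide what action the orchestration layer should take."""
    if needs_failover(json.loads(payload)):
        # In the real solution this step would drive Data Guard promotion
        # on the standby, e.g. via the Data Guard broker (dgmgrl);
        # omitted here as it requires a live Oracle environment.
        return "failover"
    return "healthy"

# Demo with an illustrative watch payload:
critical = json.dumps([{"CheckID": "oracle-primary", "Status": "critical"}])
print(handle(critical))  # failover
```

The real reconfiguration and promotion logic is considerably more involved (fencing the old primary, updating KV entries, re-rendering client configuration); the sketch only shows how a watch payload can drive the decision.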
When designing the solution, we also needed to factor in the following constraints within the client’s environment:
- AWS Load Balancers – given the nature of Oracle in this particular environment, specifically its use of regular long-running queries (> 60 minutes), it was not possible to front the database with an ELB because of its maximum timeout period of one hour for TCP connections; doing so would result in connection termination and query failure; and
- DNS – the client’s approach to DNS record management and governance prevented us from updating records programmatically, and carried some risk of records becoming stale or unresolvable during a network isolation event.
Fortunately, Consul’s service discovery and related capabilities helped us overcome these constraints with the following design and implementation.
The following sections detail our technical implementation of the eventual solution with a high-level architecture diagram and detailed descriptions of how the solution is deployed using our client’s deployment pipeline.
High Level Architecture