At a glance

2014 - 2017
Build & migration

The purpose

To migrate our customer to AWS and deliver benefits such as availability, durability and performance of the platform while also reducing the overall total cost of ownership.

Reduced total cost of ownership by 72%

By rendering the disaster recovery environment redundant and shutting down resources when not required, Salsa was able to reduce the total cost of ownership for the platform by 72%.

Improved availability and durability

By balancing resources across two Availability Zones and building resources to automatically heal themselves, Salsa was able to dramatically improve the availability and durability of the platform.

Improved performance and scalability

Salsa boosted the performance and scalability of the platform by designing resources to scale to meet demand as required. This also reduced the overall application footprint.

Leveraging the cloud - architected for AWS

Salsa leveraged several Amazon Web Services to improve the platform: Compute and Networking, Storage and Content Delivery, Database, Deployment and Management, and Application Services.

AWS Architecture

Increased agent efficiency by 600%+

The solution created by Salsa allowed for the near real-time collection and analysis of data from Agents.

1 Background

Salsa's approach was to apply a typical 3-tier enterprise web application architecture to the solution. A web tier enables the administration and management of the data collection tasks, an app tier handles the dispatch, collection and analysis of tasks, and both are backed by a MySQL database.

1.1 Task creation and dispatch

Operators of the web application can create campaigns of tasks to be completed. The application tier processes these campaigns and packages them into job sets for each Agent.

1.2 iPhone app

Salsa created a custom iPhone App that receives the messages and allows the agent to perform their tasks and record the results digitally. The results of the data collection are then returned to the application tier for verification and analysis. The App functions entirely offline with a local database due to difficulties with mobile reception in many areas.

1.3 Verification, analysis and reporting

The results from the Agent's activities are verified by the application tier and any discrepancies are identified. If necessary, it will automatically create additional tasks and dispatch them to agents until it is confident in the accuracy of the results.

Once verified, analysis of the results is undertaken and a regular export is made available for the customer via a B2B interface. Regular data exports and reports are returned to the client at the end of each campaign.

1.4 Infrastructure

The initial infrastructure procured for the live pilot by the customer included two web servers to function as the web and application tiers, backed by two MySQL servers (one master and one slave). This infrastructure was fixed and running full-time. It was replicated into a User Acceptance environment and a testing environment. A lighter weight development environment was also built.

To improve availability, a cold disaster recovery environment was provisioned with specifications identical to the production environment. This environment was running full-time but was not used until a manual failover was initiated from the primary infrastructure. This failover process could take several hours.

2 AWS infrastructure migration

2.1 The opportunity

During the live trial phase, Salsa identified an opportunity to reduce the Total Cost of Ownership and dramatically improve the overall availability and durability of the application by migrating the existing fixed infrastructure to the more flexible utility model offered by Amazon Web Services.

The AWS migration allowed Salsa to reduce the costs to our customer by:

  • powering off resources when not in use;
  • removing the need for a cold disaster recovery environment;
  • reducing the base instance size to that required to meet the average demand; and
  • scaling the infrastructure automatically to meet peak demand.

Additionally, the availability and durability of the application would be improved by:

  • replacing fixed infrastructure in a single data centre with flexible infrastructure split across multiple geographically disparate Availability Zones;
  • leveraging the platform capabilities to allow infrastructure to "self-heal";
  • taking advantage of Amazon Web Services' global infrastructure to provide disaster recovery options in other geographic regions such as Singapore, Tokyo or the US;
  • storing database and instance backups in the cloud, enabling faster restore; and
  • re-architecting the application to take advantage of all AWS has to offer.

Salsa could also deliver performance improvements by:

  • replacing the fixed application tier servers with a scalable pool of worker nodes;
  • ensuring that web and worker nodes can automatically scale to meet peak application loads; and
  • incorporating AWS platform features into development and diagnostic workflows, enabling faster turn-around.

2.2 Security

Amazon Web Services' transparent approach to security was in stark contrast to the opaque setup of the existing infrastructure vendor. The security compliance certifications that AWS has gained were instrumental in allaying customer concerns, and the AWS team was able to answer many specific additional security questions that the customer had.

On top of the AWS physical security and compliance, Salsa was able to use AWS platform features such as VPC to increase the network security and build hardened templates on which to base the machine images.

2.3 Ease of migration

A key driver of the proposal was the ease of migrating the existing application stack into the AWS infrastructure. The selection of supported Amazon Machine Images (AMI) from the major Linux distributions allowed Salsa to have a proof of concept application up and running within a matter of hours, demonstrating the capabilities of the platform to our customer.

Similarly, an AWS Relational Database Service (RDS) instance was provisioned and available for connection from existing MySQL tools to populate sample data.

2.4 Reducing costs

Salsa built a simple Command and Control service to live within its infrastructure. One of its key responsibilities lies in the starting and stopping of EC2 instances as required. This allows for the balancing of resource availability with the goal of reducing costs; EC2 instances can be automatically started and stopped on a schedule according to simple metadata attached to the instance.
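The scheduling decision described above can be sketched as follows. This is a minimal illustration, not Salsa's Command and Control code: the schedule format ("mon-fri 07:00-20:00"), the function names and the idea of storing the schedule as a single metadata value are all assumptions made for the example.

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_schedule(tag_value):
    """Parse a hypothetical schedule value such as 'mon-fri 07:00-20:00'."""
    day_part, time_part = tag_value.lower().split()
    first_day, last_day = day_part.split("-")
    start, end = time_part.split("-")
    days = set(DAYS[DAYS.index(first_day): DAYS.index(last_day) + 1])
    return days, to_minutes(start), to_minutes(end)


def should_be_running(tag_value, now):
    """Return True if an instance with this schedule should be powered on."""
    days, start, end = parse_schedule(tag_value)
    minutes = now.hour * 60 + now.minute
    return DAYS[now.weekday()] in days and start <= minutes < end


def plan_action(tag_value, is_running, now):
    """Decide what the controller should do: 'start', 'stop' or 'none'."""
    wanted = should_be_running(tag_value, now)
    if wanted and not is_running:
        return "start"
    if not wanted and is_running:
        return "stop"
    return "none"
```

A controller built this way would call `plan_action` for each instance on a timer and pass the result to the EC2 API; keeping the decision in a pure function makes the schedule logic easy to test without touching real infrastructure.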

2.5 Going live

The nature of the Amazon Web Services infrastructure allowed Salsa to run several development, testing and production environments for less than the cost of a single fixed environment under the existing provider. In this manner the new platform could be thoroughly tested and benchmarked to ensure compatibility and performance prior to flicking the switch.

Outages were minimised by limiting the live migration to replicating database changes to the new production environment, and reducing the Time to Live (TTL) on the DNS records prior to the change over.

3 Leveraging Amazon Web Services

3.1 Amazon Elastic Compute Cloud

Underpinning most other Amazon Web Services products is the Elastic Compute Cloud (EC2), which effectively provides virtual servers in the cloud. Salsa utilised many features of the EC2 service to power the core of the application.

Machine instances

A single Amazon Machine Image (AMI) was built based on Ubuntu that includes Apache, PHP and all required dependencies. When booting as an EC2 instance, custom scripts in the AMI interrogate the metadata provided by the Auto Scaling service and direct the instance to configure itself for its purpose: a web server or application tier worker; and to start serving requests.
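As a rough sketch of how one image can serve two purposes, the boot script only needs to read a role from its metadata and map it to the services to start. The `key=value` metadata format, the `role` key and the service names below are hypothetical; the source does not describe the actual format used.

```python
# Hypothetical mapping from instance role to the services it should start.
ROLE_SERVICES = {
    "web": ["apache2"],
    "worker": ["task-worker"],
}


def parse_metadata(text):
    """Parse 'key=value' lines (an assumed format) into a dict."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def services_for(text):
    """Return the services to start for the role named in the metadata."""
    role = parse_metadata(text).get("role")
    if role not in ROLE_SERVICES:
        raise ValueError(f"unknown or missing role: {role!r}")
    return ROLE_SERVICES[role]
```

Failing loudly on an unknown role is a deliberate choice in this sketch: an instance that cannot work out its purpose should not start serving requests.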

These instances are backed by Amazon Web Services' Elastic Block Store (EBS), which provides block-level access to disks on the instances, but enables the simple snapshotting of live volumes and fine-grained tuning of performance and volume size.

AWS - EC2 Instance Contents

Figure 1. EC2 Instance Contents

Command and control

A single small EC2 instance, running custom software written by Salsa, manages the EC2 infrastructure automatically. Based on simple metadata attached to each EC2 instance, the Command and Control server powers on and shuts down non-production environments as required. As such, most EC2 resources are only powered on during business hours, saving our customer the cost of 103 instance hours per instance per week, or 61% of the cost of running full-time, fixed environments. Additionally, the Command and Control service is responsible for taking regular snapshots of critical, non-generic infrastructure, such as the Command and Control service itself and the interface service, as well as regular snapshots of the database instances.
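The 61% figure follows directly from the hours in a week. The short calculation below reproduces it, assuming that cost scales linearly with instance hours (storage and other fixed charges are ignored in this sketch).

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def weekly_saving(hours_off):
    """Return (hours powered on, fraction of full-time instance hours saved)."""
    hours_on = HOURS_PER_WEEK - hours_off
    return hours_on, hours_off / HOURS_PER_WEEK


hours_on, fraction_saved = weekly_saving(103)
print(hours_on)                 # 65
print(f"{fraction_saved:.0%}")  # 61%
```

The remaining 65 hours would correspond to, for example, 13 hours on each of five weekdays; the exact operating window is not stated in the source.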

3.2 Amazon Relational Database Services

Salsa utilised the Amazon Relational Database Service (RDS) to provide the MySQL layer of the application. RDS instances are managed deployments of MySQL, Oracle or Microsoft SQL servers that are configured and controlled by a web service backed by the AWS cloud infrastructure.

RDS allowed Salsa to provide a standby MySQL instance in a different Availability Zone, with automatic replication and failover providing protection in the event of hardware failure in the primary instance. The work of configuring MySQL slaves, which typically takes several hours, is reduced to a single configuration option on the RDS Management Console.

Similarly, automatic backups were configured to be kept for 7 days, allowing point-in-time restoration to any time within the last week. Beyond that, manual snapshots are taken once a week by the Command and Control server and kept indefinitely. This allows Salsa to provide our customer with the best balance between data availability for disaster recovery and cost effectiveness.

AWS RDS Architecture Design

Figure 2. Amazon RDS Architecture Design

Salsa identified one minor difference between the way MySQL was previously utilised and the RDS implementation: on RDS, the MySQL server’s time zone is set to UTC. This was overcome by creating a three-line stored procedure to reset the time zone for each connection, then enabling it to run automatically for all connecting clients using the RDS Parameter Group functionality.

Subsequently, Salsa undertook performance tuning of the RDS instances by utilising the built-in CloudWatch metrics, and several adjustments to buffer sizes and other performance parameters were made and easily applied to all running instances.

With the release of MySQL 5.6 on Amazon RDS, further availability enhancements can be made to support the automatic failover to a different AWS region such as Singapore or Tokyo, if required.

3.3 Amazon Virtual Private Cloud

Salsa established a Virtual Private Cloud (VPC) within the AWS infrastructure to protect network access to internal servers. Four subnets were set up within the Sydney region: one public and one private subnet for each Availability Zone.

AWS VPC Architecture Design

Figure 3. Amazon VPC architecture design

For increased network security, no application instances are directly accessible from the Internet; access to the web instances is managed via the Elastic Load Balancing service. The RDS instances are likewise provisioned directly into the private VPC subnets. Salsa maintains secure access to the VPC through a direct Virtual Private Network connection to the office.

NAT instances allow the private resources to access the Internet, including the AWS APIs, without exposing them to public traffic.

3.4 Amazon Route 53

Route 53 is Amazon’s Domain Name System (DNS) implementation. Backed by a powerful web service, it can be used to provide functionality beyond that of a traditional DNS setup. Salsa configured each instance to name itself based on its environment and purpose and then register a friendly DNS record upon boot. This allows engineers to address EC2 instances by name (such as prod-web01), without having to look up IP addresses. This occurs automatically and always accurately reflects the running infrastructure.
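The naming convention can be illustrated with a few lines of code. Only the example name prod-web01 comes from the text above; the two-digit index and the "first free number" rule are assumptions for this sketch, and the real boot script would go on to register the chosen name in Route 53.

```python
def friendly_name(environment, purpose, index):
    """Build a host name such as 'prod-web01' from its parts."""
    return f"{environment}-{purpose}{index:02d}"


def next_free_name(environment, purpose, existing):
    """Pick the lowest-numbered name not already present in `existing`."""
    index = 1
    while friendly_name(environment, purpose, index) in existing:
        index += 1
    return friendly_name(environment, purpose, index)
```

For instance, with prod-web01 and prod-web02 already registered, a newly booted production web instance would take prod-web03 under this scheme.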

Similarly, friendly DNS names are used to address other EC2 resources, such as ELB and RDS instances. Low Time-to-Live (TTL) values allow quick manual failover if required and provide a level of abstraction. For example, MySQL databases are often replaced in non-production environments for testing purposes, and Route 53 allows Salsa to change one DNS record to direct traffic to the new database. This reduces the time required to update non-production environments with production data from 8 hours under the old fixed infrastructure to 15 minutes under AWS.

3.5 Elastic load balancing

Salsa utilises the Elastic Load Balancing service to provide a level of workload distribution and improved availability. Creating an ELB instance across multiple Availability Zones creates two levels of redundancy: there are multiple instances of the ELB instance itself, balanced using DNS round-robin on Route 53, plus the ELB instances will route all incoming HTTP and HTTPS traffic to a healthy web instance within the same Availability Zone.

The health of instances is regularly polled by the ELB instances. HTTP polling allows the ELB to determine the health of the web application on each node itself, not just the instance health.

3.6 Auto scaling

Auto Scaling is the magic that provides the elasticity to the application. Salsa configured Auto Scaling groups for the production web and worker nodes. Each group spans multiple Availability Zones and allows for the automatic replacement of unhealthy nodes. Using the Amazon Machine Image (AMI), new instances can be automatically started in response to an unhealthy instance, or a CloudWatch Alarm. For this application, the size of the web instance Auto Scaling group will expand and contract in line with the CPU and memory usage of the web instances, and the size of the worker node pool will likewise adapt to the number of outstanding items in the SQS Queue.
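The relationship between queue depth and worker pool size can be sketched as a simple clamped calculation. In practice the Auto Scaling service reacts to CloudWatch alarms and scaling policies rather than computing a target like this; the throughput figure of 100 messages per worker and the minimum and maximum pool sizes are purely illustrative assumptions.

```python
import math


def desired_workers(queue_depth, per_worker=100, min_size=2, max_size=10):
    """Return a worker count sized to the backlog, kept within group limits."""
    needed = math.ceil(queue_depth / per_worker)
    return max(min_size, min(max_size, needed))
```

The lower bound keeps workers available when the queue is empty (for example, one per Availability Zone), while the upper bound caps cost during a flood of messages.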

AWS Auto Scaling Architecture Design

Figure 4. Auto scaling architecture design

Auto Scaling groups also feature heavily in the deployment of new versions of the application. Like all AWS features, the Auto Scaling service is addressable via an API; this allows custom deployment scripts to create new Launch Configurations based on a new AMI version and automatically restart production instances without impacting application availability.

3.7 Amazon CloudWatch

Out of the box, AWS provides CloudWatch metrics for a long list of its own services. This includes health and hypervisor-level utilisation metrics for EC2 instances, plus in-depth utilisation metrics for RDS and ELB instances, for example. This level of detail has allowed Salsa to continue to proactively monitor the environment, both to detect potential bottlenecks or problems and to analyse and optimise performance. This has led to reduced costs for our customer through right-sizing of over-provisioned resources.

Similarly, support for custom metrics within CloudWatch has enabled a level of visibility into the application performance that was previously unreachable without purchase of additional monitoring software. Such metrics include the incoming and outgoing queue statistics, task execution times, calculation statistics as well as import and export details. With this information, Salsa is able to accurately identify changes in the usage patterns of the application and to recommend changes to the operation of the system to get best value for money for our customer.

3.8 Amazon S3

Amazon S3 provides a centralised and highly durable storage location for interfaces with third parties. Salsa provides an SFTP to S3 gateway to allow legacy interfaces to operate within the cloud environment. From S3, each worker node or web instance is able to access the common storage environment and process interface files as required.

AWS S3 Architecture Design

Figure 5. Amazon S3 architecture design

The security capabilities of the Amazon S3 infrastructure allow Salsa to build an additional level of security into the distribution of the iPhone agent application. Because of the difficulties with extending browser sessions to the Over The Air distribution mechanisms built into the iPhone, it had been prohibitively difficult to secure access to the iPhone app binaries. Through S3 it is trivially easy to generate pre-authenticated URLs to access the secure bucket and download the app binary. These URLs are configured to expire within a few minutes of generation, so they cannot be easily shared with or captured by unauthorised third parties.
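The idea behind an expiring, pre-authenticated URL can be shown with a generic HMAC signature. This is not the signing algorithm S3 uses, and the secret, host name and query parameter names are hypothetical; the sketch only demonstrates why such a link stops working after its expiry time and cannot be altered to point at another file.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key, known only to the server


def sign(path, expires):
    """Sign the path together with its expiry timestamp."""
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_url(base, path, now, lifetime=300):
    """Build a URL that is valid for `lifetime` seconds after `now`."""
    expires = int(now) + lifetime
    query = urlencode({"expires": expires, "signature": sign(path, expires)})
    return f"{base}{path}?{query}"


def is_valid(url, now):
    """Accept the URL only if it is unexpired and its signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = sign(parsed.path, expires)
    return hmac.compare_digest(signature.encode(), expected.encode())
```

Because the expiry time is part of the signed message, a recipient cannot extend the lifetime of a link by editing the query string.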

3.9 Amazon Simple Queue Service

Amazon's Simple Queue Service (SQS) has dramatically influenced the design of the interface with the iPhone app.

Upon packaging, new tasks for agents are sent to a secure SQS Queue that only they have access to. The iPhone app will regularly poll the queue for new messages and process them accordingly. Similarly, the results from agent activities are sent directly to an SQS Queue instead of the web application. This allows for the asynchronous processing of incoming and outgoing messages; agents' iPhones are not left waiting for the server processing to finish, and floods of messages never impact server performance. Should the incoming message queue grow, CloudWatch and Auto Scaling allow the worker nodes to increase to meet the additional demand.

AWS Messaging Architecture Design

Figure 6. Messaging architecture design

The key change to the application was the move from a fixed application tier to a flexible worker pool. Using a custom system built by Salsa, the worker nodes will elect a master node amongst themselves. This master node will then schedule tasks as required according to a crontab-like list and post tasks to an SQS Queue. Each worker node will regularly poll the SQS Queue for new tasks to action. The message visibility feature of SQS works well as a traditional locking mechanism: each worker simply refreshes the visibility timeout of the message it is processing. Should a worker fail to complete processing, the message is made visible again, allowing another worker node to pick up and complete the task. Workers and the master node are designed to fail; each is replaced automatically and new master elections occur every 60 seconds.
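The locking behaviour of the visibility timeout can be demonstrated with a small in-memory stand-in for the queue. This is a simplified model written for illustration, not the SQS API: the class and method names are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class VisibilityQueue:
    """In-memory model of a queue with per-message visibility timeouts."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}          # message id -> body
        self._invisible_until = {}   # message id -> time it becomes visible
        self._next_id = 0

    def send(self, body):
        """Add a message that is immediately visible to workers."""
        self._next_id += 1
        self._messages[self._next_id] = body
        self._invisible_until[self._next_id] = 0.0
        return self._next_id

    def receive(self, now):
        """Hand out one visible message and hide it from other workers."""
        for message_id, body in self._messages.items():
            if self._invisible_until[message_id] <= now:
                self._invisible_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def extend(self, message_id, now):
        """Heartbeat from a live worker: keep the message hidden for longer."""
        if message_id in self._messages:
            self._invisible_until[message_id] = now + self.visibility_timeout

    def delete(self, message_id):
        """Remove a message once its task has been completed."""
        self._messages.pop(message_id, None)
        self._invisible_until.pop(message_id, None)
```

A worker that keeps calling `extend` holds the task exclusively; if it crashes and the heartbeats stop, the timeout lapses and the next `receive` hands the same task to another worker. No separate lock service is needed.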

Tasks can similarly schedule the processing of other tasks in the SQS Queue, allowing complex tasks to be chained in a workflow-style process, with each sub-task being queued and distributed across the worker node pool.

AWS Workload Distribution Architecture Design

Figure 7. Workload distribution architecture design

4 Results

4.1 Total cost of ownership reduced by 72%

Through the migration to Amazon Web Services, Salsa was able to eliminate waste and re-architect the application to make full use of the available infrastructure, and to shut down resources when they were not needed.

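| Environment       | Traditional Host | Amazon Web Services | Saving          |
|-------------------|------------------|---------------------|-----------------|
| Production        | $58,098          | $19,333             | $38,765 (66%)   |
| Testing / UAT     | $65,088          | $24,895             | $40,193 (61%)   |
| Development       | $6,000           | $5,413              | $587 (9%)       |
| Disaster Recovery | $47,916          | $0                  | $47,916 (100%)  |
| Total             | $177,102         | $49,641             | $127,461 (72%)  |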

Table 1. Yearly infrastructure costs

Prices are correct at time of writing, are in Australian Dollars ($) and do not include GST.
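The totals in Table 1 can be checked with a short calculation. The figures below are copied from the table; the variable and function names are only for this illustration.

```python
# Yearly costs in Australian dollars: (traditional host, AWS), from Table 1.
COSTS = {
    "Production": (58098, 19333),
    "Testing / UAT": (65088, 24895),
    "Development": (6000, 5413),
    "Disaster Recovery": (47916, 0),
}


def saving(traditional, aws):
    """Return the saving as (dollar amount, fraction of the traditional cost)."""
    amount = traditional - aws
    return amount, amount / traditional


total_traditional = sum(traditional for traditional, _ in COSTS.values())
total_aws = sum(aws for _, aws in COSTS.values())
total_saving, total_fraction = saving(total_traditional, total_aws)

for name, (traditional, aws) in COSTS.items():
    amount, fraction = saving(traditional, aws)
    print(f"{name}: ${amount:,} ({fraction:.1%})")
print(f"Total: ${total_saving:,} ({total_fraction:.1%})")
```

The computed total saving is $127,461 on $177,102, which rounds to the 72% quoted. The per-environment percentages in the table appear to be rounded down to whole numbers.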

4.2 Improved availability and durability

Despite the elimination of the Disaster Recovery environment, Salsa has been able to improve the overall availability of the application by splitting resources that were previously clustered in a single rack in a single data centre across geographically disparate Availability Zones without the provision of wasted infrastructure. The application can sustain up to and including the failure of an entire Availability Zone automatically and without outage to our customer.

The replacement of fixed virtual machines with flexible EC2 instances has dramatically improved the durability of the application. Hardware or software failures no longer need to be thoroughly investigated and fixed to restore the application to a working state. Hardware and software faults are dealt with automatically by the Auto Scaling service. Often, the application will have repaired itself within minutes, before engineers are aware of the issue.

4.3 Improved performance and scalability

Likewise, CloudWatch and the Auto Scaling services provided Salsa with the opportunity to re-architect the application for the cloud. Peak system demands on web servers result in the automatic starting and stopping of additional web instances, ensuring customers are not impacted by additional load.

The move to a flexible worker pool that can expand and contract according to workload has allowed Salsa to architect the next version of the application for our customer with confidence that the infrastructure demands can be met. Indeed, the new flexible workload management has allowed a 200% increase in message throughput whilst simultaneously supporting the reduction in virtual machine specifications to half that of the original fixed offering.

4.4 Leveraging the cloud – architected for AWS

By taking full advantage of Amazon Web Services' cloud offerings, Salsa was able to deliver an enterprise platform for our customer's business in a shorter timeframe. Hardware lead and setup times on a traditional hosting provider were negated by AWS' quick spin up times for new resources.

In all, ten different AWS services were used to deliver the solution across all of the service categories: Compute and Networking, Storage and Content Delivery, Database, Deployment and Management, and Application Services.

4.5 Increased agent efficiency by 600%+

Through the usage of Salsa's digital data collection solution and the Amazon Web Services platform, our customer was able to increase their efficiency by 600%+ over traditional paper-based processes. This has resulted in a faster turnaround for their clients, and a reduced workforce.