How-to: Migrate existing apps & build new apps in the AWS Cloud the right way
The Move to the Cloud ☁️
As more organizations start their public cloud journey, chances are they will either be migrating existing APIs/applications to the cloud or building new ones. If you are looking to leverage AWS, you want to make sure that you choose the right technical stack for your APIs/applications.
AWS is like a huge candy store, with well over 100 available services to choose from for building applications. As a responsible developer, you want to identify and weigh the pros and cons of each particular AWS service. You will need to review things like: costs and pricing, security, availability and reliability, performance, integration with other applications and platforms, future growth/scalability options, and more.
These are aspects you should address as early as possible, before drafting any architectural design models or writing any code. Addressing these concerns up front will help you build highly reliable APIs/applications that grow with your business.
Step 0: Know your Business and Customer Requirements
This step is of utmost importance while considering AWS. Before you pick any AWS service, you need to know your business and customer requirements.
Some important considerations include:
- The Business flows your API/application will support
- The Customer group — who executes those business processes/flows and where from?
- How frequently is each flow executed?
- Understanding the business impact, through quantifiable metrics, if any of these flows stop working or malfunction for 1 minute, 5 minutes, 60 minutes, 8 hours, or 24 hours
Having the answers to these questions will give you the right context to evaluate potential AWS services.
Step 1: Consider Performance and Future Scalability Options
This can be an expensive error if not given proper and thorough consideration. Applications are meant to grow over time, either through additional calling applications or added features. Although in the early stages it can be difficult to precisely pinpoint how many resources/instances to allocate to a particular application, it is always safe to consider the following points:
Is the service of our choosing scalable?
While assessing a specific AWS service, try to pinpoint the AWS resources or configuration settings that will help drive scale for your applications. This will enable you to start with the correct capacity for each AWS resource, and to have a procedure in place to scale your AWS infrastructure as your application's utilization grows over time. Examples include:
- Number of Provisioned EC2 instances
- Lambda function concurrency
- RDS read replicas
- EKS Cluster size
- Elastic Load Balancers
- DynamoDB capacity units
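Sizing these resources starts with a simple capacity estimate. Here is a minimal sketch of that arithmetic; the request rates, headroom, and instance-count floor are illustrative assumptions, not AWS recommendations:

```python
import math

def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3, min_instances: int = 2) -> int:
    """Rough instance count for a target peak load.

    headroom: fraction of spare capacity to keep (0.3 = 30%).
    min_instances: floor for redundancy across Availability Zones.
    """
    required = peak_rps / (rps_per_instance * (1 - headroom))
    return max(min_instances, math.ceil(required))

# e.g. 1,500 req/s peak, ~200 req/s per instance, 30% headroom
print(instances_needed(1500, 200))  # → 11
```

The same shape of calculation applies to Lambda concurrency, RDS read replicas, or DynamoDB capacity units; only the "units per resource" input changes.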
Are there resource utilization limits?
AWS can seem like an infinite source of computing power, but there are limits. Specifically, limits applied to your company's account.
Here are examples of some resource limit dimensions:
- Number of provisioned instances (Examples: EC2 instances, SQS Queues, CloudWatch Alarms, IAM Roles/Users, S3 Buckets, VPCs)
- Throughput (Examples: Concurrent Lambda executions, CloudWatch List/Describe/Put requests, SNS messages per second, DynamoDB capacity units, EFS throughput in Gb/s)
- Storage size (Example: EBS volume size limits)
- Data retention policy periods (Examples: CloudWatch metric logs, Kinesis streams, SQS message retention)
- Payload size (Examples: SQS messages, DynamoDB items, Kinesis records, IoT messages, IAM policy size)
Keep in mind that some limits can be increased (for example, number of EC2 instances you can launch), while others cannot (for example, metric retention in CloudWatch). You need to ensure that there are no deal breakers for your application in the “cannot be increased” category. Otherwise you will pigeonhole yourself into one type of setup for a long time.
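A lightweight way to catch deal breakers early is to compare your projected usage against the relevant quotas before committing to a design. The sketch below uses a hand-maintained dictionary of example values (the Lambda concurrency default, the fixed SQS message size, and CloudWatch's fixed metric retention are real published figures, but always confirm your account's actual quotas in the Service Quotas console):

```python
# Illustrative quota values -- check your account's actual quotas.
DEFAULT_QUOTAS = {
    "lambda_concurrent_executions": 1000,      # adjustable on request
    "sqs_message_size_bytes": 262_144,         # fixed, cannot be raised
    "cloudwatch_metric_retention_days": 455,   # fixed, cannot be raised
}

def over_quota(projected: dict) -> list:
    """Return the quota names your projected usage would exceed."""
    return [name for name, value in projected.items()
            if value > DEFAULT_QUOTAS.get(name, float("inf"))]

print(over_quota({"lambda_concurrent_executions": 2500,
                  "sqs_message_size_bytes": 200_000}))
# → ['lambda_concurrent_executions']
```

Anything flagged in a "cannot be raised" dimension should send you back to the drawing board before you write code against that service.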
Step 2: Think about Availability
AWS has a pretty solid engineering support team that is always ready to prevent and fix all sorts of failure scenarios. But, failures in the AWS services will occur from time to time. You want to minimize the risk of decreased customer satisfaction, and lost revenue for your company. Therefore, you have to consider the options a specific AWS service gives you in order to properly handle failure.
It is also worth noting how difficult those mechanisms are to set up. The effort is well worth it in the long run.
What are some possible failure scenarios?
At a high-level, these are some types of failure situations you may have to deal with for a particular service:
- Cannot access existing AWS resources — (Examples: unable to view existing S3 buckets, can’t properly describe EC2 instances, increased latency due to high failure rates on an API)
- Cannot create new/additional AWS resources — (Examples: facing increased error rates and latency while launching EC2 instances)
- Failing connectivity within Availability Zones — (Examples: connectivity errors and/or increased latency between EC2 instances in the same or different Availability Zones, higher error rates between an RDS primary instance and its read replicas)
Before settling on a service, I suggest examining the common failure scenarios of a particular AWS service and incorporating the recovery strategies into your proposed design. The AWS Service Health Dashboard is a good place to view the current status, as well as the history, of all AWS services region-wide. This will help you understand the history of particular services and the failures that have occurred in the past, so you can be well-informed.
How can we mitigate some failure scenarios?
Use failover mechanisms to bring your backup/redundant AWS resources online. Examples are: using Application Load Balancing with health checks to route traffic to healthy instances, having Auto Scaling groups for fleets of EC2 instances, and backup-region routing in case of Route53 DNS failures.
Backup and Redundancy Measures
Redundancy enables you to have a backup (or additional set) of your AWS resources that will kick in when there are failures. Examples are: Having a fleet of additional EC2 instances ready for traffic, replicated data in S3 buckets (or even S3 versioning-enabled buckets), RDS Read Replicas as mirror databases, and Cross-region data replication in DynamoDB.
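At its core, every failover mechanism above boils down to the same logic: prefer the primary, fall back to a healthy backup. Here is a tiny sketch of that selection logic; the endpoint hostnames are hypothetical, and in practice Route53 health checks or an ALB would do this for you:

```python
def pick_healthy_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint, preferring the primary.

    endpoints: list ordered by preference (primary first, then backups).
    is_healthy: callable performing a health check on one endpoint.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("all endpoints failed their health checks")

# Hypothetical primary/backup regions behind Route53-style failover:
endpoints = ["api.us-east-1.example.com", "api.us-west-2.example.com"]
down = {"api.us-east-1.example.com"}
print(pick_healthy_endpoint(endpoints, lambda e: e not in down))
# → api.us-west-2.example.com
```

The design question for each AWS service is simply which parts of this loop the service manages for you, and which parts you must build yourself.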
Step 3: Implement Security controls
AWS provides a number of different methods to give you control and ensure your AWS resources and data are accessed and stored in a secure manner. IAM (Identity and Access Management) is the key, global method for authentication and permissions management inside AWS. Beyond IAM, there are additional methods and tools that vary by service (Cognito, KMS, etc.).
Create Resource-based permission policies
By and large, everything accessed in AWS is controlled by roles and resource-based permission policies. Resource-based policies are attached to resources (such as S3 buckets or SNS topics) and specify which AWS accounts and principals are allowed to access a particular resource, and the actions (like GET, PUT, DELETE) that can be executed against it.
You also have IAM roles, which carry permission policies and can be assumed by, or assigned to, users and resources. Roles are flexible in that an entity can assume them on an as-needed basis; when the access is no longer required or authorized, you can unassign the role and be done.
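To make this concrete, here is a minimal, hypothetical S3 resource-based (bucket) policy. The account ID and bucket name are placeholders; it grants one trusted account read-only access to the bucket and its objects:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadFromTrustedAccount",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
```

Note how the policy names the principal (who), the actions (what), and the resources (on which): that triple is the shape of every resource-based policy in AWS.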
Enable CloudTrail logging
CloudTrail enables logging and auditing of nearly all AWS events. You can view API activity logs via CloudTrail and find out details like which API methods are being called, the entity calling them, etc. This helps from a traceability perspective, giving you a paper trail of events in case something suspicious, such as a breach, occurs.
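The questions an audit usually needs answered — who called which API, on which service, from where — map directly onto fields of a CloudTrail record. The sketch below parses a trimmed, hypothetical record (real records carry many more fields than shown here):

```python
import json

# A trimmed, hypothetical CloudTrail record; field names follow the
# documented event structure, but the values are made up.
record = json.loads("""
{
  "eventTime": "2020-12-01T12:34:56Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteObject",
  "userIdentity": {"type": "IAMUser", "userName": "alice"},
  "sourceIPAddress": "203.0.113.7"
}
""")

# Answer the audit questions: who called which API, and from where?
summary = (f'{record["userIdentity"]["userName"]} called '
           f'{record["eventName"]} on {record["eventSource"]} '
           f'from {record["sourceIPAddress"]}')
print(summary)
# → alice called DeleteObject on s3.amazonaws.com from 203.0.113.7
```

In practice you would run queries like this over CloudTrail logs delivered to S3, or through Athena, rather than over a single hand-written record.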
Encrypt your Data
Encryption in transit is usually readily available through HTTPS. Encryption at rest protects data while it is stored. Services like S3, Elastic Block Store, Glacier, Elastic MapReduce, Redshift, and DynamoDB offer encryption at rest to protect stored data.
Step 4: Set up Proactive Monitoring and Alerting
Early incorporation of operational processes is extremely valuable. If you’re thinking about a particular AWS service, evaluating its features in this context will enable you to identify architecture and application components that will in turn make your operations easier. The earlier you look into this, the better. Monitoring is a critical part of operational procedures. I would also highly suggest knowing which metrics are available for a particular AWS service.
Enable CloudWatch monitoring
CloudWatch enables capturing of performance metrics. Nearly every AWS service publishes CloudWatch metrics that enable you to keep system health in check, and to take any corrective action required when something is not working as desired.
Performance metrics vary from service to service in terms of how often they are published. Example: API Gateway publishes metrics every minute, EC2 publishes metrics every five minutes by default (one-minute detailed monitoring is available at an additional cost), and Elastic MapReduce publishes every five minutes.
You need to ensure that performance metric intervals make sense for your application, in order to effectively know how your product is performing from a health perspective.
CloudWatch also allows you to create dashboards that visualize and filter data based on performance metric filters, percentiles, and metric math (MAX, MIN, AVG, etc.). Some metrics are not available by default through CloudWatch, so you may need to figure out a way to capture those data points yourself. Example: EC2 memory and disk utilization are not default metrics available from CloudWatch.
Either way, a method of gathering performance metrics should be another key consideration, so that you can tell at any time how your application is performing and address any concerns.
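Percentiles matter here because averages hide spikes. The sketch below uses a nearest-rank percentile over some made-up latency datapoints to show why you would chart p99 alongside AVG and MAX on a dashboard:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile, as used for latency SLOs (e.g. p99)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-minute latency datapoints (milliseconds):
latencies = [120, 95, 110, 100, 105, 98, 2400, 115, 102, 99]

print(sum(latencies) / len(latencies))  # average → 334.4
print(percentile(latencies, 50))        # p50 → 102 (typical request)
print(percentile(latencies, 99))        # p99 → 2400 (the spike)
```

One slow outlier drags the average to 334 ms even though the typical request takes ~100 ms; the p99 is what actually exposes the 2.4-second spike. CloudWatch computes these percentile statistics for you, but it helps to know what they mean.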
Appropriately manage incidents
In IT, incidents and unexpected behavior are commonplace. The key question when picking a service is: how difficult or seamless will it be to deal with them?
Create CloudWatch Alarms and SNS Notifications
By default, all CloudWatch alarms have the option to send SNS (Simple Notification Service) notifications. You have options for how you’d like to learn about particular incidents once a notification is sent (examples: send an e-mail, push a message to an SQS queue, invoke a Lambda function, or call a vendor incident-response platform via HTTPS). At a minimum, you should set up some alarms and alerts.
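As a sketch of how little is involved, here is a hypothetical CloudFormation fragment wiring an API Gateway 5XX-error alarm to an SNS topic. The API name, e-mail address, and threshold are illustrative placeholders, not recommended values:

```yaml
Resources:
  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Protocol: email
          Endpoint: oncall@example.com   # placeholder address
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/ApiGateway
      MetricName: 5XXError
      Dimensions:
        - Name: ApiName
          Value: my-api                  # placeholder API name
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopic
```

The `AlarmActions` list is where an e-mail becomes an SQS push, a Lambda invocation, or a pager: you swap the subscription, not the alarm.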
Build realtime dashboards
This will help you monitor your infrastructure in realtime and learn about problems before they escalate. Many services like DataDog, Splunk, and Grafana integrate nicely with AWS services and can be configured and stood up within minutes. This is a proactive measure.
Other points to consider include: if we choose this AWS service, how are we going to know if/when something is not right?
- Will the existing performance metrics from this service be sufficient to create a meaningful alert/alarm?
- What are the right time increments (one-minute or five-minute) to be able to tell if a CloudWatch alarm needs to be triggered?
- Do we need more responsive measures than just the alarm?
- Once the alarm is triggered, do I have the right infrastructure in place that will handle failover and effective remediation?
- Will I need to create new response methods if I choose to implement this particular service?
- If we pick this AWS service, how can we address issues through automated means?
Deployment methods and practices
This aspect is often considered far too late: how are we going to deploy this thing? Will we automate deployments? Use GitOps? Deployments become more and more frequent as you grow your application and gain additional consumers. Bad deployment practices can create bottlenecks in your workflows and disrupt shipping features to market. Always try to automate deployments, and look into using GitOps for releases.
IaC — Terraform, CloudFormation, Pulumi, etc.
Terraform and CloudFormation are both infrastructure-provisioning tools that can create entire project setups and deployments in AWS through automation. Terraform is cloud-provider agnostic, while CloudFormation is native to AWS. You can choose whichever one you prefer based on how much flexibility you need.
They both have a learning curve, so it is important to experiment with them and understand how they work. It’s also important to know how to roll back if necessary, as well as any automated processes/setup that you may need to run before or after your Terraform or CloudFormation deployment.
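To give a feel for the declarative style, here is a minimal, illustrative Terraform configuration: one EC2 instance behind a security group. The AMI ID is a placeholder and the names are made up; this is a sketch, not a production setup:

```hcl
provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "web" {
  name = "web-sg"
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "app" {
  ami                    = "ami-00000000000000000" # placeholder AMI ID
  instance_type          = "t2.large"
  vpc_security_group_ids = [aws_security_group.web.id]
}
```

Running `terraform plan` before `apply` shows you exactly what will change, which is where the rollback discipline mentioned above starts: review the plan, keep state backed up, and you can destroy or revert cleanly.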
Step 5: Know about your Costs
Determine the AWS price dimensions as linked to your application and AWS service
A lot of teams tend to disregard important pricing attributes of their applications. Example: estimating EC2 costs based primarily on instance type, without accounting for the accrued storage costs in EBS or data transfer costs.
Example: you’ve got a file-sharing server on an EC2 t2.large instance, and you’re only focused on the hourly compute costs for that instance type. Alone, the compute charges for a t2.large come to around $70 per month (as of this writing) in the N. Virginia region. Then, if your server transfers one terabyte of data per month externally, the transfer cost alone can range anywhere between $90 and $110. One terabyte of Elastic Block Store can cost around $100 per month. Imagine 2, 4, 10, 30…
If this type of oversight is made at scale, you could be paying upwards of a couple of thousand dollars per month in unforeseen additional costs for this type of setup.
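The arithmetic behind the example above is worth writing out. The prices below are illustrative us-east-1 on-demand figures from around the time of writing; always check the current AWS price list before relying on them:

```python
# Back-of-the-envelope monthly cost for the file-sharing example.
HOURS_PER_MONTH = 730

t2_large_hourly = 0.0928       # USD/hour, on-demand (illustrative)
transfer_out_per_gb = 0.09     # USD/GB out to the internet (illustrative)
ebs_gp2_per_gb_month = 0.10    # USD/GB-month for gp2 (illustrative)

compute = t2_large_hourly * HOURS_PER_MONTH   # hourly rate x hours
transfer = 1024 * transfer_out_per_gb         # 1 TB out per month
storage = 1024 * ebs_gp2_per_gb_month         # 1 TB of EBS

print(f"compute  ${compute:.2f}")   # ~ $67.74
print(f"transfer ${transfer:.2f}")  # ~ $92.16
print(f"storage  ${storage:.2f}")   # ~ $102.40
print(f"total    ${compute + transfer + storage:.2f}")
```

Compute is roughly a quarter of the real bill here; the "invisible" transfer and storage lines are each bigger than the instance itself.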
Identify and calculate price at scale
Try to evaluate AWS services based on price calculations for low and high usage of a particular service. What happens when your application sees more than expected levels of traffic? What if you experience less than you previously thought? A lot less?
Some AWS services cost less per unit as usage increases. S3 is one of them, with tiered pricing that depends on usage.
Lambda functions can be a double-edged sword in these scenarios. Some can be more expensive at 200 transactions per second than an EKS cluster containing six m3.large instances. This also depends on the function's type and intensity: if it takes 1,200 ms to run and consumes up to 512 MB of memory per execution, you could easily be paying upwards of $15,000 for your Lambda. But if that same function did not receive a relatively large number of calls, Lambda would be the prime option.
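One way to sketch the Lambda math for a scenario like this, using the published pay-per-use rates (GB-seconds plus per-request charges; figures are illustrative and your actual bill depends on current pricing and free-tier effects):

```python
# Rough monthly Lambda bill at sustained load.
PER_GB_SECOND = 0.0000166667   # USD per GB-second (illustrative)
PER_MILLION_REQUESTS = 0.20    # USD per 1M requests (illustrative)

tps = 200
seconds_per_month = 60 * 60 * 24 * 30
invocations = tps * seconds_per_month        # 518.4M calls/month

duration_s = 1.2    # 1,200 ms per execution
memory_gb = 0.5     # 512 MB allocated

gb_seconds = invocations * duration_s * memory_gb
compute_cost = gb_seconds * PER_GB_SECOND
request_cost = invocations / 1_000_000 * PER_MILLION_REQUESTS

print(f"${compute_cost + request_cost:,.0f} per month")
```

Running the numbers like this for both the serverless and the cluster option, at your realistic low and high traffic levels, is the only reliable way to know which side of the double-edged sword you are on.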
Originally published at https://tech.wahab2.com on December 7, 2020.