This article is a cautionary tale for people using Amazon Web Services (AWS) and a testament to Amazon's awesome customer support.
Thu, Jul 3, 2014 at 8:58 AM: The Shock
The day started bright with a light cloud cover and a cool, refreshing breeze. A rare and lovely day in the blistering summers of Cairo. I was hacking away in the quiet morning hours when I noticed that the month's AWS bill had arrived. I took a look and was rattled to see that it was for over 1,800 USD!
Let me put you in the picture. For reasons beyond the scope of this post, we opted to build our app architecture on AWS. Let's just say that on a typical month we pay much, much less than the quoted 1,800 USD. Typically our costs break down roughly as follows:
- 50% for EC2, which we use to host our HTTP servers, task queues, search index, map-reduce cluster and internal analytics.
- 20% for S3 & CloudFront, which we use to serve images.
- 30% for data transfer, which is mainly the website's bandwidth.
After going through the five stages of grief, I went on to check the billing report, which showed that this spike in costs was due to “data transfer out”.
Yes, you are seeing this correctly: the servers transmitted around 14,785 GB of data (and yes, that's around 14.4 TB). For a delusional moment there I thought we must be serving Google-sized traffic!! To put it in perspective, according to this bill, our servers transferred out 250 times more data in June than they did in May. Needless to say, this doesn't map to a 250-fold growth in traffic .. oh, we wish!
What on earth consumed this huge bandwidth?
We did a quick first investigation but came out empty-handed. The data we got from Google Analytics did not account for this kind of traffic. Neither did our internal access logs from nginx, nor the logs for search queries through Elasticsearch. We suspected a denial-of-service attack, but the volume of requests per month (as reported by our load balancer) was not a match either.
Things just didn’t make sense!!
Looking deeper into our usage, two resources provided by AWS were very helpful:
- The AWS usage report. This shows the exact usage for each AWS service, by day and hour.
- AWS Cost Explorer. This visualizes the cost per service and allows us to monitor cost and usage clearly, day by day.
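As an aside for readers digging into a similar mystery today: the same numbers Cost Explorer shows can now also be pulled programmatically. A minimal sketch with the AWS CLI's Cost Explorer API (which did not exist back in 2014; the usage-type group name is an assumption and may differ in your account):

```shell
# Hypothetical sketch: daily cost and usage for internet-bound data transfer.
# Requires configured AWS credentials with Cost Explorer access.
aws ce get-cost-and-usage \
  --time-period Start=2014-06-01,End=2014-07-01 \
  --granularity DAILY \
  --metrics UnblendedCost UsageQuantity \
  --filter '{"Dimensions":{"Key":"USAGE_TYPE_GROUP","Values":["EC2: Data Transfer - Internet (Out)"]}}'
```

Plotting the daily `UsageQuantity` from the response would have made our second-week-of-June spike jump out immediately.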
The outcome of our second investigation was this: there were huge data-transfer spikes in the second week of June, but the existential question remained: what caused these spikes?!
Eventually we decided to do three things:
- To see if we could set up alerts in case similar spikes occurred in the future. This turned out to be possible by adding billing alerts through CloudWatch.
- To contact Amazon and see if they could shed some light on the situation.
- To top up our credit card so we could pay up! It is unfortunate and all, but we didn't want to risk downtime.
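For the first item, a CloudWatch billing alarm can be created with a single AWS CLI call. A sketch, assuming billing metrics are already enabled on the account (the threshold and SNS topic ARN below are placeholders, not our real values):

```shell
# Sketch: alarm when the month's estimated charges exceed $500.
# Billing metrics live in us-east-1 regardless of where your servers run.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name monthly-bill-over-500-usd \
  --namespace "AWS/Billing" \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 500 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts
```

The `EstimatedCharges` metric updates a few times a day, so a six-hour period (21600 seconds) is enough; anyone subscribed to the SNS topic gets notified the moment the bill crosses the line.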
Thu, Jul 3, 2014 at 2:12 PM: The lengthy reply
A few hours after we got in touch with Amazon, we got an answer. A very thorough answer. An 887-word-long answer! In their reply, they asked a few questions about our usage of AWS, primarily because AWS support does not have access to an account's data and configuration.
They also pointed us to a few resources to help with our investigation of the problem, like where to look in the AWS usage report and documentation on security best practices for AWS.
So after responding to the questions, we were asked to wait until the case was reviewed.
Sat, Jul 5, 2014 at 8:01 AM: Amazon requests closing ports.
Upon review, the main thing they pointed out was that the setup of our security groups left several ports open to the public web. In AWS, security groups are the way to configure the firewall (inbound and outbound traffic) for running servers.
It turns out that one of those open rules allowed ICMP traffic from anywhere! (Strictly speaking, ICMP is a protocol rather than a port, but our security group had an inbound rule permitting all of it.) With ICMP open, the cause of the traffic surge could very likely be a ping flood. This is supported by the fact that logs for the running services (nginx, unicorn, Elasticsearch) did not show any proportional activity during the days with the I/O increase.
The other port that we were asked to close was 22 (SSH). At first I found this peculiar: how would we SSH into the servers freely, especially since the team is not always on the same network?
When I asked, I was referred to the AWS security best practices for SSH. The argument is that SSH should not be accessible globally from anywhere, but only through trusted networks. The logic behind it, in relation to the data I/O spike, is that the authentication handshake itself causes I/O, which can be prevented. We now maintain the IP for each developer to allow their SSH access to the servers, and flush the configuration each week for good measure.
So we effectively closed both ports and asked AWS support for another review.
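The two changes boil down to three security-group rule edits. A sketch with the AWS CLI, assuming open-to-the-world rules like ours existed (the group ID and office CIDR below are placeholders):

```shell
# Placeholder security group ID; substitute your own.
SG=sg-0123456789abcdef0

# Stop answering ICMP (pings) from anywhere: remove the all-ICMP rule.
aws ec2 revoke-security-group-ingress \
  --group-id "$SG" --protocol icmp --port -1 --cidr 0.0.0.0/0

# Close SSH to the public web...
aws ec2 revoke-security-group-ingress \
  --group-id "$SG" --protocol tcp --port 22 --cidr 0.0.0.0/0

# ...and re-open it only for a trusted network range.
aws ec2 authorize-security-group-ingress \
  --group-id "$SG" --protocol tcp --port 22 --cidr 203.0.113.0/24
```

Our weekly flush-and-re-add of developer IPs is just the last command repeated per developer, preceded by a revoke of the previous week's entries.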
Tue, Jul 8, 2014 at 10:16 AM: Sigh of relief
We received the following email from Amazon. In short, Amazon is awesome! :)
Good Day Its Tiffany here again from AWS billing support. Once again Thank you for your patience and for working with us to resolve this case as well as to prevent any further unexpected charges from occurring on your account. As a one time courtesy I have gone ahead and waved your June bill and you are no longer liable to pay the charge of $1,872.46. I hope this helps. Please let me know if there is anything else that I can assist you with or if I should close this case. Have a lovely day further. Best regards, Tiffany O Amazon Web Services
The lessons we took away from all this:
- Take a good look at the default security policies. They may not always be the safest for your system.
- Even if a port is secure, close it to the public web if you don’t need it open. The authentication request itself adds data I/O overhead.
- Set cost alerts for your usage and keep an eye on Cost Explorer.
We hope this will prove helpful to some of you guys out there. If you have any questions or wish to share your own experience, please drop us a comment below. Thanks for reading! :)