35 days. As part of benchmarking our ResultSet compression protocol, we wanted to set up our own private network on AWS. 35 days. That’s the number of working days it took us to get a working virtual private cloud (VPC) on AWS. Why so long, you ask? Great question! This post describes some of the prerequisite knowledge you need to create a VPC, and the lessons we learned.
Classless Inter-Domain Routing (CIDR)
CIDR blocks are used to allocate IP addresses and route packets. Quick question: what is the difference between 10.0.5.0/23 and 10.0.5.0/16? The former says “allocate 2**(32–23) = 512 IP addresses”. A block has to start on a multiple of its own size, so 10.0.5.0 is not actually a valid start for a /23: the /23 that contains it is 10.0.4.0/23, which runs from 10.0.4.0 to 10.0.5.255. The latter says a similar thing but allocates many more IP addresses (2**16 = 65,536), covering 10.0.0.0 to 10.0.255.255. The prefix /x decides the size of your subnet: the smaller the x, the larger the block.
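A quick way to check what a CIDR block really covers is Python's standard ipaddress module. The sketch below is only an illustration (the helper name describe is ours, not part of any AWS tooling). It uses strict=False so that a block whose start address is not aligned to its size is normalized to the enclosing block instead of raising an error.

```python
import ipaddress

def describe(cidr):
    """Return (normalized block, first address, last address, size)."""
    # strict=False masks off the host bits instead of raising ValueError,
    # so a misaligned start address is snapped to the enclosing block.
    net = ipaddress.ip_network(cidr, strict=False)
    return (str(net), str(net[0]), str(net[-1]), net.num_addresses)

if __name__ == "__main__":
    for block in ("10.0.5.0/23", "10.0.5.0/16"):
        print(block, "->", describe(block))
```

With the default strict=True, ip_network("10.0.5.0/23") raises ValueError because host bits are set, which is a handy sanity check before typing a CIDR into a console.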
Route tables
Route tables decide how packets are routed within your network. In your VPC, make sure there is a rule that says where packets sent from your local network to the VPC end up, and another that covers packets from sources outside your local subnet. It is good practice to keep rules narrow. E.g., avoid a catch-all rule for all Internet traffic (0.0.0.0/0) unless you really need one.
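To make the idea of a route table concrete, here is a minimal sketch of how a lookup picks a route: among all entries whose destination block contains the packet's destination, the most specific one (longest prefix) wins. The table contents and target names ("local", "vpn-gateway") are hypothetical and only for illustration.

```python
import ipaddress

# Hypothetical route table: (destination CIDR, target).
ROUTES = [
    ("10.0.4.0/23", "local"),            # traffic that stays inside the VPC
    ("192.168.10.0/24", "vpn-gateway"),  # traffic back to the office network
]

def lookup(dest_ip, routes=ROUTES):
    """Return the target of the most specific matching route, or None."""
    dest = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes:
        net = ipaddress.ip_network(cidr)
        if dest in net:
            matches.append((net.prefixlen, target))
    if not matches:
        return None  # no route: the packet is dropped
    # Longest prefix match: the entry with the largest prefix length wins.
    return max(matches)[1]

if __name__ == "__main__":
    print(lookup("10.0.5.20"))     # inside the VPC block
    print(lookup("192.168.10.7"))  # goes back over the VPN
    print(lookup("8.8.8.8"))       # no matching route
```

Because there is no 0.0.0.0/0 entry in this table, anything that is not explicitly routed has nowhere to go, which is the behavior you want for a private benchmarking network.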
Security groups
A.k.a. firewalls. A security group is a set of rules you define that determine which packets may come in and out of your instances on AWS. By default, a new security group rejects all inbound traffic. You can then allow traffic on SSH (22), and allow packets that come from your network (say, you are going to access your cluster from your corporate network IP 188.8.131.52/24) on some ports (e.g., all TCP traffic on port 8080, and all ICMP traffic). These rules can be as fine or as coarse as you wish.
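The allow-list behavior can be sketched in a few lines. This is a simplified model, not how AWS implements security groups: the Rule class, the is_allowed function, and the source range 203.0.113.0/24 (a documentation-only range standing in for a corporate network) are all assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    protocol: str         # "tcp", "icmp", or "all"
    port: Optional[int]   # None means any port (or not applicable, as for ICMP)
    source: str           # CIDR block the traffic must come from

# Hypothetical inbound rules mirroring the examples in the text.
RULES = [
    Rule("tcp", 22, "203.0.113.0/24"),     # SSH from the corporate network
    Rule("tcp", 8080, "203.0.113.0/24"),   # TCP on port 8080
    Rule("icmp", None, "203.0.113.0/24"),  # all ICMP (ping)
]

def is_allowed(rules, protocol, port, source_ip):
    """Allow the packet if any rule matches; otherwise deny by default."""
    src = ipaddress.ip_address(source_ip)
    for rule in rules:
        if rule.protocol not in ("all", protocol):
            continue
        if rule.port is not None and rule.port != port:
            continue
        if src in ipaddress.ip_network(rule.source):
            return True
    return False  # nothing matched: inbound traffic is rejected

if __name__ == "__main__":
    print(is_allowed(RULES, "tcp", 22, "203.0.113.40"))
    print(is_allowed(RULES, "tcp", 22, "198.51.100.9"))
```

Note that the rules only ever allow; there is no "deny" entry, so tightening access means removing or narrowing rules.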
Network ACLs
Network access control lists (ACLs) work similarly to security groups. We used them initially, but after we reported our problem to AWS, they suggested not using ACLs, as some other VPC users had also had problems with them. So, our suggestion w.r.t. ACLs: don’t use them.
Creating a VPC
Amazon Web Services (AWS) offers a lot of possibilities for deploying your applications on the “cloud”. Out of the box, all instances launched under an account created after Aug 2013 are placed in a default Virtual Private Cloud (VPC) by AWS. What is a VPC, you ask?
A VPC is your own private network on AWS. As you can expect, AWS has a lot of nodes, and when you create a cluster of 8 nodes, chances are that a lot of other nodes are operating in your vicinity. A default cluster operates in a VPC with a capacity of close to 5000 nodes. This is a very large network, and chances are high that your traffic shares its path with other traffic. When benchmarking for performance, you would prefer a sanitized network that carries only your traffic. A great way to get this is to set up your own VPC on AWS, with your own route tables, firewall rules and nodes. You can configure it just like you would your on-premise network: define route tables, network ACLs and VPN connections, create your own subnets with your own CIDRs, and shape it as your requirements demand.
There are 4 scenarios that your VPC can operate under. In this post, we will talk about scenario 4, where all access to your VPC goes via a VPN connection. Scenario 1 is a completely public-facing setup, which is not what we want. Scenario 2 is a public/private subnet scenario and, truth be told, we initially felt it was the best fit and even created a VPC using it. But we realized that we don’t have any public-facing content (e.g., a hosted static website) and that all the resources in the private subnet were harder to reach. To explain the last point: a node in the private subnet is not directly ssh-able or ping-able. You need to create a bastion node in the public subnet and then use that to access the private one. This is doable with Linux, but doing it with Windows is very slow. For the same reason, we also rejected scenario 3. Scenario 4 fits our requirements perfectly: we want control over which users can access the network, and we want a completely private network, as it is mainly for benchmarking.
Following the packets
ssh -A firstname.lastname@example.org
Connection timed out
request timed out
request timed out
request timed out
What!! We have set up the route table, the IP address is correct, the internal Simba firewall shows that the packets are leaving our network, what could be wrong?! Aargh.
Oh I know! We can do a traceroute and that will tell us what’s going on!
. . . . . .
So it is leaving our firewall gateway machine but not able to reach the AWS machines! Oh, maybe the security group is not allowing traffic! That must be it! Let’s see what the rules are.
ALL ALL 0.0.0.0/0 0.0.0.0/0
Hmm, no. The security group is allowing all traffic on SSH (22) and all ICMP traffic, no questions asked. Hey, what about route tables?! Maybe, it is not accepting packets from our subnet?
It accepts all packets coming from the subnet block 184.108.40.206/24. I don’t trust this, we should confirm this with firewall support. They can double check our results.
Firewall support confirms that the packets are, in fact, leaving our network and that the VPN tunnels are up.
Hmm, what could go wrong then?!
We tried our best to resolve this ourselves and then called AWS support. We signed up for the “developer” plan, which costs $49 a month, allows you to create support cases, and guarantees a response within 12 hours (time-zone specific). We emailed AWS support, and here is what happened.
Problem: Unable to ping a private instance with IP, 10.10.13.136.
- They asked us to stop using ACLs. So we stopped using them and set them back to the default.
- This did not help in accessing the instance, so we were asked to assign an Elastic IP to the instance. What is an Elastic IP? It is a static public IP address that you can associate with an instance and later remap to another one. This helps in handling instance failures without affecting those who were using the failed instance. In this case, having an Elastic IP made the instance directly accessible from the Internet. Why did we take this step? Because the instance had to be accessed one way or another so that we could see what was going on inside it!
- This led to an interesting situation. 10.10.13.136 was still inaccessible from our private network but 10.10.13.136 could ping a machine inside our network! So traffic from A to B wasn’t possible but B to A was! This is now getting even more confusing!
- Remember how we mentioned that we are using scenario 4? Yes, that includes using a VPN tunnel. You create two tunnels so that one can act as a backup. On “following the packets”, we realized that the packets were leaving Simba’s network through one tunnel but coming back through the other. This can be a huge problem if your router does not support asymmetric routing. BUT! This is alright, because our router does support asymmetric routing.
- This screenshare was on June 12th, 2015, and on June 10th AWS had launched VPC Flow Logs, which are published to CloudWatch Logs. Flow logs let you see the traffic reaching an instance you cannot log in to: you enable logging on the VPC, subnet or network interface and read the records on your screen. Enabling this took a while, because you have to create an IAM role before you can use it. IAM is for managing who uses which services in an organization. It is one of those features of AWS that make it robust and manageable in a large organization. Moving on, we created an IAM role just for the flow logs and then waited.
- The developer from AWS was very familiar with the firewall’s command line. He asked us to run “vpn tu” on the firewall. This lets you compare the security parameter index (SPI) of the VPN tunnel on both sides. The SPI was the same on both sides.
- After this, we ran the command “fw ctl zdebug + drop | grep 10.10.13.136” on the firewall to see whether any packets were being dropped. This showed that only packets with the source IP address 220.127.116.11 were being sent. This address is the public IP of the firewall! The flow logs set up above showed the same thing. Do you see what we saw?! The CIDR of our internal subnet block is 18.104.22.168/24, and the route table was set up to *only* accept packets from that block. The source IP in the flow logs and in the firewall command output was 22.214.171.124, the *public* address of the firewall. See why the packets might be getting rejected?! The firewall was NAT’ing the source address! Regardless of which *private* IP at Simba was trying to reach the AWS node, our firewall rewrote it to the public IP seen by the outside world. The VPC, rightfully, rejected these packets, as it had been explicitly told to accept only packets carrying an IP from our private network.
- To confirm that this was the issue, we added a static route to the public IP 126.96.36.199. Now the ping started to work.
- This was not a permanent solution. We wanted to access the nodes from our private IPs, not via the NAT IP. To do this, we needed to add a rule to our firewall that stops it from NAT’ing our IP addresses. So we did that. What this rule says is: “Hey firewall, when the instances in the IP range 10.0.5.0 to 10.0.6.255 are being pinged/ssh’ed to from our private network, do not NAT the traffic!”
- After this, we removed the temporary static route added to 188.8.131.52 and the ping still worked, thus solving our initial problem.
- But, but, but, the VPC we set up can talk to Simba’s machines only through the VPN tunnel. What about talking to the public Internet? Typically, these networks do not need to talk to the Internet, hence they are kept completely private. But our aim with this network is to use it for Hive benchmarking, so the instances need access to the public Internet once, to set up the relevant services on them. To achieve this, we added another rule to our firewall that lets those instances access the public Internet, but only through our corporate network. This makes it slow, because the packets are always routed via the tunnel, but gives us greater security. And as it’s a one-time operation, the reduction in download speed is alright.
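The NAT problem from the steps above can be reproduced with a small simulation. Everything here is a simplified sketch under stated assumptions: the office subnet 192.168.10.0/24, the firewall public IP 203.0.113.10, and the VPC block 10.10.13.0/24 are hypothetical stand-ins, and the two functions only model the decision logic, not any real firewall or AWS API.

```python
import ipaddress

# Hypothetical addresses, chosen only for illustration.
OFFICE_SUBNET = ipaddress.ip_network("192.168.10.0/24")  # block the VPC accepts
FIREWALL_PUBLIC_IP = "203.0.113.10"                      # what NAT rewrites to
VPC_BLOCK = ipaddress.ip_network("10.10.13.0/24")        # holds 10.10.13.136

def egress_source(src_ip, dst_ip, nat_exempt):
    """Source address a packet carries after leaving the office firewall."""
    dst = ipaddress.ip_address(dst_ip)
    # A NAT exemption keeps the private source for the listed destinations.
    if any(dst in net for net in nat_exempt):
        return src_ip
    # Otherwise the firewall replaces the source with its public address.
    return FIREWALL_PUBLIC_IP

def vpc_accepts(src_ip):
    """The VPC only accepts packets whose source is in the office subnet."""
    return ipaddress.ip_address(src_ip) in OFFICE_SUBNET

if __name__ == "__main__":
    # Before the fix: no exemption, so the source becomes the public IP.
    before = egress_source("192.168.10.25", "10.10.13.136", nat_exempt=[])
    print(before, vpc_accepts(before))
    # After the fix: traffic to the VPC block is exempt from NAT.
    after = egress_source("192.168.10.25", "10.10.13.136", nat_exempt=[VPC_BLOCK])
    print(after, vpc_accepts(after))
```

The temporary workaround in the story corresponds to widening what the VPC accepts to include the public IP; the permanent fix corresponds to the nat_exempt list.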
And this is how we set up our own VPC on AWS.
Setting up a VPC on AWS is some work, but the payoff is a network tailored to your configuration, with the confidence that traffic from other networks is not interfering with your packets. You can also set up smaller subnets that are specific to the projects you are working on and send their packets through the same tunnel. Having said that, choose carefully when deciding which scenario to use for the VPC, as it’s impossible to change it once created. If you are sure that the network will always remain private, then scenario 4 is justified. If you want to deploy a public-facing website, then scenario 2 might be more suitable. As a last note, AWS support was fantastic, and we would highly recommend getting it when trying to set up large clusters on AWS.