Working with HDFS deployed on EC2 instances

Disclaimer: this blog post does not address security concerns. Its goal is to simplify testing against HDFS deployed on EC2.

Problem

When HDFS is deployed on EC2, the NameNode communicates with the DataNodes via their private IPs or domain names, and it exposes those same private addresses to the client when performing read/write operations. If your client is inside EC2 (as it usually is in production), you are fine. But once you leave the EC2 network (for development and testing), you are in trouble: your laptop knows nothing about private EC2 addresses and can only reach the instances via their public IPs or domain names.
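For illustration, here is a minimal sketch (not from the original post) of the kind of client read that runs into this. The NameNode host name, port 8020, and the file path are placeholders.

[java]
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadFromLaptop {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode itself is reachable through its public address...
        conf.set("fs.defaultFS", "hdfs://namenode-public-host:8020");
        FileSystem fs = FileSystem.get(conf);
        // ...but this read stalls outside EC2: the NameNode hands back the
        // DataNodes' private EC2 addresses, which the laptop cannot reach.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/tmp/test.txt"))))) {
            System.out.println(reader.readLine());
        }
    }
}
[/java]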

Solution

Use a SOCKS tunnel! The long story is written here. Below is the short story:
1. Add the following to your core-site.xml:
[xml]
<property>
  <name>hadoophack.tunnel.port</name>
  <value>2600</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:${hadoophack.tunnel.port}</value>
</property>
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
[/xml]
2. Open a tunnel on hadoophack.tunnel.port. Personally I used PuTTY as described here: http://www.virtualroadside.com/blog/index.php/2007/04/12/dynamic-socks-proxy-using-putty/. But you may also use plain ssh: ssh -D 2600 _username_@_hadoop_namenode_host_
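If you prefer not to edit core-site.xml, the same properties can be set programmatically on the client. The sketch below (not from the original post) assumes the tunnel from step 2 is already open on port 2600; the NameNode host name and the file path are placeholders.

[java]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOverSocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-public-host:8020");
        // Same values as in core-site.xml above: route Hadoop's sockets
        // through the local SOCKS proxy opened with "ssh -D 2600 ...".
        conf.set("hadoop.socks.server", "localhost:2600");
        conf.set("hadoop.rpc.socket.factory.class.default",
                 "org.apache.hadoop.net.SocksSocketFactory");

        FileSystem fs = FileSystem.get(conf);
        // Connections now go out through the tunnel, so the DataNodes'
        // private addresses are reached from inside the EC2 network.
        System.out.println(fs.exists(new Path("/tmp/test.txt")));
    }
}
[/java]

With the XML variant instead, a quick check from the shell (once the tunnel is open) is something like hdfs dfs -ls / using the same core-site.xml on the client.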

PS: tested with the freshly released HDP 2.0.5 installed via Ambari.