I recently ran into the following error when trying to copy data from a local Hadoop cluster into Amazon S3:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
$ hadoop distcp -i / s3n://USERID:SECRETKEY@BUCKETNAME/
11/04/13 07:15:31 INFO tools.DistCp: srcPaths=[/]
11/04/13 07:15:31 INFO tools.DistCp: destPath=s3n://USERID:SECRETKEY@BUCKETNAME/
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.lang.NullPointerException
    at org.apache.hadoop.tools.DistCp.makeRelative(DistCp.java:901)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1059)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:884)

After some googling, I discovered that this turned out to be a bug in Hadoop 0.20. A fix for Hadoop 0.21 is already submitted, but what do I do in the meantime?

It took quite a bit of puzzling, but I eventually figured out that this is only a bug in distcp. In particular, this bug does NOT apply to hadoop fs -cp. So we can solve this with a little command-line fu:

1
2
3
4
for p in `hadoop dfs -ls / | cut -c 66-100`;
do
    hadoop fs -cp $p s3n://USERID:SECRETKEY@BUCKETNAME$p;
done;

Subscribe to the mailing list