Scala - Hadoop DistCp does not create the destination folder when a single file is passed
I am facing the issue below with Hadoop DistCp; any suggestions are highly appreciated.
I am trying to copy data from Google Cloud Platform to Amazon S3.
1) When there are multiple files to copy from source to destination (this works fine)

val sourceFile: String = "gs://xxxx_-abc_account2621/abc_account2621_click_20170616*.csv.gz"
(multiple files are matched because of the * in the file name)

Output: s3://s3bucketname/xxx/xxxx/clientid=account2621/date=2017-08-18/

Files in the above path:
abc_account2621_click_2017061612_20170617_005852_572560033.csv.gz
abc_account2621_click_2017061616_20170617_045654_572608350.csv.gz
abc_account2621_click_2017061622_20170617_103107_572684922.csv.gz
abc_account2621_click_2017061623_20170617_120235_572705834.csv.gz
2) When there is a single file to copy from source to destination (this is the issue)

val sourceFile: String = "gs://xxxx_-abc_account2621/abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"

Output: s3://s3bucketname/xxx/xxxx/clientid=account2621/

File in the above path:
date=2017-08-18 (the directory is replaced by the file content and has no file extension)

It looks like, because the destination directory does not exist yet, DistCp treats the last path segment (date=2017-08-18) as the target file name instead of a directory to copy into.
Code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.DistCp
import org.apache.hadoop.util.ToolRunner

def main(args: Array[String]): Unit = {
  val Array(environment, customer, typesOfTables, clientId, fileDate) = args.take(5)
  val s3Path: String = customer + "/" + typesOfTables + "/" + "clientid=" + clientId + "/" + "date=" + fileDate + "/"
  val sourceFile: String = "gs://xxxx_-abc_account2621//abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
  val destination: String = "s3n://s3bucketname/" + s3Path
  println(sourceFile)
  println(destination)
  val filePaths: Array[String] = Array(sourceFile, destination)
  executeDistCp(filePaths)
}

def executeDistCp(filePaths: Array[String]) {
  val conf: Configuration = new Configuration()
  conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  conf.set("google.cloud.auth.service.account.enable", "true")
  conf.set("fs.gs.project.id", "xxxx-xxxx")
  conf.set("google.cloud.auth.service.account.json.keyfile", "/tmp/xxxxx.json")
  conf.set("fs.s3n.awsAccessKeyId", "xxxxxxxxxxxx")
  conf.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxxxxxxx")
  conf.set("mapreduce.application.classpath", "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop-lzo/lib/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/emr/lib/*,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,/usr/share/aws/emr/cloudwatch-sink/lib/*,/usr/share/aws/aws-java-sdk/*,/tmp/gcs-connector-latest-hadoop2.jar")
  conf.set("HADOOP_CLASSPATH", "$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
  val outputDir: Path = new Path(filePaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)
  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filePaths)
}
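To see what actually lands at the destination, one way is to list the destination's parent path with the Hadoop FileSystem API and check whether each entry came out as a file or a directory. This is a minimal diagnostic sketch and not part of the original job; it assumes the same Configuration and destination string that are built in main() above, and the helper name inspectDestination is mine.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Diagnostic sketch: list the destination's parent and report whether each
// entry is a directory or a plain file. `conf` and `destination` are assumed
// to be the Configuration and destination string from main() above.
def inspectDestination(conf: Configuration, destination: String): Unit = {
  val destPath = new Path(destination)
  val fs: FileSystem = destPath.getFileSystem(conf)
  val parent = destPath.getParent
  if (fs.exists(parent)) {
    fs.listStatus(parent).foreach { status =>
      val kind = if (status.isDirectory) "directory" else "file"
      println(s"$kind -> ${status.getPath}")
    }
  } else {
    println(s"Parent path does not exist: $parent")
  }
}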
The issue above was fixed by adding the code below.

Code:
val makeDir: Path = new Path(filePaths(1))
makeDir.getFileSystem(conf).mkdirs(makeDir)
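For context, here is my reading of how the fix slots into executeDistCp: the destination directory is recreated after the delete and before DistCp runs, so a single source file is copied into the date=... directory instead of being written as a file with the directory's name. This is a sketch of the relevant tail of the method, with names mirroring the code above.

  // ... same conf.set(...) calls as above ...
  val outputDir: Path = new Path(filePaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)

  // Fix: recreate the destination directory before running DistCp, so a
  // single-file copy lands inside it rather than replacing it.
  val makeDir: Path = new Path(filePaths(1))
  makeDir.getFileSystem(conf).mkdirs(makeDir)

  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filePaths)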