Scala - Hadoop DistCp does not create the destination folder when a single file is passed
I am facing the issue below with Hadoop DistCp; any suggestions are highly appreciated.
I am trying to copy data from Google Cloud Platform to Amazon S3.
1) When there are multiple files to copy from source to destination (this works fine)

val sourceFile: String = "gs://xxxx_-abc_account2621/abc_account2621_click_20170616*.csv.gz"
(multiple files are matched because of the * in the file name)

Output: s3://s3bucketname/xxx/xxxx/clientid=account2621/date=2017-08-18/

Files in the above path:
abc_account2621_click_2017061612_20170617_005852_572560033.csv.gz
abc_account2621_click_2017061616_20170617_045654_572608350.csv.gz
abc_account2621_click_2017061622_20170617_103107_572684922.csv.gz
abc_account2621_click_2017061623_20170617_120235_572705834.csv.gz
2) When there is a single file to copy from source to destination (this is the issue)

val sourceFile: String = "gs://xxxx_-abc_account2621/abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"

Output: s3://s3bucketname/xxx/xxxx/clientid=account2621/

File in the above path:
date=2017-08-18 (the directory is replaced by the file content and has no file extension)

It looks like, because the destination directory does not exist yet, DistCp treats the last path segment (date=2017-08-18) as the target file name instead of a directory to copy into.
Code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.DistCp
import org.apache.hadoop.util.ToolRunner

def main(args: Array[String]): Unit = {
  val Array(environment, customer, typesOfTables, clientId, fileDate) = args.take(5)
  val s3Path: String = customer + "/" + typesOfTables + "/" + "clientid=" + clientId + "/" + "date=" + fileDate + "/"
  val sourceFile: String = "gs://xxxx_-abc_account2621//abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
  val destination: String = "s3n://s3bucketname/" + s3Path
  println(sourceFile)
  println(destination)
  val filePaths: Array[String] = Array(sourceFile, destination)
  executeDistCp(filePaths)
}

def executeDistCp(filePaths: Array[String]) {
  val conf: Configuration = new Configuration()
  conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  conf.set("google.cloud.auth.service.account.enable", "true")
  conf.set("fs.gs.project.id", "xxxx-xxxx")
  conf.set("google.cloud.auth.service.account.json.keyfile", "/tmp/xxxxx.json")
  conf.set("fs.s3n.awsAccessKeyId", "xxxxxxxxxxxx")
  conf.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxxxxxxx")
  conf.set("mapreduce.application.classpath", "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop-lzo/lib/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/emr/lib/*,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,/usr/share/aws/emr/cloudwatch-sink/lib/*,/usr/share/aws/aws-java-sdk/*,/tmp/gcs-connector-latest-hadoop2.jar")
  conf.set("HADOOP_CLASSPATH", "$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
  val outputDir: Path = new Path(filePaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)
  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filePaths)
}
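To see what actually lands at the destination, one way is to list the destination's parent path with the Hadoop FileSystem API and check whether each entry came out as a file or a directory. This is a minimal diagnostic sketch and not part of the original job; it assumes the same Configuration and destination string that are built in main() above, and the helper name inspectDestination is mine.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Diagnostic sketch: list the destination's parent and report whether each
// entry is a directory or a plain file. `conf` and `destination` are assumed
// to be the Configuration and destination string from main() above.
def inspectDestination(conf: Configuration, destination: String): Unit = {
  val destPath = new Path(destination)
  val fs: FileSystem = destPath.getFileSystem(conf)
  val parent = destPath.getParent
  if (fs.exists(parent)) {
    fs.listStatus(parent).foreach { status =>
      val kind = if (status.isDirectory) "directory" else "file"
      println(s"$kind -> ${status.getPath}")
    }
  } else {
    println(s"Parent path does not exist: $parent")
  }
}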
The issue above was fixed by adding the code below.

Code:
val makeDir: Path = new Path(filePaths(1))
makeDir.getFileSystem(conf).mkdirs(makeDir)
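For context, here is my reading of how the fix slots into executeDistCp: the destination directory is recreated after the delete and before DistCp runs, so a single source file is copied into the date=... directory instead of being written as a file with the directory's name. This is a sketch of the relevant tail of the method, with names mirroring the code above.

  // ... same conf.set(...) calls as above ...
  val outputDir: Path = new Path(filePaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)

  // Fix: recreate the destination directory before running DistCp, so a
  // single-file copy lands inside it rather than replacing it.
  val makeDir: Path = new Path(filePaths(1))
  makeDir.getFileSystem(conf).mkdirs(makeDir)

  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filePaths)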