Questions
I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to the local filesystem first?
For example, when I run the script on files in a local directory I have:
cd /path/to/files
for file in *.xml
do
    python /path/processxml.py "$file" > /path2/"$file"
done
So basically, how would I go about doing the same when the files are in HDFS?
Answers
You basically have two options:
1) Use Hadoop Streaming to create a MapReduce job (here you will only need the map part). Run this command from the shell or inside a shell script:
hadoop jar <path to the hadoop-streaming jar> \
    -D mapred.job.name=<name for the job> \
    -input /hdfs/input/dir \
    -output /hdfs/output/dir \
    -file your_script.py \
    -mapper "python your_script.py" \
    -numReduceTasks 0
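One caveat: Streaming does not pass your script a filename argument the way the local loop does. It feeds the mapper records on stdin (by default one line of text at a time, so a multi-line XML document will not arrive as a single record) and collects whatever the mapper writes to stdout. A minimal sketch of the shape your_script.py would need, with the actual XML processing left as a placeholder:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: read records from stdin,
# write results to stdout. The real XML handling goes in process().
import sys

def process(line):
    # Placeholder transformation; replace with your XML processing.
    return line.strip()

for line in sys.stdin:
    print(process(line))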
2) Create a Pig script that ships your Python code. A basic example:
input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` SHIP('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';
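As with Streaming, the STREAM operator hands each record to the script on stdin and reads results back from stdout, so the same stdin/stdout interface sketched above applies. Assuming you save the statements as process_xml.pig (a hypothetical filename), you would launch the job with:

pig process_xml.pig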
Source
License: CC BY-SA 3.0
http://stackoverflow.com/questions/35070998/processing-multiple-files-in-hdfs-via-python