Hadoop - Processing multiple files in HDFS via Python

Questions

I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to local first in order to do so?

For example, when I run the script on files in a local directory I have:

cd /path/to/files

for file in *.xml
do
    python /path/processxml.py "$file" > /path2/"$file"
done

So basically, how would I go about doing the same when the files are in HDFS?

Answers

You basically have two options:

1) Use the Hadoop Streaming jar to create a MapReduce job (here you will only need the map part). Run this command from the shell or inside a shell script:

hadoop jar <the location of the hadoop streaming jar> \
        -D mapred.job.name=<name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0
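
With Hadoop Streaming the mapper communicates over stdin/stdout, so your_script.py (or a thin wrapper around it) has to read its input from standard input rather than take a filename argument. One possible arrangement, sketched below, is to make the job's -input a text file that lists one HDFS path per line and have the mapper fetch and process each file; the process_xml function is a placeholder for whatever processxml.py actually does:

#!/usr/bin/env python
# Hypothetical streaming mapper: assumes the job's -input is a text
# file listing one HDFS path per line; each path is fetched with the
# standard 'hdfs dfs -cat' CLI and handed to the processing logic.
import subprocess
import sys

def process_xml(xml_text):
    # Placeholder for whatever processxml.py does with one document.
    return xml_text.strip()

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    xml_text = subprocess.check_output(['hdfs', 'dfs', '-cat', path])
    print(process_xml(xml_text.decode('utf-8')))

Since the job runs with zero reduce tasks, whatever the mapper writes to stdout ends up as the part files under /hdfs/output/dir.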

2) Create a Pig script and ship your Python code with it. Here is a basic example of the script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` SHIP('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';
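
As with streaming, the script that Pig pipes records through reads from stdin and writes to stdout; by default each input tuple arrives as a tab-delimited line, and each output line is parsed back into a tuple. A minimal sketch of your_script.py in that role (the field handling here is an assumption, adjust it to your actual data):

#!/usr/bin/env python
# Hypothetical Pig STREAM script: Pig sends each tuple as a
# tab-delimited line on stdin and reads tab-delimited lines back
# on stdout to form the output relation.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # Placeholder transformation: upper-case the first field.
    if fields and fields[0]:
        fields[0] = fields[0].upper()
    print('\t'.join(fields))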

Source

License: CC BY-SA 3.0

http://stackoverflow.com/questions/35070998/processing-multiple-files-in-hdfs-via-python
