Text file record count using the Pool class in Python
I have a program that lists and reads the files in a directory and counts the total number of records present in the files concurrently.
When I run the code below, I get a list of worker thread names with counts coming in chunks, because the record counting for multiple files runs in parallel.
    import multiprocessing as mp
    import time
    import os

    path = '/home/vaibhav/desktop/input_python'

    def process_line(f):
        print(mp.current_process())
        #print("process id = " , os.getpid(f))
        print(sum(1 for line in f))

    for filename in os.listdir(path):
        print(filename)

        if __name__ == "__main__":
            with open('/home/vaibhav/desktop/input_python/'+ filename, "r+") as source_file:
                # chunk the work into batches
                p = mp.Pool()
                results = p.map(process_line, source_file)

    start_time = time.time()
    print("my program took", time.time() - start_time, "to run")

Current output:
    <ForkProcess(ForkPoolWorker-54, started daemon)>
    73
    <ForkProcess(ForkPoolWorker-55, started daemon)>
    <ForkProcess(ForkPoolWorker-56, started daemon)>
    <ForkProcess(ForkPoolWorker-53, started daemon)>
    73
    1
    <ForkProcess(ForkPoolWorker-53, started daemon)>
    79
    <ForkProcess(ForkPoolWorker-54, started daemon)>
    <ForkProcess(ForkPoolWorker-56, started daemon)>
    <ForkProcess(ForkPoolWorker-55, started daemon)>
    79
    77
    77

Is there a way around this, so that I can get the total record count of each file, like this:
    file1.txt total_recordcount
    ...
    fileN.txt total_recordcount

Update: I got the solution and pasted the answer in the comments section.
Counting lines in a text file should not be CPU-bound, so it is not a good candidate for threading. You might want to use a thread pool for processing multiple independent files, but for a single file, here's a way to count lines that should be very fast:
    import pandas as pd

    data = pd.read_table(source_file, dtype='S1', header=None, usecols=[0])
    count = len(data)

What this does is parse the first character (S1) into a DataFrame and then check its length. The parser is implemented in C, so no slow Python loop is required. This should provide close to the best possible speed, limited only by your disk subsystem.
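If pandas is not available, the same idea (keep the per-line work out of Python-level loops) can be sketched with the standard library by reading the file in large binary chunks and counting newline bytes. This is only an illustrative sketch: the function name `count_lines` and the 1 MiB chunk size are assumptions, not part of the original answer.

```python
def count_lines(file_path, chunk_size=1024 * 1024):
    """Count lines in a file by counting newline bytes in binary chunks.

    `count_lines` and `chunk_size` are hypothetical names chosen for this sketch.
    """
    count = 0
    last_byte = b"\n"  # treat an empty file as having zero lines
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # bytes.count runs in C, so there is no per-line Python loop
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte != b"\n":
        count += 1
    return count
```

Unlike the pandas call, this counts blank lines as well, so the two approaches may disagree on files that contain empty lines.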
This sidesteps the original problem completely, because you get a single count per file.
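For the per-file output requested in the question, one option is to hand each worker a file path rather than an open file. In the question's code, `p.map(process_line, source_file)` iterates over the open file, so each worker receives a single line as a string and `sum(1 for line in f)` counts its characters, which would explain the interleaved numbers. The following is a minimal sketch of the thread-pool idea mentioned above, not the asker's final solution: the names `count_records` and `count_directory` are hypothetical, and `multiprocessing.pool.ThreadPool` is swapped in for `mp.Pool` on the assumption that the work is I/O-bound.

```python
import os
from multiprocessing.pool import ThreadPool


def count_records(file_path):
    """Return (file name, number of lines) for a single file."""
    # Binary mode avoids decoding errors; iterating still yields one item per line.
    with open(file_path, "rb") as f:
        return os.path.basename(file_path), sum(1 for _ in f)


def count_directory(directory, workers=4):
    """Count the lines of every regular file in `directory`, one task per file."""
    paths = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # map() returns results in the same order as `paths`,
    # so the per-file totals never interleave.
    with ThreadPool(workers) as pool:
        return pool.map(count_records, paths)


if __name__ == "__main__":
    directory = "/home/vaibhav/desktop/input_python"  # path taken from the question
    if os.path.isdir(directory):
        for name, total in count_directory(directory):
            print(name, total)
```

Because each task returns its result instead of printing from inside the worker, the main program can print one `name total` line per file once all counts are in.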