This is a Python implementation of the wildly popular One Billion Row Challenge, which was originally posed as a Java-only challenge: https://github.com/gunnarmorling/1brc
- All executions and runtimes were recorded with Python 3.12 on an Apple M2 Pro machine with 16 GB of RAM and a 500 GB hard disk.
- `createMeasurements.py` is copied from https://github.com/ifnesi/1brc/blob/main/createMeasurements.py.
- For every iteration, the runtime noted below is the best runtime observed over multiple trials with different batch sizes and other parameter adjustments.
| Iteration Number | Runtime (in seconds) | Comments |
|---|---|---|
| 1 | 1354.95 | Avg + Min + Max, single process |
| 2 | | |
| 3 | | |
| 4 | | |
- Declare 4 global variables:
  - `total_temp_per_place_all_batches` - Dict[str, List[str]] - {place: [total_temp_till_prev_batch, number_of_times_place_has_appeared_till_prev_batch]}
  - `avg_temp_per_place_all_batches` - Dict[str, float] - {place: avg_temp_till_prev_batch}
  - `min_temp_per_place_all_batches` - Dict[str, float] - {place: min_temp_till_prev_batch}
  - `max_temp_per_place_all_batches` - Dict[str, float] - {place: max_temp_till_prev_batch}
- Read the file in batches. Call `batch_calculation()` to calculate the average batch by batch as the file is read. Inside this function:
  - First, the tuple object that was read is split and converted to a list: [Place, Temperature].
  - This list is then converted to a dict {Place, Temperature} and appended to a List[Dict] variable, `input_batch_list`.
  - Finally, the `input_batch_list` variable is passed to the `calc_average_over_entire_data()` function.
- The `calc_average_over_entire_data()` function achieves two objectives:
  - When called from within `batch_calculation()`, it iterates over the `input_batch_list` and calculates the average per batch. To do this, we iterate over each {place, temp} combo in the `input_batch_list`.
    - If the place is present:
      - Add the temp to `total_temp_per_place_all_batches[place]` and increment its `number_of_times_place_has_appeared_till_prev_batch` by 1.
      - Compare the temp with the min temp in `min_temp_per_place_all_batches[place]`; if it is less than the existing value, update `min_temp_per_place_all_batches[place]`.
      - Compare the temp with the max temp in `max_temp_per_place_all_batches[place]`; if it is greater than the existing value, update `max_temp_per_place_all_batches[place]`.
    - If the place is not present:
      - Add a new element to `total_temp_per_place_all_batches` with the place as the key, the temp as the `total_temp_till_prev_batch`, and `number_of_times_place_has_appeared_till_prev_batch` set to 1.
      - Add a new element to `min_temp_per_place_all_batches` with the place as the key and the temp as the minimum temp.
      - Add a new element to `max_temp_per_place_all_batches` with the place as the key and the temp as the maximum temp.
    - Finally, add or update the avg temp for the place in `avg_temp_per_place_all_batches` by dividing the `total_temp_till_prev_batch` by `number_of_times_place_has_appeared_till_prev_batch`.
  - When `calc_average_over_entire_data()` has run over all batches, the min, max, and avg for each place will have been updated incrementally in the respective global variables, with the place name as the key and the corresponding temp as the value.
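
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code. It assumes each input line has the form `Place;Temperature` (the format used by the 1BRC data generator) and that the totals are stored as numbers rather than strings. The `process_file()` helper and its `batch_size` default are hypothetical names introduced only for this example.

```python
from itertools import islice
from typing import Dict, Iterable, List

# Running aggregates keyed by place name (names follow the description above).
# Assumption: totals are kept as numbers, i.e. place -> [total_temp, count].
total_temp_per_place_all_batches: Dict[str, List[float]] = {}
avg_temp_per_place_all_batches: Dict[str, float] = {}
min_temp_per_place_all_batches: Dict[str, float] = {}
max_temp_per_place_all_batches: Dict[str, float] = {}


def calc_average_over_entire_data(input_batch_list: List[Dict[str, float]]) -> None:
    """Fold one batch of {place: temp} records into the running aggregates."""
    for record in input_batch_list:
        for place, temp in record.items():
            if place in total_temp_per_place_all_batches:
                # Place seen before: extend total/count and adjust min/max.
                totals = total_temp_per_place_all_batches[place]
                totals[0] += temp
                totals[1] += 1
                if temp < min_temp_per_place_all_batches[place]:
                    min_temp_per_place_all_batches[place] = temp
                if temp > max_temp_per_place_all_batches[place]:
                    max_temp_per_place_all_batches[place] = temp
            else:
                # First appearance: the temp is the total, the min, and the max.
                total_temp_per_place_all_batches[place] = [temp, 1]
                min_temp_per_place_all_batches[place] = temp
                max_temp_per_place_all_batches[place] = temp
            # Refresh the running average for this place.
            total, count = total_temp_per_place_all_batches[place]
            avg_temp_per_place_all_batches[place] = total / count


def batch_calculation(lines: Iterable[str]) -> None:
    """Parse one batch of raw lines and hand it to the aggregation step."""
    input_batch_list: List[Dict[str, float]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: "Place;Temperature" with ';' as the separator.
        place, temperature = line.split(";")
        input_batch_list.append({place: float(temperature)})
    calc_average_over_entire_data(input_batch_list)


def process_file(path: str, batch_size: int = 100_000) -> None:
    """Hypothetical driver: read the file in fixed-size batches of lines."""
    with open(path, encoding="utf-8") as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            batch_calculation(batch)
```

Calling `process_file("measurements.txt")` (the file name is an assumption) would leave the per-place results in the four global dictionaries. Because only running totals, counts, minima, and maxima are kept, memory use depends on the number of distinct places and the batch size, not on the total number of rows.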