Performance
Although I had promised in the previous post to follow the original series closely, a conversation with a friend prompted me to see whether I could improve the current performance of both the Python and Golang implementations by leveraging parallel processing. So I am going to take a small detour to find out before coming back to the database implementation in the next post.
In Python, numba enables parallel processing by calling the appropriate APIs and passing the parallel option. Golang has native support for goroutines, which makes concurrency and parallelism trivial. In this post I will try out both approaches to see how they fare.
I will not be rigorously following benchmarking techniques to get the most precise results; the focus is on the approaches rather than exact, reproducible numbers.
Status Quo
Before we start, let us revisit where we left off in both Python and Golang. I have also added a longer 20M test so that the runs take slightly longer, reducing the impact of startup/teardown times.
Implementation/Time(s) | 10M | 20M |
---|---|---|
Python | ~25 | ~60 |
Golang | ~15 | ~40 |
We can also take a look at the code that gave the above results:
We start with the DivisorSum function. In Python, we have added JIT compilation using numba's njit decorator, which gives us the performance boost that keeps it competitive with the compiled Go code.
import math

from numba import njit


@njit
def divisor_sum(n: int) -> int:
    '''Compute the sum of divisors of a positive integer.'''
    if n <= 0:
        return 0
    i = 1
    limit = math.sqrt(n)
    the_sum = 0
    # Walk i up to sqrt(n); each divisor i pairs with n // i.
    while i <= limit:
        if n % i == 0:
            the_sum += i
            # Avoid double-counting the square root of a perfect square.
            if i != limit:
                the_sum += n // i
        i += 1
    return the_sum
This is the equivalent Golang code:
// DivisorSum returns the sum of all the divisors of a positive integer n.
func DivisorSum(n int64) (int64, error) {
if n <= 0 {
return 0, errors.New("value cannot be less than or equal to zero")
}
limit := math.Sqrt(float64(n))
sum := int64(0)
for i := int64(1); float64(i) <= limit; i++ {
if n%i == 0 {
sum += i
if float64(i) != limit {
sum += n / i
}
}
}
return sum, nil
}
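As a quick sanity check (my addition, not part of the original series), here is a tiny driver exercising the function, assuming DivisorSum from above is in the same package:

package main

import "fmt"

func main() {
	// Divisors of 36: 1, 2, 3, 4, 6, 9, 12, 18, 36, which sum to 91.
	// The loop only walks i = 1..6, adding each pair (i, n/i) and
	// counting the square root 6 just once.
	sum, err := DivisorSum(36)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum) // 91
}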
On to the wrapper function in Python, which runs the above function across all the numbers serially to find the best Witness.
def witness_value(n: int) -> float:
denominator = n * math.log(math.log(n))
return divisor_sum(n) / denominator
def best_witness(max_range: int, search_start: int = SEARCH_START) -> int:
return max(range(search_start, max_range), key=witness_value)
The Golang implementation is trivial as well:
// BestWitness scans [searchStart, maxRange) serially and returns the best
// Witness index along with its Witness value.
func BestWitness(maxRange, searchStart int64) (int64, float64) {
maxVal := 0.0
bestWitness := searchStart
for i := searchStart; i < maxRange; i++ {
currentWitness := WitnessValue(i, -1)
if currentWitness > maxVal {
bestWitness = i
maxVal = currentWitness
}
}
return bestWitness, maxVal
}
Finally, the tests in each language.
Python:
def test_best_witness_million():
    start_time = datetime.datetime.now()
    output = best_witness(20_000_000, search_start=11000)
    end_time = datetime.datetime.now()
    print("Total time: ", end_time - start_time)
    assert output == 55440
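(A caveat on the Python timing: it also includes numba's one-off JIT compilation, since the first call to the function is what triggers compilation. Over runs lasting tens of seconds the effect is small, but it is worth keeping in mind.)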
Golang:
It("should find best witness successfully", func() {
count_till := int64(20_000_000)
output, witnessVal := riemann.BestWitness(count_till, 11000)
fmt.Println("\nCurrent Best till", humanize.Comma(int64(count_till)), "is", output, "at value", witnessVal)
Expect(output).To(Equal(int64(55440)))
})
Parallelization
Python
For Python, numba makes parallelizing a piece of cake: we just add the parallel=True argument to the njit decorator and switch the loop to prange. However, the parallel iterations do not necessarily run in any particular order, so we cannot simply take the max over a range with a key function like we did previously. Instead, we preallocate an array that each parallel iteration can write to independently, and at the end of the calculation we use np.argmax to find the index with the maximum value. The updated function is given below:
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def best_witness(max_range: int, search_start: int = SEARCH_START) -> int:
    # Preallocate one slot per candidate so the parallel iterations
    # can write their results without coordinating with each other.
    # NB: witness_value must itself be @njit-compiled to be callable here.
    witness_array = np.zeros(max_range)
    for i in prange(search_start, max_range):
        witness_array[i] = witness_value(i)
    return np.argmax(witness_array)
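Why is np.argmax safe here? The slots below search_start are never written and stay at zero, while every Witness value in the searched range is positive, so the maximum is guaranteed to come from [search_start, max_range).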
With the above parallel code, the process ran in ~20 seconds on my MacBook Air (M1). This is 3x the speed of the non-parallelized version!
Golang
In Golang, parallelizing the code takes a little more effort. First, we need to define a witness structure to hold the results.
type witness struct {
idx int64
value float64
}
We make a buffered channel of the witness type, with a buffer size equal to the number of iterations so that no goroutine ever blocks on send; this lets us safely wait for all of them to finish before draining the channel:
allValues := make(chan witness, maxRange-searchStart)
Then we create a sync.WaitGroup and add a count equal to the number of iterations, to make sure that we proceed only when all of the iterations are done.
var wg sync.WaitGroup
wg.Add(int(maxRange - searchStart))
Now we iterate across the elements, starting one goroutine per iteration to compute the Witness value. Each goroutine pushes its output to the channel once it has calculated the result, and decrements the WaitGroup counter via a deferred wg.Done() when it finishes. Note that we pass the loop variable i in as a parameter j: before Go 1.22, goroutines that captured i directly would all share the same variable, a classic data race.
for i := searchStart; i < maxRange; i++ {
go func(j int64) {
defer wg.Done()
currentWitness := WitnessValue(j, -1)
allValues <- witness{j, currentWitness}
}(i)
}
We wait for all the iterations to be done using wg.Wait() and then close the channel, which lets the range loop below terminate once the buffered values are consumed. We can then iterate over the channel to find the index of the element with the maximum Witness value.
wg.Wait()
close(allValues)
bestWitness := int64(0)
maxVal := -1.0
for val := range allValues {
if val.value > maxVal {
maxVal = val.value
bestWitness = val.idx
}
}
Here is the final function in all its glory -
func BestWitness(maxRange, searchStart int64) (int64, float64) {
allValues := make(chan witness, maxRange-searchStart)
var wg sync.WaitGroup
wg.Add(int(maxRange - searchStart))
for i := searchStart; i < maxRange; i++ {
go func(j int64) {
defer wg.Done()
currentWitness := WitnessValue(j, -1)
allValues <- witness{j, currentWitness}
}(i)
}
wg.Wait()
close(allValues)
bestWitness := int64(0)
maxVal := -1.0
for val := range allValues {
if val.value > maxVal {
maxVal = val.value
bestWitness = val.idx
}
}
return bestWitness, maxVal
}
The above function took almost 40 seconds to run, which is roughly the same as the non-parallel version! I could see that all the cores were getting used, but the overhead of creating 20 million goroutines was too high: Amdahl's law came into play and negated any advantage of parallelization. We will have to do better!
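As a rough mental model, Amdahl's law caps the speedup at 1 / ((1 - p) + p / N), where p is the fraction of the work that can run in parallel and N is the number of cores. Spawning a goroutine and doing a channel send for every single candidate adds per-item overhead that effectively shrinks p, so even with 8 cores the achievable speedup collapses back towards 1x.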
Parallel Optimization
In the next iteration of the code, I try a more optimized approach wherein I restrict the parallelism to a smaller number, say 100. The work-splitting logic is smarter and stripes the work evenly across this limited pool of goroutines: goroutine 0 handles searchStart, searchStart + 100, searchStart + 200, and so on, while goroutine 1 handles searchStart + 1, searchStart + 101, etc.
func BestWitness(maxRange, searchStart int64) (int64, float64) {
parallelism := int64(100)
allValues := make(chan witness, parallelism)
var wg sync.WaitGroup
wg.Add(int(parallelism))
for i := int64(0); i < parallelism; i++ {
go func(j int64) {
defer wg.Done()
maxVal := 0.0
bestWitness := searchStart
for k := searchStart + j; k < maxRange; k += parallelism {
currentWitness := WitnessValue(k, -1)
if currentWitness > maxVal {
maxVal = currentWitness
bestWitness = k
}
}
allValues <- witness{bestWitness, maxVal}
}(i)
}
wg.Wait()
close(allValues)
bestWitness := int64(0)
maxVal := -1.0
for val := range allValues {
if val.value > maxVal {
maxVal = val.value
bestWitness = val.idx
}
}
return bestWitness, maxVal
}
Now the test runs in 10s, which is ~4x faster than the non-parallel version!
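As an aside (a sketch I have not benchmarked): rather than hard-coding the parallelism at 100, the pool size could be derived from the machine using runtime.NumCPU():

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// runtime.NumCPU reports the number of logical CPUs; on an M1 Air it
	// returns 8 (4 performance + 4 efficiency cores). A small multiple of
	// this is a reasonable starting point for a CPU-bound worker pool.
	parallelism := int64(runtime.NumCPU())
	fmt.Println("worker pool size:", parallelism)
}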
Revisiting Python
We can now refactor the Python implementation using the same optimization logic:
@njit(parallel=True)
def best_witness(max_range: int, search_start: int = SEARCH_START) -> int:
    parallelism = 100
    witness_array = np.zeros(max_range)
    # Each of the 100 parallel stripes walks every parallelism-th number.
    for i in prange(parallelism):
        current_best_witness = -1.0
        for k in range(search_start + i, max_range, parallelism):
            witness = witness_value(k)
            # Only write when the stripe's running best improves; the
            # stripe's true maximum always gets written, so the global
            # argmax below remains correct.
            if witness > current_best_witness:
                current_best_witness = witness
                witness_array[k] = witness
    return np.argmax(witness_array)
This optimized implementation takes ~15 seconds to run, which is 4x faster than the non-parallel version and 1.3x faster than the naive parallel implementation!
big.LITTLE
I was initially concerned that the maximum performance improvement we could attain with multi-core processing was only ~4x, and not closer to 7x with the 8 cores in my MacBook. It turns out the M1 processor has a big.LITTLE architecture with 4 heavy-duty performance cores and 4 smaller efficiency cores that are much slower. This means we are constrained largely by the 4 big cores, and it is unlikely that we can get much faster. Hence, the 4x speedup that we got in both Python and Golang makes perfect sense.
Conclusion
Here are the final results of the different approaches we tried out:
Implementation/Time(s) | serial | parallel | parallel (optimized) |
---|---|---|---|
Python | ~60 | ~20 | ~15 |
Golang | ~40 | ~40 | ~10 |
Looks like we were able to improve both implementations by 4x, which I suppose is the practical maximum speedup we can get on this machine. With the best implementation, Golang continues to retain a ~30% speedup over Python. However, the Python implementation was way simpler than the Golang one, which is always an aspect (probably the most important one) we need to consider before chasing benchmarks.