16.2.3 示例:二分查找和 prefetch

列表 16-21 是一个例子,我们来研究一下。

Listing 16-21.prefetch_binsearch.c

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1024*512*16

int binarySearch(int *array, size_t number_of_elements, int key) {
    size_t low = 0, high = number_of_elements-1, mid;
    while(low <= high) {
        mid = (low + high)/2;
#ifdef DO_PREFETCH
        // low path
        __builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
        // high path
        __builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif

        if(array[mid] < key)
            low = mid + 1;
        else if(array[mid] == key)
            return mid;
        else if(array[mid] > key)
            high = mid-1;
    }
    return -1;
}

int main() {
    size_t i = 0;
    int NUM_LOOKUPS = SIZE;
    int *array;
    int *lookups;

    srand(time(NULL));
    array =  malloc(SIZE*sizeof(int));

    lookups = malloc(NUM_LOOKUPS * sizeof(int));
    for (i=0;i<SIZE;i++) array[i] = i;
    for (i=0;i<NUM_LOOKUPS;i++) lookups[i] = rand() % SIZE;

    for (i=0;i<NUM_LOOKUPS;i++)
        binarySearch(array, SIZE, lookups[i]);
    free(array);
    free(lookups);
}

二分查找算法的访问模式导致我们难以对其要访问的内存进行预测。因为这种算法非线性,从起点跳到终点,然后再跳到中间,然后循环往复。让我们看看执行时间的差别。

列表 16-22 展示了 prefetch 关闭的结果。

Listing 16-22.binsearch_prefetch_off

> gcc -O3 prefetch.c -o prefetch_off && /usr/bin/time -v ./prefetch_off
   Command being timed: "./prefetch_off"   
   User time (seconds): 7.56
   System time (seconds): 0.02
   Percent of CPU  this job got: 100%
   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.58
   Average shared text size (kbytes): 0
   Average unshared data size (kbytes): 0
   Average stack size (kbytes): 0
   Average total size (kbytes): 0
   Maximum resident set size (kbytes): 66432
   Average resident set size (kbytes): 0
   Major (requiring I/O) page faults: 0
   Minor (reclaiming a frame) page faults: 16444
   Voluntary context switches: 1
   Involuntary context switches: 51
   Swaps: 0
   File system inputs: 0
   File system outputs: 0
   Socket messages sent: 0
   Socket messages received: 0
   Signals delivered: 0
   Page size (bytes): 4096
   Exit status: 0

列表 16-23 展示了 prefetch 打开的执行结果。

Listing 16-23.binsearch_prefetch_on

> gcc -O3 prefetch.c -o prefetch_off && /usr/bin/time -v ./prefetch_off
   Command being timed: "./prefetch_on"
   User time  (seconds):  6.56
   System time (seconds): 0.01
   Percent of CPU  this job got: 100%
   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.57
   Average shared text size (kbytes): 0
   Average unshared data size (kbytes): 0
   Average stack size (kbytes): 0
   Average total size (kbytes): 0
   Maximum resident set size (kbytes): 66512
   Average resident set size (kbytes): 0
   Major (requiring I/O) page faults: 0
   Minor (reclaiming a frame) page faults: 16443
   Voluntary context switches: 1

用带有 cachegrind 模块的 valgrind 工具,我们可以检查 cache miss 的次数。列表 16-24 列出了关闭 prefetch 的结果,列表 16-25 列出了 prefetch 打开的结果。

大写字母 I 表示指令缓存,D 表示数据缓存,LL 表示 Last Level Cache。数据缓存 miss 几乎是 100%,结果比较糟糕。

Listing 16-24.binsearch_prefetch_off_cachegrind

==25479== Cachegrind, a cache and branch-prediction profiler

==25479== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.

==25479== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info

==25479== Command: ./prefetch\_off

==25479==

--25479-- warning: L3 cache found, using its data for the LL simulation.

==25479==

==25479== I   refs:      2,529,064,580

==25479== I1  misses:              778

==25479== LLi misses:              774

==25479== I1  miss rate:          0.00%

==25479== Lli miss rate:          0.00%

==25479==

==25479== D   refs:         404,809,999   (335,588,367 rd   + 69,221,632 wr)

==25479== D1  misses:       160,885,105   (159,835,971 rd   +  1,049,134 wr)

==25479== LLd misses:       133,467,980   (132,418,879 rd   +  1,049,101 wr)

==25479== D1  miss rate:           39.7%  (       47.6%     +        1.5%  )

==25479== LLd miss rate:           33.0%  (       39.5%     +        1.5%  )

==25479==

==25479== LL refs:          160,885,883   (159,836,749 rd   +  1,049,134 wr)

==25479== LL misses:        133,468,754   (132,419,653 rd   +  1,049,101 wr)

==25479== LL miss rate:             4.5%  (        4.6%     +        1.5%  )

Listing 16-25.binsearch_prefetch_on_cachegrind

==26238== Cachegrind, a cache and branch-prediction profiler

==26238== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.

==26238== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info

==26238== Command: ./prefetch_on

==26238==

--26238-- warning: L3 cache found, using its data for the LL simulation.

==26238==

==26238== I   refs:     3,686,688,760

==26238== I1  misses:             777

==26238== LLi misses:             773

==26238== I1  miss rate:         0.00%

==26238== LLi miss rate:         0.00%

==26238==

==26238== D   refs:       404,810,009   (335,588,374  rd   + 69,221,635 wr)

==26238== D1  misses:     160,887,823   (159,838,690  rd   +  1,049,133 wr)

==26238== LLd misses:     133,488,742   (132,439,642  rd   +  1,049,100 wr)

==26238== D1  miss  rate:        39.7%  (       47.6%      +        1.5%  )

==26238== LLd miss rate:         33.0%  (       39.5%      +        1.5%  )

==26238==

==26238== LL refs:        160,888,600   (159,839,467  rd   +  1,049,133 wr)

==26238== LL misses:      133,489,515   (132,440,415  rd   +  1,049,100 wr)

==26238== LL miss rate:           3.3%  (        3.3%      +        1.5%  )

可以看到,缓存 miss 极大地降低了。

results matching ""

    No results matching ""