What is a Kernel Overhead?
Tag : development
Date : November 25 2020, 07:06 PM

Overhead of supporting Floating Point Arithmetic inside the Linux Kernel

Tag : linux
Date : March 29 2020, 07:55 AM
The usual answer is that if the kernel does not use floating point, it does not have to save the floating-point registers on entry to the kernel or restore them on exit. This shaves several hundred cycles off the cost of all system calls.
I do not know if anyone has tried to compare these savings against the performance improvements that might be available if the kernel could make indiscriminate use of those registers. Note that you can use them in the kernel if you take proper care, and this is done in contexts where tremendous speed benefits are available, e.g. using SSE instructions to accelerate memcpy and the like. (Look for calls to kernel_fpu_begin in the Linux sources.)

Accurate way of measuring overhead in kernel space

Tag : linux
Date : March 29 2020, 07:55 AM
I think you would do better to measure a typical application payload (as Ninjajl's comment suggests, compiling the kernel could be a good payload). You probably don't want to measure the overhead inside each syscall itself, or even inside the kernel as a whole.
The reason is that most applications spend much more time and resources in user space than in kernel-land (i.e. in syscalls), so overhead inside syscalls is a "second-order" effect and probably doesn't matter as much. Of course, there are likely exceptions.

How to measure overhead of a kernel launch in CUDA

Tag : cuda
Date : March 29 2020, 07:55 AM
Measuring the overhead of a kernel launch in CUDA is covered in Section 6.1.1 of "The CUDA Handbook" by N. Wilt. The basic idea is to launch an empty kernel. Here is a sample code snippet:
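#include <stdio.h>

// A kernel that does nothing, so the measured time is launch overhead only.
__global__ void EmptyKernel() { }

int main() {

    const int N = 100000;

    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i=0; i<N; i++) {

        cudaEventRecord(start, 0);
        EmptyKernel<<<1,1>>>();
        cudaEventRecord(stop, 0);
        // Wait until the stop event has completed before reading the timer.
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;

    }

    printf("Kernel launch overhead time:  %3.5f ms \n", cumulative_time / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}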

Reducing kernel overhead when reading a huge file with lazy bytestrings

Tag : linux
Date : March 29 2020, 07:55 AM
First, it seems like there is something wonky going on with your machine. When I run this program on a 1G file cached in memory or in a tmpfs filesystem (it doesn't matter which), the system time is substantially smaller:
1.44user 0.14system 0:01.60elapsed 99%CPU (0avgtext+0avgdata 50256maxresident)

A C version that reads the file with read() in 1 MiB chunks and counts null bytes:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFLEN (1024*1024)
char buffer[BUFLEN];

int main(void)
{
        int nulls = 0;
        int fd = open("/dev/shm/testfile5G.dat", O_RDONLY);
        ssize_t n;
        /* Scan only the bytes each read() actually returned. */
        while ((n = read(fd, buffer, BUFLEN)) > 0) {
                for (ssize_t i = 0; i < n; ++i) {
                        if (!buffer[i]) ++nulls;
                }
        }
        printf("%d\n", nulls);
        return 0;
}

real    0m2.035s
user    0m1.619s
sys     0m0.416s

A Haskell version that counts null bytes with a lazy ByteString:
import Data.Word
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  contents <- BSL.readFile "/scratch/buhr/testfile5G.dat"
  print $ BSL.foldl' go 0 contents
    where go :: Int -> Word8 -> Int
          go n 0 = n + 1
          go n _ = n

real    0m8.411s
user    0m7.966s
sys     0m0.444s

A Haskell version that mmaps the file and folds over it as a strict ByteString (the MMAP module is not shown here):
import System.Posix.IO
import Foreign.Ptr
import Foreign.ForeignPtr
import MMAP
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BS

-- exact length of file
len :: Integral a => a
len = 5368709120

main :: IO ()
main = do
  fd <- openFd "/scratch/buhr/testfile5G.dat" ReadOnly Nothing defaultFileFlags
  ptr <- newForeignPtr_ =<< castPtr <$>
    mmap nullPtr len protRead (mkMmapFlags mapPrivate mempty) fd 0
  let contents = BS.fromForeignPtr ptr 0 len
  print $ BS.foldl' (+) 0 contents

real    0m7.972s
user    0m7.791s
sys     0m0.181s

A C version that counts null bytes over an mmap of the file:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

size_t len = 5368709120;

int main(void)
{
        int nulls = 0;
        int fd = open("/scratch/buhr/testfile5G.dat", O_RDONLY);
        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        /* size_t index: len exceeds INT_MAX, so an int counter would overflow. */
        for (size_t i = 0; i < len; ++i) {
                if (!p[i]) ++nulls;
        }
        printf("%d\n", nulls);
        return 0;
}

real    0m1.888s
user    0m1.708s
sys     0m0.180s

Minimal overhead way of intercepting system calls without modifying the kernel

Tag : c
Date : March 29 2020, 07:55 AM
