
CUDA QUESTIONS

Do the SMs shown in the "occupancy graph" correspond to blockIdx.x or register %smid?
Do the SMs shown in the occupancy graph correspond to blockIdx.x or to the %smid register? Quoting Robert Crovella: …
TAG : cuda
Date : January 06 2021, 03:27 AM , By : toki
CUDA translation of AVX permute and shuffle in registers
How do I perform arbitrary permutations of a register float variable (always of length 32)? I have seen suggestions that __shfl_sync will do this, but no example showing it. A numpy version of a simple case of what I want…
TAG : cuda
Date : January 02 2021, 10:54 PM , By : CM.
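An arbitrary 32-lane permutation can indeed be expressed with __shfl_sync. A minimal sketch (the kernel and array names are illustrative, not from the original answer):

    // Each lane of the warp holds one float in a register; lane i ends up
    // with the value previously held by lane perm[i].
    __global__ void permuteWarp(const float *in, const int *perm, float *out)
    {
        int lane = threadIdx.x & 31;                 // lane index within the warp
        float val = in[lane];                        // register-resident value
        float permuted = __shfl_sync(0xFFFFFFFFu, val, perm[lane]);
        out[lane] = permuted;
    }

Launched with a single warp, e.g. permuteWarp<<<1, 32>>>(...), this applies the permutation entirely in registers.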
Interpreting the verbose output of ptxas, part II
This question is a continuation of "Interpreting the verbose output of ptxas, part I". Is cmem short for constant memory?
TAG : cuda
Date : January 02 2021, 06:48 AM , By : francisco santos
CUDA: cudaMemcpy only works in emulation mode
You should check for errors, ideally after each malloc and memcpy, but doing it once at the end will be sufficient: cudaGetErrorString(cudaGetLastError()). Just to check the obvious: …
TAG : cuda
Date : January 02 2021, 06:48 AM , By : niel
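The error-checking habit the answer recommends is usually wrapped in a macro. A sketch, with an illustrative macro name:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess)                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
        } while (0)

    int main()
    {
        float host[256] = {0};
        float *dev = nullptr;
        CUDA_CHECK(cudaMalloc(&dev, sizeof(host)));
        CUDA_CHECK(cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaFree(dev));
        return 0;
    }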
Allocate constant memory
Unfortunately the __constant__ declaration must be in the same file scope as the memcpy to the symbol, and in your case your __constant__ is in a separate .cu file. The simple way around this is to provide a wrapper function in your…
TAG : cuda
Date : January 02 2021, 06:48 AM , By : Mikael
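A sketch of the wrapper-function workaround (file and symbol names are illustrative): the __constant__ symbol and the copy to it live in the same .cu file, and other translation units call the wrapper through a normal declaration.

    // constants.cu
    #include <cuda_runtime.h>

    __constant__ float coeffs[16];

    // Callable from any other .cu/.cpp file.
    cudaError_t setCoeffs(const float *hostData)
    {
        return cudaMemcpyToSymbol(coeffs, hostData, sizeof(coeffs));
    }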
Cuda GPU optimization
These are a few striking examples from the natural sciences: ab initio quantum chemistry calculation (TeraChem): up to 50x…
TAG : cuda
Date : January 02 2021, 06:48 AM , By : Paulh
Issued load/store instructions for replay
The NVIDIA architecture optimizes memory throughput by issuing an instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element, then the access can be…
TAG : cuda
Date : December 31 2020, 08:18 AM , By : Paulh
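A sketch of the contrast the answer describes (kernel names are illustrative): consecutive per-thread accesses coalesce into few transactions, while strided accesses cause replays.

    __global__ void coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];        // warp reads 32 consecutive elements
    }

    __global__ void strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];  // scattered reads replay
    }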
Nested loops modulo permutation in cuda
One possible approach is given on the Wikipedia page "Finding the k-combination for a given number": a closed-form solution for converting a linear index into a unique C(n,3) combination. However, it will involve calculating…
TAG : cuda
Date : December 27 2020, 02:42 PM , By : Alpinfish
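A sketch of that combinadic unranking for k = 3, in the simple linear-search form (a closed-form or binary-search variant is faster for large n):

    __host__ __device__ long long choose3(long long c) { return c * (c - 1) * (c - 2) / 6; }
    __host__ __device__ long long choose2(long long c) { return c * (c - 1) / 2; }

    // Maps a linear index to a unique triple a > b > c >= 0.
    __host__ __device__ void unrank3(long long idx, int &a, int &b, int &c)
    {
        a = 2; while (choose3(a + 1) <= idx) ++a;
        idx -= choose3(a);
        b = 1; while (choose2(b + 1) <= idx) ++b;
        idx -= choose2(b);
        c = (int)idx;
    }

For example, idx = 0 yields (2, 1, 0), idx = 1 yields (3, 1, 0), and idx = 2 yields (3, 2, 0).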
CUDA global atomic operations across concurrent kernel executions
Yes, it's possible. Atomic operations to global memory are device-wide: they will be atomic with respect to any code running on the device.
TAG : cuda
Date : December 25 2020, 07:01 PM , By : PatrickSimonHenk
How to fix cudaError 77 when copying back from device to host
I am writing a simple example program to test memcpy and kernel-run concurrency for a larger program. While writing this example, I stumbled upon error 77, aka cudaErrorIllegalAddress. There are several problems with this…
TAG : cuda
Date : December 25 2020, 09:19 AM , By : ArmHead
CUDA cuSolver gesvdj with large matrix
Thanks to @talonmies for the help in diagnosing the problem. cuSolver's gesvdj method has an economy mode which stores the U and V matrices in more economical arrays. The modifications I made to make the code work are simple.
TAG : cuda
Date : December 25 2020, 04:01 AM , By : Pepe Araya
How to create a persistent framebuffer in cuda
After chasing this down for a while, I noticed a problem when trying to allocate and initialize the __device__ float* accum_buffer. On line 96, I see that width and height are zero, so the accum_buffer is being allocated…
TAG : cuda
Date : December 24 2020, 05:01 AM , By : waarg
How to transpose a sparse matrix in cuSparse?
For matrices in CSR (or CSC) format: the CSR sparse representation of a matrix has an identical format/memory layout to the CSC sparse representation of its transpose.
TAG : cuda
Date : December 23 2020, 07:06 AM , By : user167963
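In other words, the transpose can be free: the same three arrays describe both matrices. A sketch with illustrative struct names:

    struct Csr { int rows, cols, nnz; int *rowPtr; int *colInd; float *val; };
    struct Csc { int rows, cols, nnz; int *colPtr; int *rowInd; float *val; };

    // A in CSR reinterpreted as A^T in CSC: same buffers, swapped dimensions.
    Csc transposeView(const Csr &a)
    {
        return Csc{ a.cols, a.rows, a.nnz, a.rowPtr, a.colInd, a.val };
    }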
Upscaling data in Cuda
It is, quite frankly, getting a bit tedious watching you post new versions of this code, pronouncing that they now either work or don't work, when all of them have had the same or related indexing issues within the kernel…
TAG : cuda
Date : December 21 2020, 11:01 PM , By : kiirpi
Can I find price floors and ceilings with cuda
My question is: is there a way to find these reversal points in parallel?
TAG : cuda
Date : December 06 2020, 11:49 PM , By : kakashi_
The difference between running time and time of obtaining results in CUDA
The confusion here seems to have arisen from using a host-based timing method to time what is (mostly) device activity. Kernel launches are asynchronous: the host code launches the kernel and then proceeds without waiting…
TAG : cuda
Date : November 25 2020, 09:00 AM , By : WuJanJai
Is there an alternative of std::memcmp in cuda?
While not functionally identical to std::memcmp, the Thrust template library includes a comparison reduction operation, thrust::equal, which returns true or false depending on whether the elements of two iterator ranges compare identically. If you…
TAG : cuda
Date : November 24 2020, 05:48 AM , By : Eric
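A minimal sketch of the thrust::equal approach (the data is illustrative):

    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/equal.h>

    int main()
    {
        thrust::device_vector<int> a(1000, 42);
        thrust::device_vector<int> b(1000, 42);
        bool same = thrust::equal(a.begin(), a.end(), b.begin());
        printf("ranges %s\n", same ? "match" : "differ");
        return 0;
    }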
cuda group by and atomic counters
Thanks to the kind reminder from talonmies about the return value of atomicAdd, I've been able to fix my kernel to this:
TAG : cuda
Date : November 24 2020, 05:48 AM , By : TheMoo
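The key fact is that atomicAdd returns the old value, which can serve as a unique output slot. A sketch (not the poster's fixed kernel; names are illustrative):

    __global__ void collect(const int *in, int n, int key, int *out, int *count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] == key) {
            int slot = atomicAdd(count, 1);   // old value = this thread's slot
            out[slot] = i;
        }
    }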
Bug in PTX ISA (carry propagation)?
I believe you're making some assumptions about the CC.CF flag referenced in the PTX ISA documentation that may not be valid. Note that definitions of the specific states (e.g. 0 or 1) of this bit are never given, as far as I can see. Furthermore…
TAG : cuda
Date : November 24 2020, 05:44 AM , By : Brian Drum
Eliminate cudaMemcpy between kernel calls
I've got a CUDA kernel that is called many times (1 million is not the limit). Whether we launch the kernel again or not depends on a flag (result_found) that our kernel returns. 1) Is there any way to avoid calling cudaMemcpy here?
TAG : cuda
Date : November 21 2020, 09:01 AM , By : Thierry Brunet
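One possible way to avoid the per-iteration cudaMemcpy, sketched under the assumption that the flag is a single int: keep it in mapped pinned memory, so the kernel writes it and the host reads it directly.

    #include <cuda_runtime.h>

    int main()
    {
        int *h_flag = nullptr, *d_flag = nullptr;
        cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_flag, h_flag, 0);
        *h_flag = 0;

        // kernel<<<grid, block>>>(..., d_flag);  // kernel sets *d_flag on success
        cudaDeviceSynchronize();                  // make the write visible
        // if (*h_flag) { ... }                   // host reads it, no memcpy
        cudaFreeHost(h_flag);
        return 0;
    }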
Finding device ID from kernel thread
If your device supports CUDA dynamic parallelism, you can use the cudaGetDevice() call in device code, as documented here:
TAG : cuda
Date : November 21 2020, 07:38 AM , By : TomL
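A sketch of the device-side call; it requires relocatable device code and the device runtime (e.g. nvcc -rdc=true -lcudadevrt):

    #include <cstdio>

    __global__ void whichDevice()
    {
        int dev = -1;
        cudaGetDevice(&dev);                 // device runtime API call
        printf("running on device %d\n", dev);
    }

    int main()
    {
        whichDevice<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }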
How does cublas implement asynchronous scalar variable transmission
If it's a host-resident scalar, it can be passed by value as a kernel parameter. If it's device-resident, then a pointer to it can be passed as a kernel parameter.
TAG : cuda
Date : November 21 2020, 07:35 AM , By : Anand
Segmentation error when using thrust::sort in CUDA
cutInfoptr is a pointer of type TetraCutInfo holding the address of device memory allocated using cudaMalloc. …
TAG : cuda
Date : November 20 2020, 01:01 AM , By : jehammon
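When handing a raw cudaMalloc'ed pointer to thrust::sort, Thrust must be told the data is device-resident. A sketch of the usual fix, via the execution policy (wrapping the pointer in thrust::device_ptr works too):

    #include <thrust/sort.h>
    #include <thrust/execution_policy.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1024;
        int *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(int));
        // ... fill d_data ...
        thrust::sort(thrust::device, d_data, d_data + n);
        cudaFree(d_data);
        return 0;
    }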
thrust reduction result on device memory
Is it possible to leave the return value of a thrust::reduce operation in device-allocated memory?
TAG : cuda
Date : November 18 2020, 03:42 PM , By : Anand
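thrust::reduce itself returns its result to the host. One way to keep the sum in device memory is CUB's reduction, which Thrust builds on (a sketch, not necessarily the answer's method):

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1 << 20;
        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));

        void  *d_temp = nullptr;
        size_t tempBytes = 0;
        cub::DeviceReduce::Sum(d_temp, tempBytes, d_in, d_out, n); // size query
        cudaMalloc(&d_temp, tempBytes);
        cub::DeviceReduce::Sum(d_temp, tempBytes, d_in, d_out, n); // result stays on device

        cudaFree(d_temp); cudaFree(d_in); cudaFree(d_out);
        return 0;
    }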
thrust transform defining custom binary function
It appears that you want to pass two input vectors to thrust::transform and then do an in-place transform (i.e. no output vector is specified). There is no such incarnation of thrust::transform…
TAG : cuda
Date : November 14 2020, 05:16 PM , By : zdyne
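The binary overload of thrust::transform always takes an output iterator, but the output may alias an input, which gives the in-place effect. A sketch with an illustrative functor:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    struct axpy2 {
        __host__ __device__ float operator()(float x, float y) const
        { return 2.0f * x + y; }
    };

    int main()
    {
        thrust::device_vector<float> a(1000, 1.0f);
        thrust::device_vector<float> b(1000, 3.0f);
        // In place: a[i] = 2*a[i] + b[i]
        thrust::transform(a.begin(), a.end(), b.begin(), a.begin(), axpy2());
        return 0;
    }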
CUDA: cuSolver raises an exception
I am trying to use the cuSolver library to solve a number of linear equations, but an exception is raised instead, which is very strange. The code uses only one function from the library; the rest is memory allocation and memory copy.
TAG : cuda
Date : November 14 2020, 07:01 AM , By : iyogee
Ray-tracing and CUDA
It seems all the errors are reported for lines where you try to double-index things. The line numbering is a little off, but from the warning kernel.cu(27): warning: variable "int_faces" was set but never used it can be deduced…
TAG : cuda
Date : November 12 2020, 04:01 AM , By : Joe
numba cuda does not produce correct result with += (gpu reduction needed?)
Yes, a proper parallel reduction is needed to sum data from multiple GPU threads into a single variable. Here's one trivial example of how it could be done from a single kernel:
TAG : cuda
Date : November 11 2020, 04:01 AM , By : Michael T.
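The question is about Numba, but the idea carries over directly. A CUDA C sketch of the trivial single-kernel version, replacing the racy += with an atomic:

    __global__ void sumAll(const float *in, int n, float *result)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(result, in[i]);   // serialized per element, but correct
    }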
Reading from an unaligned uint8_t recast as a uint32_t array - not getting all values
If you want bytes 2..6, you're going to have to combine multiple aligned loads to get what you want.
TAG : cuda
Date : November 10 2020, 11:01 PM , By : billputer
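A sketch of combining two aligned loads, assuming the wanted 32-bit value starts at byte offset 2 and a little-endian byte layout:

    __device__ unsigned int loadAtByte2(const unsigned int *aligned)
    {
        unsigned int lo = aligned[0];    // bytes 0..3
        unsigned int hi = aligned[1];    // bytes 4..7
        return (lo >> 16) | (hi << 16);  // bytes 2..5 as one 32-bit value
    }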
Installing CUDA as a non-root user with no GPU
Assuming you want to develop code that uses the CUDA runtime API, you can install the CUDA toolkit on a system that does not have a GPU. Using the runfile installer method, simply answer no when prompted to install the driver.
TAG : cuda
Date : November 07 2020, 09:00 AM , By : CodeOfficer
Utilizing register memory in CUDA
Only local variables are eligible to reside in registers (see also "Declaring Variables in a CUDA kernel"). You don't have direct control over which variables (scalar or static array) will reside in registers; the compiler will…
TAG : cuda
Date : October 31 2020, 05:38 PM , By : walshtp
The computation of global memory load transactions in CUDA kernel
There is only a small point that you missed. Global memory access is coalesced only for threads within a warp (see the programming guide). In your case there are 4 warps; each will need one memory transaction for the elements…
TAG : cuda
Date : October 28 2020, 08:10 AM , By : kema
How to combine thrust comparisons using placeholders?
This can be done with two thrust::partition operations. Partitioning is pretty simple: everything for which the predicate returns true is moved to the left side of the input vector; everything else is moved to the right. Here…
TAG : cuda
Date : October 24 2020, 08:10 PM , By : Revision17
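A sketch of the two-pass idea with placeholder predicates (the thresholds are illustrative):

    #include <thrust/device_vector.h>
    #include <thrust/partition.h>
    #include <thrust/functional.h>

    using namespace thrust::placeholders;

    int main()
    {
        thrust::device_vector<int> v(1000);
        // ... fill v ...
        const int lo = 10, hi = 100;
        auto mid = thrust::partition(v.begin(), v.end(), _1 < lo);  // small values first
        thrust::partition(mid, v.end(), _1 > hi);                   // then large values
        return 0;
    }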
Modifying zip iterator with eigen::Matrix gives erroneous results
Probably you just need to get the latest Eigen. I used CUDA 9.2 on Fedora 27 and grabbed the latest Eigen from here.
TAG : cuda
Date : October 24 2020, 08:10 AM , By : user103892
Cuda get gpu load percent
See http://eliang.blogspot.com.by/2011/05/getting-nvidia-gpu-usage-in-c.html?m=1
TAG : cuda
Date : October 16 2020, 03:08 PM , By : Noah
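An alternative to the linked post, not from the original answer, is NVML, which ships with the driver. A sketch (link with -lnvidia-ml):

    #include <cstdio>
    #include <nvml.h>

    int main()
    {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        nvmlUtilization_t util;
        nvmlDeviceGetUtilizationRates(dev, &util);
        printf("GPU %u%%, memory %u%%\n", util.gpu, util.memory);

        nvmlShutdown();
        return 0;
    }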
How to dynamically set the size of device_vectors in thrust set operations?
In the general case (i.e. across various Thrust algorithms) there is often no way to know the output size, except what the upper bound would be. The usual approach here would be to pass a result vector whose size is the upper bound…
TAG : cuda
Date : October 16 2020, 08:10 AM , By : jgood
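A sketch of the upper-bound-then-shrink pattern for a set operation; the intersection can never be larger than the smaller input:

    #include <algorithm>
    #include <thrust/device_vector.h>
    #include <thrust/set_operations.h>

    int main()
    {
        thrust::device_vector<int> a(1000), b(500);
        // ... fill a and b with sorted data ...
        thrust::device_vector<int> result(std::min(a.size(), b.size()));

        auto end = thrust::set_intersection(a.begin(), a.end(),
                                            b.begin(), b.end(),
                                            result.begin());
        result.resize(end - result.begin());   // shrink to the actual size
        return 0;
    }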
Nvidia Jetson TX1 against Jetson NANO (benchmarking)
I am currently trying to benchmark the Jetson TX1 against the Jetson NANO. According to https://elinux.org/Jetson, they both have the Maxwell architecture, with 128 CUDA cores for the NANO and 256 for the TX1. This means…
TAG : cuda
Date : October 08 2020, 10:00 PM , By : Piotr Balas
Transferring data from CPU to GPU and vice versa - where exactly?
The cudaMalloc function allocates a requested number of bytes in the device global memory of the GPU and gives back an initialized pointer to that chunk of memory. cudaMemcpy takes 4 parameters: the address of the destination…
TAG : cuda
Date : October 08 2020, 08:00 PM , By : Jesse
Multiway stable partition
From my knowledge of the Thrust internals, there is no readily adaptable algorithm to do what you envisage. A simple approach would be to extend your theoretical two-pass, three-way partition to M-1 passes using…
TAG : cuda
Date : October 07 2020, 08:00 PM , By : DaveF
CUDA index blockDim.y is always 1
(I spent quite some time looking for a dupe, but could not find it.) A dim3 variable is a particular data type defined in the CUDA header file vector_types.h.
TAG : cuda
Date : October 07 2020, 06:00 PM , By : user186831
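A sketch showing why blockDim.y stays 1 unless the launch configuration actually requests a 2-D block via a dim3:

    #include <cstdio>

    __global__ void show()
    {
        if (threadIdx.x == 0 && threadIdx.y == 0)
            printf("blockDim = (%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
    }

    int main()
    {
        show<<<1, 256>>>();      // scalar argument means dim3(256, 1, 1)
        dim3 block(16, 16);      // explicit 2-D block: blockDim.y == 16
        show<<<1, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }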
Can a new thread block be scheduled only after all warps in the previous thread block finish?
No, more than one thread block can be scheduled on one multiprocessor if there are sufficient resources. Yes, TB0 and TB1 can be scheduled on the same SM, resources permitting, although I would not call that "by interleaving warps…
TAG : cuda
Date : October 06 2020, 05:00 PM , By : doctorbigtime
Correct way of using cuda.jit in Numba
Trying to figure out how to do matrix-vector multiplication with cuda.jit in Numba, but I'm getting wrong answers. There are at least 2 errors in your code:
TAG : cuda
Date : October 04 2020, 04:00 PM , By : s8k
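For reference, the usual one-thread-per-output-row matrix-vector product, sketched in CUDA C (the question itself concerns Numba's cuda.jit):

    __global__ void matvec(const float *A, const float *x, float *y,
                           int rows, int cols)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < rows) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c)
                acc += A[r * cols + c] * x[c];
            y[r] = acc;
        }
    }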
CUDA: How many default streams are there on a single device?
By default, CUDA has a per-process default stream. There is a compiler flag, --default-stream per-thread, which changes the behaviour to a per-host-thread default stream; see the documentation. Note that streams and host threads are programming…
TAG : cuda
Date : October 04 2020, 10:00 AM , By : Asbie
Can CUDA store 8 unsigned char data in parallel
Just expanding comments into an answer: every older version of the compiler I tested (8.0, 9.1, 10.0) will emit two st.global.v4.u8 instructions in PTX (i.e. two 32-bit writes) for the uchar_8 assignment at the end of your kernel. CUDA 1…
TAG : cuda
Date : October 02 2020, 06:00 PM , By : damomurf
Cuda global memory load and store
I wanted to know whether a load or store from global memory is blocking, i.e., whether the next line does not run until the load or store is finished.
TAG : cuda
Date : September 30 2020, 06:00 AM , By : user158220
How to copy the pointer variables of an array of structures from host to device in CUDA
I want to copy an array of structures from host to device in different ways. I am able to copy the full structure from host to device, but I am unable to copy an individual element of the structure from host to device while one of the elements is…
TAG : cuda
Date : September 29 2020, 07:00 PM , By : AdrianB
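The usual deep-copy pattern for a struct with a pointer member, as a sketch (names are illustrative): copy the pointed-to data first, then copy the struct with its pointer field patched to the device address.

    #include <cuda_runtime.h>

    struct Record { int n; float *data; };

    int main()
    {
        float h_data[4] = {1, 2, 3, 4};
        Record h_rec{4, h_data};

        // 1. Copy the pointed-to array.
        float *d_data = nullptr;
        cudaMalloc(&d_data, h_rec.n * sizeof(float));
        cudaMemcpy(d_data, h_rec.data, h_rec.n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Copy the struct, swapping in the device pointer.
        Record tmp = h_rec;
        tmp.data = d_data;
        Record *d_rec = nullptr;
        cudaMalloc(&d_rec, sizeof(Record));
        cudaMemcpy(d_rec, &tmp, sizeof(Record), cudaMemcpyHostToDevice);

        cudaFree(d_data); cudaFree(d_rec);
        return 0;
    }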
What's the difference between kernel fusion and persistent threads?
The idea behind kernel fusion is to take two (or more) discrete operations, which could be realized (and might already be realized) in separate kernels, and combine them so the operations all happen in a single kernel. The benefits of this…
TAG : cuda
Date : September 29 2020, 05:00 PM , By : Ben
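A minimal sketch of fusion with two illustrative elementwise kernels; fused, the intermediate result never makes a round trip through global memory:

    __global__ void scale(float *x, int n)
    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }

    __global__ void offset(float *x, int n)
    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }

    // The fused equivalent of launching scale and then offset.
    __global__ void scaleOffset(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 2.0f * x[i] + 1.0f;
    }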
cudaEventElapsedTime and nvprof runtime
Your first measurement (based on elapsed time) includes kernel launch overhead. The second (based on CUDA events) mostly excludes the launch overhead. Given that your kernel does absolutely nothing (the single memory load will be optimized…
TAG : cuda
Date : September 28 2020, 04:00 AM , By : dexteryy
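A sketch of the event-based measurement: the events bracket device execution, so most host-side launch overhead falls outside the measured interval.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void empty() {}

    int main()
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        empty<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }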
Link .ll files generated by compiling .cu file with clang
The CUDA compilation trajectory in Clang is rather complicated (as it is in the NVIDIA toolchain), and what you are trying to do cannot work. The LLVM IR from each branch of the compilation process must remain separate until directly linked…
TAG : cuda
Date : September 27 2020, 10:00 AM , By : Anonymous
CUDA: why can't it printf the information in CUDA code?
I am a beginner with CUDA. I wrote a test code for testing the GPU device; my GPU model is a K80. Add: …
TAG : cuda
Date : September 21 2020, 10:00 AM , By : Kbotei
CUDA Cores and Streaming Multiprocessor Count for Inference Speed
That is not a correct interpretation of utilization. 10% utilization means, roughly speaking, that 10% of the time a GPU kernel is running, and 90% of the time no GPU kernel is running. It does not tell you anything about what that GPU kernel…
TAG : cuda
Date : September 02 2020, 07:00 PM , By : Joshua Johnson