In Basic compilation we indicated [options] for the cmake command, but left them blank because the defaults should work fine in most cases. Here, we'll discuss some common optional packages, features and performance tweaks, and the [options] used to activate them.
Finding libraries / optional libraries
- If you manually installed some of the dependencies, and/or they are installed in non-standard locations, you can tell cmake where they are using [options]. For example, to use a custom-installed GNU Scientific Library in /home/user/gsl, add -D GSL_PATH=/home/user/gsl to [options].
- You can use Intel's Math Kernel Library (MKL) to provide FFT, BLAS and LAPACK. Add -D EnableMKL=yes to [options], and additionally specify MKL_PATH as indicated above if MKL is in a non-standard location (anywhere other than /opt/intel/mkl).
- You can use MKL to provide BLAS and LAPACK, but still use FFTW for Fourier transforms, by adding -D ForceFFTW=yes to [options]. We find this option to often be more reliable than using the MKL FFTs.
- Set -D ThreadedBlas=yes or no to indicate whether the BLAS library is multithreaded or not. When used with MKL, this will automatically select whether JDFTx links to the sequential or threaded layers of MKL.
- LibXC provides additional exchange-correlation functionals. JDFTx can link to LibXC version >= 3; add -D EnableLibXC=yes to [options], and if necessary specify LIBXC_PATH.
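As a rough sketch of how several of the library options above combine, the snippet below assembles a cmake command line that uses MKL for BLAS and LAPACK, FFTW for Fourier transforms, and a custom GSL location. The GSL path is the illustrative one from above, /home/user/intel/mkl is a hypothetical non-standard MKL location, and jdftx-VERSION is the same placeholder used in Basic compilation. The command is only printed, not executed, so it can be inspected first.

```shell
#!/bin/sh
# Build up [options] one flag at a time; all paths are illustrative.
OPTIONS="-D GSL_PATH=/home/user/gsl"

# MKL for BLAS/LAPACK; MKL_PATH is only needed for a non-standard location
# (the path below is hypothetical).
OPTIONS="$OPTIONS -D EnableMKL=yes -D MKL_PATH=/home/user/intel/mkl"

# Keep FFTW for Fourier transforms instead of the MKL FFTs.
OPTIONS="$OPTIONS -D ForceFFTW=yes"

# Link against the threaded layer of MKL.
OPTIONS="$OPTIONS -D ThreadedBlas=yes"

# jdftx-VERSION is a placeholder for the actual source directory name.
CMAKE_CMD="cmake $OPTIONS ../jdftx-VERSION/jdftx"
echo "$CMAKE_CMD"
```

Once the printed command looks right, run it from the build directory in place of the plain cmake command from Basic compilation.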
Optional compilation flags
- Add -D EnableProfiling=yes to [options] to get summaries of run times per function and memory usage by object type at the end of calculations.
- Adding -D LinkTimeOptimization=yes will enable link-time optimizations (-ipo for the Intel compilers and -flto for the GNU compilers). Note that this significantly slows down the final link step of the build process.
- Add -D StaticLinking=yes to compile JDFTx statically. This is necessary on Windows and is turned on automatically there. It could also be useful on other platforms to compile on one machine and execute on another without the compiler and support libraries installed.
- At the default optimization level, the compiled executable is not locked to specific CPU features. You can enable machine specific optimizations (-march=native on gcc, -fast on icc) by adding -D CompileNative=yes to [options]. Note however that this might cause your executable to be usable only on machines with CPUs of the same or newer generation than the machine it was compiled on. Also, this rarely provides any real performance benefits, because most of the JDFTx execution time is in the BLAS and FFT libraries anyway.
Changing compilers
The cmake commands in Basic compilation use the default compiler (typically g++) and reasonable optimization flags. Using a different compiler requires setting environment variables rather than passing [options] to cmake. For example, you can use the Intel compiler with the following command (note the bash-specific syntax for environment variables):
CC=icc CXX=icpc cmake [options] ../jdftx-VERSION/jdftx
Make sure the environment variables for the Intel compiler (path settings etc.) are loaded before issuing that command (see the compiler documentation / install notes). Of course, you would probably include -D EnableMKL=yes in [options] to also use Intel MKL.
Similarly, to use the Clang compiler:
CC=clang CXX=clang++ cmake [options] ../jdftx-VERSION/jdftx
GPU support
For GPU support, install the CUDA SDK (either from the website, or your package manager, if available) and add -D EnableCUDA=yes to [options]. If you get an unsupported compiler error, comment out the GCC version check from $CUDA_DIR/include/host_config.h.
Also consider compiling with the following optional flags:
- -D PinnedHostMemory=yes: Use page-locked memory on the host (CPU) to speed up memory transfers to the GPU. Make sure the usage limits allow sufficient / unlimited page-locked memory (see e.g. the ulimit man page).
- -D CudaManagedMemory=yes: Use CUDA managed/unified memory to let the CUDA driver manage memory transfers between CPU and GPU automatically, instead of JDFTx handling them. This is usually slightly slower than the explicit memory management, but on new-enough GPUs, it allows running calculations that do not fit within the GPU memory. For optimal performance, it is recommended to have builds with and without this flag, and use the managed build only when you need to stretch the limit of the GPU memory.
- -D CudaAwareMPI=yes: If your MPI library supports direct transfers from GPU memory, this flag will speed up MPI data transfers between GPUs.
- Prior to GPU runs, consider setting the environment variable JDFTX_CACHE_SIZE to a large fraction of the memory available per GPU, specified in MB. For example, say "export JDFTX_CACHE_SIZE=5120" (i.e. 5 GB) for a GPU with 6 GB memory. This caches GPU pointers, bypassing expensive cudaMalloc / cudaFree calls, while keeping total GPU memory used (in use + cache) below the specified size if possible. GPU allocations are cached even if this size is not set, allowing the cache to expand to use all the GPU memory; this may be a little slower because the code has to rely on failed cudaMalloc calls to detect that it has reached the limit.
- Use the JDFTX_CACHE_MARGIN environment variable to fine-tune the cached allocations. This controls the fraction of memory that is allowed to be 'wasted' when reusing a previously freed pointer for a new allocation. For example, the default value of 0.5 allows reusing a pointer with capacity up to 1.5 times the required size, which could end up needing 50% more memory than the absolute minimum usage. For calculations that are tight fits, reduce the margin: this will lead to less wastage, but also less reuse of pointers, increasing the cudaMalloc overhead.
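The arithmetic behind the two cache variables can be sketched as follows. This is only an illustration under stated assumptions: the fixed 1024 MB headroom is a hypothetical way of picking "a large fraction" that happens to reproduce the 5120 MB example, and can_reuse is a hypothetical helper that mirrors the margin rule as described above, not the actual JDFTx allocator logic.

```shell
#!/bin/sh
# Total memory of one GPU in MB (6 GB, matching the example above).
# On a real system this could be read with, for example:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
GPU_MEM_MB=6144

# Hypothetical headroom (in MB) left outside the cache; tune for your workload.
HEADROOM_MB=1024

JDFTX_CACHE_SIZE=$((GPU_MEM_MB - HEADROOM_MB))
export JDFTX_CACHE_SIZE
echo "JDFTX_CACHE_SIZE=$JDFTX_CACHE_SIZE"

# Hypothetical helper illustrating the margin rule: a freed pointer of
# capacity $1 may serve a request of size $2 if it is large enough and
# no larger than (1 + margin) times the request, with margin given as $3.
can_reuse() {
    awk -v cap="$1" -v req="$2" -v m="$3" \
        'BEGIN { exit !(cap >= req && cap <= (1 + m) * req) }'
}

# Compare the default margin with a tighter one for the same request.
for margin in 0.5 0.2; do
    if can_reuse 140 100 "$margin"; then
        echo "margin $margin: a 140 MB pointer can serve a 100 MB request"
    else
        echo "margin $margin: a 140 MB pointer is not reused for a 100 MB request"
    fi
done
```

With the tighter margin, the 140 MB pointer is skipped and a fresh allocation is needed, which is the trade-off between wasted memory and cudaMalloc overhead described above.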
If you want to run on a GPU, it must be a discrete (not on-board) NVIDIA GPU with compute capability >= 1.3, since that is the minimum for double precision. Specify CUDAARCHS=xy based on the minimum compute capability x.y to support. See https://developer.nvidia.com/cuda-gpus for compute capabilities of various GPUs.
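As a small sketch of the x.y to xy conversion, the snippet below derives the CUDAARCHS value from a compute capability string. The value 7.0 is just an example, and passing CUDAARCHS as an environment variable in front of cmake (in the same way as CC and CXX above) is an assumption here; the resulting command is only printed, not executed.

```shell
#!/bin/sh
# Minimum compute capability to support (example value; look up your GPU).
COMPUTE_CAPABILITY="7.0"

# Drop the dot: compute capability x.y becomes xy.
CUDAARCHS=$(printf '%s' "$COMPUTE_CAPABILITY" | tr -d '.')

# Assumption: CUDAARCHS is set as an environment variable for cmake.
# jdftx-VERSION is a placeholder for the actual source directory name.
CMAKE_CMD="CUDAARCHS=$CUDAARCHS cmake -D EnableCUDA=yes ../jdftx-VERSION/jdftx"
echo "$CMAKE_CMD"
```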
Note that you will get a real speedup only if your device has a higher memory bandwidth than your CPU/motherboard/RAM combination, since plane-wave DFT is often a memory-bound computation. Also keep in mind that you need a lot of memory on the GPU to actually fit systems of reasonable size (you probably need at least 10 GB of GPU memory to handle moderate-sized systems).
When you compile with GPU support, extra executables jdftx_gpu, phonon_gpu and wannier_gpu will be generated that will run code almost exclusively on the GPUs, in addition to the regular executables that only run on CPUs.