e98a620 (Mark partially purged arena chunks as non-hugepage.) attempts to explicitly interact with Linux's transparent huge page (THP) functionality, but it has two shortcomings. First, it mistakenly assumes that new chunks are created as if madvise(... MADV_HUGEPAGE) had been applied. That is not the case, so the chunk creation code needs to add an explicit call. Second, and more generally, THP requests can cause serious scalability issues depending on the kernel version and configuration. Prior to Linux 4.6 it wasn't even possible to tune the kernel to satisfy THP requests asynchronously. We need to provide a way to opt out of explicit THP requests, so that applications can work around kernel issues as necessary. This can just be an opt.thp option that defaults to true on relevant systems; it's hard to imagine use cases for finer-grained control.
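To make the proposal concrete, here is a rough sketch of an explicit THP request at chunk creation, gated by an opt-out flag. It is not the actual jemalloc code: `opt_thp`, `chunk_request_thp()`, and `chunk_map()` are hypothetical names used only for illustration, and the real chunk allocation path has alignment and hook handling that this ignores.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS / MADV_HUGEPAGE on glibc */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the proposed opt.thp option. */
static bool opt_thp = true;

/*
 * Explicitly request THP backing for [addr, addr+size).
 * Returns true only if a request was made and the kernel accepted it.
 */
static bool
chunk_request_thp(void *addr, size_t size)
{
#ifdef MADV_HUGEPAGE
	if (!opt_thp)
		return false; /* Application opted out of explicit requests. */
	return madvise(addr, size, MADV_HUGEPAGE) == 0;
#else
	(void)addr;
	(void)size;
	return false; /* No THP support visible at compile time. */
#endif
}

/* Map a new chunk and, unless opted out, ask for THP explicitly. */
static void *
chunk_map(size_t size)
{
	void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return NULL;
	/* A failed THP request is not fatal; the chunk is still usable. */
	(void)chunk_request_thp(addr, size);
	return addr;
}
```

With `opt_thp` set to false the mapping is left in whatever state the system-wide THP settings dictate, which is the escape hatch applications would need on kernels with problematic THP behavior.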
@jasone Can you elaborate on exactly which kernel patches in Linux 4.6 you are referring to, and which kernel THP configuration you would recommend using together with full THP support in jemalloc?
@TheCrazyLex, see https://www.kernel.org/doc/Documentation/vm/transhuge.txt for documentation on the "defer" option for /sys/kernel/mm/transparent_hugepage/defrag, which should make it possible to avoid blocking when no THPs are immediately available. Although I don't have personal experience with tuning Linux with this option, it appears to provide a solution to the blocking issues I've seen reports of in the context of jemalloc. (NB: the "defer+madvise" option appears to be brand new.)
The THP support in jemalloc 4.5.0 is a pretty conservative approach, in that it leaves the default THP state alone for huge allocations (2+ MiB), and it also leaves the default state alone for each chunk from which small and large allocations are carved, up until the point where unused dirty pages are purged from within a chunk. Once that happens, the chunk is forced to be non-THP until/unless it is completely discarded, at which point the corresponding virtual memory is restored to the default THP state. The default state depends on /sys/kernel/mm/transparent_hugepage/enabled; see https://www.kernel.org/doc/Documentation/vm/transhuge.txt for details.
I would recommend experimenting with the /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag settings to figure out what works best for you. It may be that "always" and "defer" are a good approach, but depending on application behavior you may be better off with some other combination.
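When experimenting, it can help to confirm which values are actually in effect. The sysfs files list every choice on one line and put the active one in brackets (for example `always [madvise] never`). Below is a small sketch that extracts the active value; `thp_active_value()` and `thp_read_setting()` are hypothetical helper names, not part of jemalloc or any kernel interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token from a sysfs line such as
 * "always [madvise] never" into out.  Returns 0 on success, or -1 if
 * there is no bracketed token or it does not fit in out.
 */
static int
thp_active_value(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = (open != NULL) ? strchr(open, ']') : NULL;
	if (close == NULL)
		return -1;
	size_t len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the first line of a sysfs file and extract its active value. */
static int
thp_read_setting(const char *path, char *out, size_t outlen)
{
	char line[256];
	FILE *f = fopen(path, "r");
	if (f == NULL)
		return -1; /* File missing, e.g. kernel built without THP. */
	char *ok = fgets(line, sizeof(line), f);
	fclose(f);
	return (ok != NULL) ? thp_active_value(line, out, outlen) : -1;
}
```

Calling `thp_read_setting()` on /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag before and after each change gives a quick record of which combination a given benchmark run actually used.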