[C] memcached crash on Debian 9 (memcached 1.4.33 and a similar one on 1.5.0) [memcached]

farshidce 2017-10-9 119

We have seen memcached 1.4.33 crash on many of our Debian 9 machines with this segfault:

Jul 29 07:53:20 cid1004 kernel: [3952384.853073] memcached[46864]: segfault at 7ee15bfff010 ip 0000560640fb2bdd sp 00007f3f60ea8f30 error 4 in memcached[560640f9e000+23000]

Aug 1 10:33:30 cid1002 kernel: [4220877.876325] memcached[45061]: segfault at 7f9433fff010 ip 000055b8bf126bdd sp 00007ff4bf7fdf30 error 4 in memcached[55b8bf112000+23000]

After we upgraded to 1.5.0 (we picked up the Debian package), many of the memcached instances crashed again, this time with a new segfault:

memcached[16448]: segfault at 7f016ffff010 ip 000055f569bcbead sp 00007f5faeffcf30 error 4 in memcached[55f569bb5000+26000]

The 1.4.33 crashes happened over the course of 2 weeks of uptime, each at a different time. The 1.5.0 crash, however, happened when a few hundred clients disconnected and reconnected to memcached, so it's possible the cause of that crash is different too.

Since these are Debian builds I was not able to obtain a backtrace. If someone is more familiar with Debian builds and knows how to extract such info, I will be happy to investigate more.

root@cid1010:~# uname -a
Linux cid1010 4.9.0-3-amd64 #1 SMP Debian 4.9.30-1 (2017-06-04) x86_64 GNU/Linux

          total        used        free      shared  buff/cache   available

Mem: 503G 37G 457G 3.7G 8.8G 459G
Swap: 1.9G 0B 1.9G

Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
stats
STAT pid 24938
STAT uptime 1158
STAT time 1503673490
STAT version 1.5.0
STAT libevent 2.0.21-stable
STAT pointer_size 64
STAT rusage_user 60.132000
STAT rusage_system 47.532000

memcached is started by systemd

[Service]
LimitNOFILE=32768
ExecStart=/usr/bin/numactl --interleave=all /usr/share/memcached/scripts/systemd-memcached-wrapper /srv/etc/%i.conf

conf file:

-m 360000
-p 11241
-u memcache
-l 0.0.0.0
-c 8192
-d
Comments (46)
dormando 2017-10-9
1

Where are you getting these builds from? I think Debian 9 has 1.4.33, but where did 1.5.0 come from?

Any chance you could get an addr2line from the segfault? I.e., take the number after ip and run addr2line -e memcached $num

Also, just for sanity's sake, since the formatting in your free output is tough to read: you are purposefully giving it 360 gigabytes of RAM, yes?

farshidce 2017-10-9
2

I picked up the deb package from
https://packages.debian.org/sid/memcached

I tried downloading the source tarball from debian.org, but addr2line still didn't work. Last I checked, the memcached binary wasn't compiled with the flags that addr2line needs. I will keep looking to see if I can get addr2line to work there.

Yes, we want to use 360 GB. This crash, and even the previous one, happened when memcached was already full and evicting items; not sure if that matters.

dormando 2017-10-9
3

Any chance you could run a machine straight from the main source tarball? It's pretty easy to compile and should slot into your existing systemd scripts if you copy over the binary.

dormando 2017-10-9
4

(But also: you should be able to grab the symbols extras under Debian and make addr2line work, so one of those options will hopefully help.) I've not heard of any crashes in a while, so I hope to resolve this quickly.

farshidce 2017-10-9
5

I will try both, but given that this is not easily reproducible so far, I will first try to add the missing symbols to make addr2line work on the existing memcached binaries, and then use it to map the 1.4.33 and 1.5.0 segfaults.

dormando 2017-10-9
6

Good luck. Let me know if there's anything I can do in the meantime.

dormando 2017-10-9
7

Also, if I could get copies of the full stats output (privately, if that's trouble) (stats, stats items, stats slabs, stats settings), it can be helpful in narrowing down which code paths are problematic. Bonus points if they're from an instance that crashed and relatively close to the crash.

Thanks!

farshidce 2017-10-9
8

Unfortunately I don't have the stats output from these memcached instances. We only collect curr_items and throw away the rest.
I am still going through the Debian wiki pages to figure out how to add the missing symbols for 1.5.0, but I think I have the right build and symbols for the 1.4.33 crash we have seen. I ran info symbol on 5 crashes and they all point to:
assoc_maintenance_thread + 189 in section .text
(gdb) info symbol 0x14bdd
assoc_maintenance_thread + 189 in section .text
(gdb) info symbol 0x14bdd
assoc_maintenance_thread + 189 in section .text

I rebuilt memcached on Debian 9 using dpkg-buildpackage -rfakeroot -uc -us from the 1.5.0-1 source that was downloaded using apt-get source memcached, and re-ran info symbol:

Reading symbols from memcached...done.
(gdb) info symbol 0x16EAD
assoc_delete + 13 in section .text

From what I understand, dpkg-buildpackage should rebuild the same binary, but I could be wrong. I guess I have to make sure the machine I build on also matches the build machine used by the Debian folks here:
https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/memcached.html

I also enabled stats collection on the 1.5.0 instances so that I can send over the "stats" the next time one crashes.

dormando 2017-10-9
9

Not sure why you have to rebuild it. The debug symbols are automatically created from the official builds:
https://wiki.debian.org/AutomaticDebugPackages - you have to add the debian-debug repos, then install memcached-dbgsym, so far as I can tell. Then addr2line should work?

dormando 2017-10-9
10

Also, you can just grab a stats dump (for all the commands I asked about) from any machine that's ideally full and evicting, to give me a start.

farshidce 2017-10-9
11

Yeah, it's available for most versions, but somehow the dbgsym package for 1.5.0-1 on amd64 is missing from all Debian repos (e.g. http://debug.mirrors.debian.org/debian-debug/pool/main/m/memcached/).

A dbgsym package is available for 1.5.0 b1, which is different from 1.5.0-1.
I will have to ask the maintainer of the dbgsym package whether it's available anywhere I can download it from.

dormando 2017-10-9
12

Ah, weird. I saw it in the packages list. What about 1.4.33? You have a segfault from that one that you can resolve too.

farshidce 2017-10-9
13

Installed memcached-dbgsym for 1.4.33 and used both gdb and addr2line.

Based on this segfault:

messages.3.gz:Jul 31 15:47:09 cid1006 kernel: [4152954.040931] memcached[47687]: segfault at 7ec58bfff010 ip 000055e1e64d8bdd sp 00007f2742cc1f30 error 4 in memcached[55e1e64c4000+23000]
root@cid1006:/var/log# dpkg -l | grep memcached
ii  libhashkit-dev                 1.0.18-4.1                     amd64        libmemcached hashing functions and algorithms (development files)
ii  libhashkit2:amd64              1.0.18-4.1                     amd64        libmemcached hashing functions and algorithms
ii  libmemcached-dev               1.0.18-4.1                     amd64        C and C++ client library to the memcached server (development files)
ii  libmemcached11:amd64           1.0.18-4.1                     amd64        C and C++ client library to the memcached server
ii  libmemcachedutil2:amd64        1.0.18-4.1                     amd64        library implementing connection pooling for libmemcached
ii  memcached                      1.4.33-1                       amd64        high-performance memory object caching system
ii  memcached-dbgsym               1.4.33-1                       amd64        Debug symbols for memcached
root@cid1006:/var/log# addr2line -e /usr/bin/memcached 14bdd
./assoc.c:230
root@cid1006:/var/log# addr2line -e /usr/bin/memcached 000055e1e64d8bdd
??:0
root@cid1006:/var/log# gdb /usr/bin/memcached 
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/memcached...Reading symbols from /usr/lib/debug/.build-id/0f/00482ef25da1c88117f24a2326795273b7a7ec.debug...done.
done.
(gdb) info symbol 0x000055e1e64d8bdd
No symbol matches 0x000055e1e64d8bdd.
(gdb) info symbol 0x14bdd
assoc_maintenance_thread + 205 in section .text of /usr/bin/memcached

0x14bdd is the ip (000055e1e64d8bdd) minus the start of the memcached mapping (55e1e64c4000).
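
The subtraction above can be scripted so that each kernel segfault line yields the offset to pass to addr2line or gdb. The following is a minimal sketch, not part of the original thread; the function name and regular expression are illustrative, and it assumes the mapping reported in brackets starts at the beginning of the binary image, as it did for the lines posted here.

```python
import re

# Matches the kernel log format seen in this thread, e.g.
# "... segfault at <addr> ip <ip> sp <sp> error 4 in memcached[<base>+<size>]"
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<name>\S+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def segfault_offset(line):
    """Return ip minus mapping base for one kernel segfault line."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a kernel segfault line")
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    size = int(match.group("size"), 16)
    # Sanity check: the faulting instruction must lie inside the mapping.
    if not base <= ip < base + size:
        raise ValueError("ip is outside the reported mapping")
    return ip - base


line = (
    "Jul 31 15:47:09 cid1006 kernel: [4152954.040931] memcached[47687]: "
    "segfault at 7ec58bfff010 ip 000055e1e64d8bdd sp 00007f2742cc1f30 "
    "error 4 in memcached[55e1e64c4000+23000]"
)
print(hex(segfault_offset(line)))
```

The printed value is what would be given to addr2line -e /usr/bin/memcached, in place of the raw ip that produced ??:0 above.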

dormando 2017-10-9
14

Hrm. That line would be pretty concerning.

Any chance I could get that full set of stats from any instance in that pool which is presently full and evicting? I can set up a repro (need stats | stats items | stats slabs | stats settings).
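
For readers who want to gather the same four dumps without a manual telnet session, here is a hedged sketch using only the Python standard library. It is not from the thread; the function names are hypothetical, and it assumes the text protocol shown earlier, where each stats reply is a series of "STAT name value" lines terminated by END.

```python
import socket

# The four commands requested in the thread.
STATS_COMMANDS = ("stats", "stats items", "stats slabs", "stats settings")


def parse_stats(text):
    """Turn 'STAT name value' lines into a dict, stopping at END."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            result[parts[1]] = parts[2]
    return result


def fetch_stats(host, port, command, timeout=5.0):
    """Send one stats command and read the reply up to the END marker."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(65536)
            if not chunk:  # server closed the connection
                break
            data += chunk
    return parse_stats(data.decode("ascii", "replace"))


def dump_all(host, port):
    """Collect every requested stats group into one dict keyed by command."""
    return {command: fetch_stats(host, port, command) for command in STATS_COMMANDS}
```

Calling dump_all("127.0.0.1", 11241) would match the port in the conf file from the first post; an unexpected reply that never sends END ends in a socket timeout rather than a partial result.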

farshidce 2017-10-9
15

These stats are from a 1.5.0 installation that was converted from 1.4.33 after it crashed a few weeks ago:
https://gist.github.com/farshidce/8df1529762440eefd7666b0aa6e521f0

There are still a few memcached 1.4.33 instances in the cluster with the same read/write access pattern, and they have not crashed. The only difference is that those use 120 GB or 240 GB of RAM. Not sure if it's relevant, but just in case, this could help in debugging the issue.

dormando 2017-10-9
16

0 cmd_get? Is that instance in use?

And yeah, I have a suspicion about memory usage; that's why I was asking for stats from a normal instance that's near full. Are any of the crashed 1.4.33's with 360g refilled or nearly refilled?

farshidce 2017-10-9
17

This mc (1.5.0) has been in use since it was restarted due to the crash a few days ago:

https://gist.github.com/farshidce/2092ab52af08fb018a8f052b94470c90

I also have the "stats" being collected from all 1.5.0 instances, so I can submit their logs if a crash occurs again. If it helps, I can also send the logs from when eviction starts on those 360 GB memcached instances.

dormando 2017-10-9
18

You don't have any full 1.4.33's at 360G that you could just telnet to and grab the stats from?

I have a feeling it's this: dormando/memcached@c5b3ecb

Branch here:
https://github.com/dormando/memcached/tree/hashoverwhelming

Hashmasks are 32bit, but some pre-x86-64 code for hash table sizing assumes a 4-byte value that is actually 8 bytes on a 64bit machine. It would've overflowed and broken either way, though. I see in your stats output that almost all items sit in the 96b class.

I don't actually have any machines on hand that can repro this, since I need around 300 gigs even if I cut the min slab class size down to 64b.

I'm still auditing the code to see if anything else breaks after 2^32 items are stored, but I'm pretty sure everything but the hash table itself was made 64bit safe by now. Some stats output from a full/evicting instance would help confirm that while I devise some tests.
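
As a rough illustration of why a 32bit hash puts a ceiling on the table size, the sketch below models bucket selection in Python. It is not memcached's C code and does not reproduce the actual bug; the function names are hypothetical. It only shows that a hash value limited to 32 bits can never select more than 2^32 distinct buckets, however large the table is made.

```python
MASK32 = 0xFFFFFFFF  # largest value a 32-bit hash can take


def bucket_for(hash_value, hash_power):
    """Pick a bucket in a table of 2**hash_power slots using a 32-bit hash."""
    hash32 = hash_value & MASK32           # model a hash function with 32-bit output
    return hash32 & ((1 << hash_power) - 1)


def reachable_buckets(hash_power):
    """Upper bound on the distinct bucket indexes a 32-bit hash can produce."""
    return min(1 << hash_power, 1 << 32)


for power in (31, 32, 33):
    print(power, 1 << power, reachable_buckets(power))
```

Under this model a table at power 33 would have twice as many slots as a 32bit hash can ever address, which is consistent with a patch that caps the hash power at 32.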

farshidce 2017-10-9
19

I can downgrade one of the existing 1.5 memcached instances to 1.4.33 and wait until eviction starts. This could take a few days, though.
Based on your theory, is it necessary for the instance to receive cmd_get traffic, or is it sufficient if there are continuous set commands on the instance until eviction starts?

If it helps, I can also send the stats from other memcached instances that use less than 250 GB of memory and have not crashed yet.

dormando 2017-10-9
20

Don't bother downgrading. Sorry, I thought you still had some 360's at 1.4.33. One of the < 250's would be fine; I can draw a line based on the numbers in those stats.

Sets alone should be able to repro it, but due to the way hashtables expand you can't just flood it with sets from a benchmark. It has to grow a bit more naturally. Though you can add -o hashpower=32 to the startup options and then flood it (or 31?).

farshidce 2017-10-9
21

OK. Also, here are the stats from a memcached with 1.4.33 that uses less RAM and has not crashed:

https://gist.github.com/farshidce/7af244044d833694845576d8358b7f2b

dormando 2017-10-9
22

Thanks! I feel like my patch might do it, if you can spare a machine to try it on.

Also, just out of curiosity: do these fill up to some amount before getting gets, or how does your use case work? If you don't mind me asking :)

dormando 2017-10-9
23

Also, for what it's worth: you might notice some mild slowdown, but I don't think your RAM capacity is high enough to cause a severe problem. Especially with murmurhash enabled in 1.5.

Getting 64bit hashtables supported is on the TODO list behind a few more big batches of work, though. So maybe late this year, or early next, and hopefully before 1TB instances become common :) On 64bit machines many 64bit hashes are faster than their 32bit equivalents anyway.

farshidce 2017-10-9
24

Yes, we do wait until the cache is at 50% or so of its capacity before the cmd_get traffic starts.

I can apply the patch once I figure out how to rebuild it using the Debian scripts. Just to make sure I understand: should I apply this patch to the 1.5.0 installations?

dormando 2017-10-9
25

Yeah, the 1.5.0. The patch is on top of 1.5.1 but it should apply. If you're fetching the Debian source directory, it's relatively trivial to either swap source files or add a patch file.

farshidce 2017-10-9
26

4 memcached instances running 1.5.0 crashed, and I think eviction had not begun.

Here are the "stats" from them:
https://gist.github.com/farshidce/1543ede69269b5dbc4dfd3d5798c6eed
I do have the stats from them for the past 24 hours, and if it helps I can send them privately.

I don't have stats slabs and stats items from them, though.

dormando 2017-10-9
27

Do you have addr2line's? Especially of that third one.

dormando 2017-10-9
28

That third one is confusing; it has zero traffic. What was the time delta between the stats output and the crash for that one?

farshidce 2017-10-9
29

Since I don't have the dbgsym build of 1.5.0-1, I don't know if I can trust the "info symbol" output for this segfault:
[6413766.405039] memcached[29289]: segfault at 7f8ae3fff020 ip 000055c239c10ead sp 00007feff8986f30 error 4 in memcached[55c239bfa000+26000]

But this is what I get:

Reading symbols from memcached...done.
(gdb) info symbol 0x16EAD
assoc_delete + 13 in section .text

I do have a memcached with 1.5.0-1+b4, and if that instance crashes I can run info symbol on it, since the 1.5.0-1+b4 dbg build is available.

farshidce 2017-10-9
30

Double checking the gist right now ...

farshidce 2017-10-9
31

I have uploaded stats from three instances that crashed. The stats in the tarball are from the last 12 hours or so.
https://drive.google.com/file/d/0B5OKKYhgSvqsdzE5akk0MmJveVE/view

Please ignore the "third" one from the gist above. It was from an inactive instance.

dormando 2017-10-9
32

Thanks!

So, 3221225472 is the exact cutoff when moving from hashpower 31 to 32. I see all three of those stats logs you gave me die right around there.

The patch I gave you limits it to 32... but I'm not sure why 32 would crash. If you repro a crash with 32, you can just modify the HASHPOWER_MAX value in the patch to 31 and see if that stabilizes.

I'll take a closer look when I can.
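
The cutoff quoted above equals 1.5 times 2^31. The thread does not spell out the expansion rule, so the 3/2 factor in the sketch below is an inference from that number rather than a quotation from the source, and the function name is hypothetical.

```python
def expansion_cutoff(hash_power, numerator=3, denominator=2):
    """Item count at which a table of 2**hash_power buckets would grow,
    assuming growth is triggered at numerator/denominator items per bucket."""
    return (1 << hash_power) * numerator // denominator


for power in (30, 31, 32):
    print(power, expansion_cutoff(power))
```

With these assumptions the value for power 31 is 3221225472, the figure given in the comment, which is why instances died once curr_items approached roughly 3.2 billion.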

dormando 2017-10-9
33

Got a repro of the 31b -> 32b hash expansion segfaulting at that line. Waiting for a headache to disperse, then I'll figure out why. Was pretty sure 32b should work.

dormando 2017-10-9
34

Believe I fixed it: dormando/memcached@2253661
(pushed to the same branch as before)
Thankfully trivial (so far). If you run this branch it shouldn't crash, at least not in this area.

Also, for what it's worth, it takes a very long time to expand the hash table once above 30b. You should consider monitoring the 'hashpower' stat and adding -o hashpower=N to the startup options based on how large it got in recent runs, probably broken down by how much RAM each host has.

Partly because it has to keep the old hash table around along with the new one while it's expanding, which gets quite large after 30b. Partly because it just burns CPU for 5-15 minutes going from 31 to 32.

Pretty surprised nobody had gotten more than 3.2b items in memcached before, given this bug is 11 years old! Bound to happen some day.
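
To put numbers on "quite large", the sketch below estimates hash table memory per hash power. It assumes one pointer per bucket (8 bytes, from STAT pointer_size 64); that assumption is not stated in the thread, but it matches the hash_bytes value of 34359738368 reported at hash_power_level 32 later on. The function names are hypothetical.

```python
POINTER_BYTES = 8  # from "STAT pointer_size 64"; one pointer per bucket is assumed


def hash_table_bytes(hash_power):
    """Memory for a table of 2**hash_power buckets."""
    return (1 << hash_power) * POINTER_BYTES


def expansion_peak_bytes(old_power):
    """Memory held while the old and the next-larger table coexist."""
    return hash_table_bytes(old_power) + hash_table_bytes(old_power + 1)


GIB = 1024 ** 3
for power in (30, 31, 32):
    print(power, hash_table_bytes(power) // GIB, "GiB")
print("peak 31 -> 32:", expansion_peak_bytes(31) // GIB, "GiB")
```

Under these assumptions the table alone is 32 GiB at power 32, and the 31 to 32 expansion briefly holds 48 GiB, which is the motivation for presizing with -o hashpower=N on hosts expected to reach that size.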

dormando 2017-10-9
35

Any luck?

farshidce 2017-10-9
36

Haven't had a chance to apply the patch yet, since my Debian skills aren't that great when it comes to building from source. I should be able to report back in a few days, though. Also, the caches are now using 240 GB and eviction had not kicked in yet, last I checked.

farshidce 2017-10-9
37

I deployed the patched version a couple of days ago and reverted the memory back to 360 GB.
I should be able to confirm within a week whether the crash occurs again.

dormando 2017-10-9
38

Good to hear. I need to cut another release soon.

dormando 2017-10-9
39

How'd it go? :) I should release this this weekend, if you've managed to go over 3.2b objects.

farshidce 2017-10-9
40

No, curr_items is at 2.5 billion now. It should get there in less than a week, though. Does it help if there are more "gets", or is curr_items the only way to trigger the crash?

dormando 2017-10-9
41

curr_items is the only way.

farshidce 2017-10-9
42

It has surpassed 3.2 billion now, but no eviction yet. Once eviction starts I will update the thread as well.

STAT connection_structures 212
STAT reserved_fds 20
STAT cmd_get 0
STAT cmd_set 14413086019
STAT cmd_flush 0
STAT cmd_touch 0
STAT get_hits 0
STAT get_misses 0
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 1995118762855
STAT bytes_written 56514914505
STAT limit_maxbytes 377487360000
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 50712660
STAT hash_power_level 32
STAT hash_bytes 34359738368
STAT hash_is_expanding 0
STAT slab_reassign_rescues 4031951
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 632
STAT slab_reassign_busy_items 55
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 1244
STAT lru_crawler_running 1
STAT lru_crawler_starts 92185
STAT lru_maintainer_juggles 4459561986
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT bytes 345601420404
STAT curr_items 3645773016
STAT total_items 14417117970
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 0
STAT evicted_active 0
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 563440512109
STAT lrutail_reflocked 0
STAT moves_to_cold 6473590922
STAT moves_to_warm 0
STAT moves_within_lru 0
STAT direct_reclaims 0
STAT lru_bumps_dropped 0
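
A couple of figures in this dump can be cross-checked against earlier posts. The sketch below is illustrative and not from the thread; the function names are hypothetical, and it assumes that -m is given in units of 1024 * 1024 bytes, which the reported limit_maxbytes confirms for -m 360000.

```python
MIB = 1024 * 1024


def limit_maxbytes_for(megabytes):
    """Byte limit implied by a '-m <megabytes>' setting."""
    return megabytes * MIB


def average_item_bytes(stats):
    """Mean bytes per stored item, from the 'bytes' and 'curr_items' stats."""
    return int(stats["bytes"]) / int(stats["curr_items"])


# Values copied from the stats dump above.
stats = {
    "bytes": "345601420404",
    "curr_items": "3645773016",
    "limit_maxbytes": "377487360000",
}

print(limit_maxbytes_for(360000) == int(stats["limit_maxbytes"]))
print(round(average_item_bytes(stats), 1))
```

The average comes out a little under 96 bytes per item, in line with the earlier remark that almost all items sit in the 96b class.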
dormando 2017-10-9
43

hash_power_level 32 - excellent. Looks like it successfully expanded (unless you started it that large?). I'll cut a release this weekend if I can, regardless.

Thanks for your patience in getting this figured out! Sorry I left that bug for so long. I'll leave this open until you report successful evictions :P

farshidce 2017-10-9
44

Ongoing evictions for a few days and no crash so far :)

STAT auth_errors 0
STAT bytes_read 2342629179875
STAT bytes_written 66338251139
STAT limit_maxbytes 377487360000
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 59541931
STAT hash_power_level 32
STAT hash_bytes 34359738368
STAT hash_is_expanding 0
STAT slab_reassign_rescues 5025597
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 615597
STAT slab_reassign_inline_reclaim 443100
STAT slab_reassign_busy_items 615687
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 1604
STAT lru_crawler_running 1
STAT lru_crawler_starts 106429
STAT lru_maintainer_juggles 5211087955
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT bytes 349840389083
STAT curr_items 3693186283
STAT total_items 16925237992
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 408186948
STAT evicted_active 0
STAT evictions 408186948
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 713096544837
STAT lrutail_reflocked 0
STAT moves_to_cold 7594040314
STAT moves_to_warm 0
STAT moves_within_lru 0
STAT direct_reclaims 408186948
STAT lru_bumps_dropped 0
END

dormando 2017-10-9
45

That is a lot of items. Thanks for verifying! I didn't get a release out over the weekend, but one will happen soon. I should stack a few more small fixes in first anyway.

dormando 2017-10-9
46

Released in 1.5.2. Thanks again.
