Ouchies -- amd64 powernow-k8

userlame
Joined: 2004-03-17
Posts: 653
2008.02.19 - 00:14:29 PDT
I hit a stupid bug that should never have happened (I didn't mean for this to be in my kernel). When patching and rebuilding kernels to fix an awfully scary local exploit [userlame.com], I left CPU frequency scaling enabled, including the powernow-k8 driver. These were domain 0 kernels for Xen VM hosts, and under heavy load the domU would hit a general protection fault, which apparently comes from this driver. Here are a couple of examples of hitting the bug:

b296 login: general protection fault: 0000 [1] SMP
CPU 0
Pid: 15747, comm: x86_64-pc-linux Not tainted 2.6.20-xen-r6-domU #10
RIP: e030:[<ffffffff802564dc>] [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
RSP: e02b:ffff88000d1a7e48 EFLAGS: 00010206
RAX: 00002b0f019b9000 RBX: 0004000000000000 RCX: 00002b0f017b9000
RDX: ffff880018c77820 RSI: ffff880018c77840 RDI: ffff88000b8f6480
RBP: ffff880000198980 R08: ffff880018c77840 R09: ffff880001280480
R10: 0000000000007ff0 R11: ffffffff80219858 R12: ffff88000b8f6480
R13: 00002b0f01790000 R14: 0000000000000000 R15: 0000000000000000
FS: 00002b0f01ef2dd0(0000) GS:ffffffff8072a000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 0000000000000660
Process x86_64-pc-linux (pid: 15747, threadinfo ffff88000d1a6000, task ffff88000c32b510)
Stack: 00000000005c3fff ffff880018c77820 ffff880000198980 ffffffff8026027e
ffff88000b85abb8 ffff88000b8f6480 ffff88000b85abb8 ffffffff8025ec75
ffff88000d1a7ed0 ffff88000d1a7ed0 ffff88000c85f7c8 ffff88001c20d7c0
Call Trace:
[<ffffffff8026027e>] unlink_file_vma+0x31/0x3d
[<ffffffff8025ec75>] free_pgtables+0x69/0x99
[<ffffffff8025ff63>] exit_mmap+0x96/0xf0
[<ffffffff8022b1c6>] mmput+0x28/0x98
[<ffffffff802309c3>] do_exit+0x23b/0x81b
[<ffffffff80231044>] sys_exit_group+0x0/0xe
[<ffffffff80209d96>] system_call+0x86/0x8b
[<ffffffff80209d10>] system_call+0x0/0x8b


Code: 48 39 7b 60 74 04 0f 0b eb fe 48 83 7f 58 00 74 42 48 8b 4b
RIP [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
RSP <ffff88000d1a7e48>
<1>Fixing recursive fault but reboot is needed!
BUG: soft lockup detected on CPU#0!

Call Trace:
<IRQ> [<ffffffff8024a168>] softlockup_tick+0xdb/0xed
[<ffffffff8020e635>] timer_interrupt+0x3a0/0x3ff
[<ffffffff8024a44d>] handle_IRQ_event+0x47/0x8e
[<ffffffff8024bb1a>] handle_level_irq+0xae/0x100
[<ffffffff8020c8d1>] do_IRQ+0xde/0x149
[<ffffffff8047fe58>] evtchn_do_upcall+0xac/0x12d
[<ffffffff8020a48e>] do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff805b0f1a>] _spin_lock+0x7/0xf
[<ffffffff80260ccb>] vma_link+0x42/0x107
[<ffffffff80261a41>] do_mmap_pgoff+0x58c/0x74b
[<ffffffff802741f5>] sys_newfstat+0x20/0x29
[<ffffffff8020ff5a>] sys_mmap+0x8b/0x110
[<ffffffff80209d96>] system_call+0x86/0x8b
[<ffffffff80209d10>] system_call+0x0/0x8b


bob296 login: general protection fault: 0000 [1] SMP
CPU 0
Pid: 2234, comm: sed Not tainted 2.6.20-xen-r6-domU #10
RIP: e030:[<ffffffff802564dc>] [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
RSP: e02b:ffff88000d1b1e48 EFLAGS: 00010206
RAX: 00002af1c4a58000 RBX: 0004000000000000 RCX: 00002af1c4858000
RDX: ffff880016e7db40 RSI: ffff880016e7db60 RDI: ffff880011a77480
RBP: ffff88000002ad80 R08: ffff880016e7db60 R09: ffff88000132edb8
R10: 0000000000007ff0 R11: ffffffff80219858 R12: ffff880011a77480
R13: 00002af1c482f000 R14: 0000000000000000 R15: 0000000000000000
FS: 00002af1c4f91dd0(0000) GS:ffffffff8072a000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 0000000000000660
Process sed (pid: 2234, threadinfo ffff88000d1b0000, task ffff880001b88380)
Stack: 0000000000543fff ffff880016e7db40 ffff88000002ad80 ffffffff8026027e
ffff880011ea6918 ffff880011a77480 ffff880011ea6918 ffffffff8025ec75
ffff88000d1b1ed0 ffff88000d1b1ed0 ffff880011869d08 ffff8800113f5c40
Call Trace:
[<ffffffff8026027e>] unlink_file_vma+0x31/0x3d
[<ffffffff8025ec75>] free_pgtables+0x69/0x99
[<ffffffff8025ff63>] exit_mmap+0x96/0xf0
[<ffffffff8022b1c6>] mmput+0x28/0x98
[<ffffffff802309c3>] do_exit+0x23b/0x81b
[<ffffffff80231044>] sys_exit_group+0x0/0xe
[<ffffffff80209d96>] system_call+0x86/0x8b
[<ffffffff80209d10>] system_call+0x0/0x8b


Code: 48 39 7b 60 74 04 0f 0b eb fe 48 83 7f 58 00 74 42 48 8b 4b
RIP [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
RSP <ffff88000d1b1e48>
<1>Fixing recursive fault but reboot is needed!
BUG: soft lockup detected on CPU#0!

Call Trace:
<IRQ> [<ffffffff8024a168>] softlockup_tick+0xdb/0xed
[<ffffffff8020e635>] timer_interrupt+0x3a0/0x3ff
[<ffffffff80236595>] run_timer_softirq+0x3c/0x1ed
[<ffffffff8024a44d>] handle_IRQ_event+0x47/0x8e
[<ffffffff8024bb1a>] handle_level_irq+0xae/0x100
[<ffffffff8020c8d1>] do_IRQ+0xde/0x149
[<ffffffff8047fe58>] evtchn_do_upcall+0xac/0x12d
[<ffffffff8020a48e>] do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff805b0f1a>] _spin_lock+0x7/0xf
[<ffffffff8022bc96>] copy_process+0xa60/0x1757
[<ffffffff8022cbb4>] do_fork+0xb1/0x1a9
[<ffffffff80209d96>] system_call+0x86/0x8b
[<ffffffff8020a18d>] ptregscall_common+0x3d/0x70


At that point, I'd have to xm destroy the domains and re-create them. Some googling of "general protection fault" and "vma_prio_tree_remove" pointed directly at powernow-k8. I went and checked the dom0 kernels, and sure enough, CPU_FREQ was enabled and powernow-k8 was in there.
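For anyone wanting to make the same check without eyeballing the whole config, here is a rough Python sketch that lists the CPU_FREQ and powernow options that are switched on. The function name `enabled_cpufreq_options` and the sample config text are made up for illustration; they are not from my actual dom0 config.

```python
import re

def enabled_cpufreq_options(config_text):
    """Return {option: value} for CPU_FREQ / POWERNOW options set to y or m."""
    # Matches lines such as CONFIG_CPU_FREQ=y or CONFIG_X86_POWERNOW_K8=m.
    # Lines like "# CONFIG_FOO is not set" are ignored.
    pattern = re.compile(r"^(CONFIG_(?:CPU_FREQ\w*|X86_POWERNOW\w*))=([ym])$")
    found = {}
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found

# Hypothetical excerpt of a kernel .config, for illustration only.
sample_config = """\
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_X86_POWERNOW_K8=m
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_SMP=y
"""

for option, value in sorted(enabled_cpufreq_options(sample_config).items()):
    print(option, value)
```

On a real box you would pass in the contents of the kernel's .config file instead of the sample string. An empty result would suggest frequency scaling is compiled out.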

Recompiled, and I'm about to restart the machines. I really hope this helps. The domU crashes reproducibly when running an emerge, and it really sucks. It crashes and dumps a stack trace from different programs, so far including sed, ebuild.sh and x86_64-pc-linux. :(
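To compare crashes quickly, something like the following Python sketch can pull the process name and the faulting function out of each oops in a console log. It assumes the "Pid: ..., comm: ..." and "RIP: ..." line formats seen in the traces above; the names `summarize_oopses`, `PID_RE` and `RIP_RE` are just ones picked for this example.

```python
import re
from collections import Counter

# "Pid: 2234, comm: sed Not tainted ..." -> captures the comm field.
PID_RE = re.compile(r"^Pid: (\d+), comm: (\S+)")
# "RIP: e030:[<...>] [<...>] vma_prio_tree_remove+0x45/0xe1" -> captures the symbol.
# The later "RIP [<...>] ..." summary line has no colon, so it is not counted twice.
RIP_RE = re.compile(r"^RIP: .*\s(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")

def summarize_oopses(log_text):
    """Return a list of (comm, faulting_function) pairs, one per oops."""
    results = []
    comm = None
    for line in log_text.splitlines():
        line = line.strip()
        pid_match = PID_RE.match(line)
        if pid_match:
            comm = pid_match.group(2)
            continue
        rip_match = RIP_RE.match(line)
        if rip_match and comm is not None:
            results.append((comm, rip_match.group(1)))
            comm = None
    return results

# Lines copied from the two traces in this post.
sample_log = """\
Pid: 15747, comm: x86_64-pc-linux Not tainted 2.6.20-xen-r6-domU #10
RIP: e030:[<ffffffff802564dc>] [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
RIP [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
Pid: 2234, comm: sed Not tainted 2.6.20-xen-r6-domU #10
RIP: e030:[<ffffffff802564dc>] [<ffffffff802564dc>] vma_prio_tree_remove+0x45/0xe1
"""

pairs = summarize_oopses(sample_log)
by_function = Counter(function for _, function in pairs)
print(pairs)
print(by_function)
```

Different programs faulting at the same function, as in the two traces above, is the pattern this is meant to make obvious at a glance.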
userlame
Joined: 2004-03-17
Posts: 653
2008.04.25 - 13:40:49 PDT
Looks like this wasn't actually caused by powernow. I'm still able to reproduce it pretty consistently after removing CPU_FREQ entirely.

However, I am completely dumb, and I just found some new information that helps a lot: this is only occurring on one of my VM hosts. That is some great freakin' news. It means I can get memtest86 going and (hopefully) track down the bad memory.

I think I lucked out here.