LKML Archive mirror
* [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
@ 2004-02-04 23:17 Martin J. Bligh
  2004-02-04 23:58 ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-04 23:17 UTC (permalink / raw
  To: linux-kernel; +Cc: linux-mm mailing list, kmannth

http://bugme.osdl.org/show_bug.cgi?id=2019

           Summary: Bug from the mm subsystem involving X
    Kernel Version: kernel.org 2.6.2
            Status: NEW
          Severity: normal
             Owner: mm_numa-discontigmem@kernel-bugs.osdl.org
         Submitter: kmannth@us.ibm.com


Distribution:  Red Hat Enterprise Linux AS release 3 (Taroon Update 1)
Hardware Environment:  IBM x445 16-way 64gig of ram
Software Environment:  AS3.0 update 1 with stock 2.6.2
Problem Description:   The X server and the kernel do not play well together.

Steps to reproduce:   Load AS3.0 (any flavor), install a v2.6 kernel, and
start X on boot.

So there have been a lot of X issues with Red Hat and 2.6 kernels.  I managed to
get the system to panic, and I decided it was time to open this bug.  I got this
on boot up.

NET: Registered protocol family 17
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 296k freed
???????
Red Hat Enterprise Linux AS release 3 (Taroon Update 1)
Kernel 2.6.2 on an i686

elm3a80 login: Unable to handle kernel paging request at virtual address 0264d000
 printing eip:
c0147af4
*pde = 00000000
Oops: 0000 [#1]
CPU:    7
EIP:    0060:[<c0147af4>]    Not tainted
EFLAGS: 00013206
EIP is at remap_page_range+0x193/0x26c
eax: 0264d000   ebx: 000f5200   ecx: 00000001   edx: dad0fa80
esi: 001fe000   edi: d87c9ff0   ebp: f5200000   esp: d8835ee4
ds: 007b   es: 007b   ss: 0068
Process X (pid: 1285, threadinfo=d8834000 task=d9474ce0)
Stack: d961d580 001ff000 001ff000 40000000 f5002000 001fe000 d9578000 d961d580
       401ff000 d9576508 00000000 f5200000 d961d580 00000001 c0247055 d87d62c0
       401fe000 b5002000 00001000 00000027 d9388e80 00001000 c014a7fd d9388e80
Call Trace:
 [<c0247055>] mmap_mem+0x71/0xd4
 [<c014a7fd>] do_mmap_pgoff+0x362/0x70d
 [<c0156f65>] filp_open+0x67/0x69
 [<c0111c4d>] sys_mmap2+0x7a/0xaa
 [<c010aced>] sysenter_past_esp+0x52/0x71

Code: 8b 00 a9 00 08 00 00 74 10 89 d8 8b 54 24 4c c1 e8 14 09 ea
 <6>note: X[1285] exited with preempt_count 1
bad: scheduling while atomic!
Call Trace:
 [<c011da0a>] schedule+0x6d0/0x6d5
 [<c0122357>] __call_console_drivers+0x5b/0x5d
 [<c0122449>] call_console_drivers+0x69/0x11f
 [<c0223ffb>] rwsem_down_read_failed+0xa7/0x15a
 [<c012513f>] .text.lock.exit+0xeb/0x18c
 [<c010be11>] do_divide_error+0x0/0xfb
 [<c011a06f>] do_page_fault+0x1f8/0x561
 [<c0138b93>] find_get_page+0x3d/0x7a
 [<c0139db6>] filemap_nopage+0x287/0x378
 [<c013b166>] generic_file_aio_write+0x78/0xa2
 [<c0119e77>] do_page_fault+0x0/0x561
 [<c010b7a9>] error_code+0x2d/0x38
 [<c0147af4>] remap_page_range+0x193/0x26c
 [<c0247055>] mmap_mem+0x71/0xd4
 [<c014a7fd>] do_mmap_pgoff+0x362/0x70d
 [<c0156f65>] filp_open+0x67/0x69
 [<c0111c4d>] sys_mmap2+0x7a/0xaa
 [<c010aced>] sysenter_past_esp+0x52/0x71


Red Hat Enterprise Linux AS release 3 (Taroon Update 1)
Kernel 2.6.2 on an i686


My X version is XFree86-4.3.0-44.EL

Also, if I run anything proc-related against that pid (ps, top, ...), the login
session hangs (strace shows a read that never returns, on what I suppose is the
X pid).

Any thoughts, comments or suggestions are wanted.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-04 23:17 [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X (fwd) Martin J. Bligh
@ 2004-02-04 23:58 ` Linus Torvalds
  2004-02-05  0:12   ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2004-02-04 23:58 UTC (permalink / raw
  To: Martin J. Bligh; +Cc: linux-kernel, linux-mm mailing list, kmannth



On Wed, 4 Feb 2004, Martin J. Bligh wrote:
> 
> So there have been a lot of X issues with Red Hat and 2.6 kernels.  I managed to
> get the system to panic, and I decided it was time to open this bug.  I got this
> on boot up. 

Hmm. Compiler? Why would AS-3 in particular have problems?

> Unable to handle kernel paging request at virtual address 0264d000
>  printing eip:
> c0147af4
> *pde = 00000000
> Oops: 0000 [#1]
> CPU:    7
> EIP:    0060:[<c0147af4>]    Not tainted
> EFLAGS: 00013206
> EIP is at remap_page_range+0x193/0x26c
> eax: 0264d000   ebx: 000f5200   ecx: 00000001   edx: dad0fa80
> esi: 001fe000   edi: d87c9ff0   ebp: f5200000   esp: d8835ee4
> ds: 007b   es: 007b   ss: 0068
> Process X (pid: 1285, threadinfo=d8834000 task=d9474ce0)
> Stack: d961d580 001ff000 001ff000 40000000 f5002000 001fe000 d9578000 d961d580
>        401ff000 d9576508 00000000 f5200000 d961d580 00000001 c0247055 d87d62c0
>        401fe000 b5002000 00001000 00000027 d9388e80 00001000 c014a7fd d9388e80
> Call Trace:
>  [<c0247055>] mmap_mem+0x71/0xd4
>  [<c014a7fd>] do_mmap_pgoff+0x362/0x70d
>  [<c0156f65>] filp_open+0x67/0x69
>  [<c0111c4d>] sys_mmap2+0x7a/0xaa
>  [<c010aced>] sysenter_past_esp+0x52/0x71
> 
> Code: 8b 00 a9 00 08 00 00 74 10 89 d8 8b 54 24 4c c1 e8 14 09 ea

This _seems_ to be the code

		...
                if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
                        set_pte(pte, pfn_pte(pfn, prot));
		...

in particular, it disassembles to

	0x8048490 <insn>:       mov    (%eax),%eax
	0x8048492 <insn+2>:     test   $0x800,%eax
	0x8048497 <insn+7>:     je     0x80484a9
	0x8048499 <insn+9>:     mov    %ebx,%eax
	0x804849b <insn+11>:    mov    0x4c(%esp,1),%edx
	0x804849f <insn+15>:    shr    $0x14,%eax

which seems to be the "PageReserved(pfn_to_page(pfn))" test.

This implies that you have either:
 - a buggy "pfn_valid()" macro (do you use CONFIG_DISCONTIGMEM?)
 - or a buggy compiler (it sure ain't the compiler I use, since that one 
   will generate a "testb $8,%ah" instead)

It might help if you disassembled the code (in your kernel) around that 
point, since that might give a clue about it.

Quite honestly, to me it looks like the address being remapped is likely 
in %ebp (0xf5200000), and that pfn is in %ebx (0x000f5200), and that your 
pfn_valid() is buggered, causing a totally bogus "struct page *" from 
"pfn_to_page()".

		Linus
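
Linus's reading of the register dump rests on one shift: with 4 KiB pages, the
pfn is the address shifted right by 12 bits, which is how 0xf5200000 in %ebp
lines up with 0x000f5200 in %ebx. A minimal sketch of that arithmetic follows.
The names addr_to_pfn and pfn_to_addr are illustrative only and are not kernel
functions.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, as on i386 */

/* Page frame number of a 32-bit address: drop the in-page offset bits. */
static inline uint32_t addr_to_pfn(uint32_t addr)
{
	return addr >> PAGE_SHIFT;
}

/* Inverse direction: address of the first byte of a page frame. */
static inline uint32_t pfn_to_addr(uint32_t pfn)
{
	return pfn << PAGE_SHIFT;
}
```

Feeding 0xf5200000 to addr_to_pfn() gives 0x000f5200, matching the two register
values in the oops.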

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-04 23:58 ` Linus Torvalds
@ 2004-02-05  0:12   ` Martin J. Bligh
  2004-02-05  0:36     ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-05  0:12 UTC (permalink / raw
  To: Linus Torvalds; +Cc: linux-kernel, linux-mm mailing list, kmannth

>> So there have been a lot of X issues with Red Hat and 2.6 kernels.  I managed to
>> get the system to panic, and I decided it was time to open this bug.  I got this
>> on boot up. 
> 
> Hmm. Compiler? Why would AS-3 in particular have problems?

I think it's more likely the combination of NUMA and X. People hardly
ever run X on the big servers ... Keith is just odd ;-)
 
>> Unable to handle kernel paging request at virtual address 0264d000
>>  printing eip:
>> c0147af4
>> *pde = 00000000
>> Oops: 0000 [#1]
>> CPU:    7
>> EIP:    0060:[<c0147af4>]    Not tainted
>> EFLAGS: 00013206
>> EIP is at remap_page_range+0x193/0x26c
>> eax: 0264d000   ebx: 000f5200   ecx: 00000001   edx: dad0fa80
>> esi: 001fe000   edi: d87c9ff0   ebp: f5200000   esp: d8835ee4
>> ds: 007b   es: 007b   ss: 0068
>> Process X (pid: 1285, threadinfo=d8834000 task=d9474ce0)
>> Stack: d961d580 001ff000 001ff000 40000000 f5002000 001fe000 d9578000 d961d580
>>        401ff000 d9576508 00000000 f5200000 d961d580 00000001 c0247055 d87d62c0
>>        401fe000 b5002000 00001000 00000027 d9388e80 00001000 c014a7fd d9388e80
>> Call Trace:
>>  [<c0247055>] mmap_mem+0x71/0xd4
>>  [<c014a7fd>] do_mmap_pgoff+0x362/0x70d
>>  [<c0156f65>] filp_open+0x67/0x69
>>  [<c0111c4d>] sys_mmap2+0x7a/0xaa
>>  [<c010aced>] sysenter_past_esp+0x52/0x71
>> 
>> Code: 8b 00 a9 00 08 00 00 74 10 89 d8 8b 54 24 4c c1 e8 14 09 ea
> 
> This _seems_ to be the code
> 
> 		...
>                 if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
>                         set_pte(pte, pfn_pte(pfn, prot));
> 		...
> 
> in particular, it disassembles to
> 
> 	0x8048490 <insn>:       mov    (%eax),%eax
> 	0x8048492 <insn+2>:     test   $0x800,%eax
> 	0x8048497 <insn+7>:     je     0x80484a9
> 	0x8048499 <insn+9>:     mov    %ebx,%eax
> 	0x804849b <insn+11>:    mov    0x4c(%esp,1),%edx
> 	0x804849f <insn+15>:    shr    $0x14,%eax
> 
> which seems to be the "PageReserved(pfn_to_page(pfn))" test.
> 
> This implies that you have either:
>  - a buggy "pfn_valid()" macro (do you use CONFIG_DISCONTIGMEM?)

Yup.
#define pfn_valid(pfn)          ((pfn) < num_physpages)

Which is wrong. There's even a comment above it that says:

/*
 * pfn_valid should be made as fast as possible, and the current definition
 * is valid for machines that are NUMA, but still contiguous, which is what
 * is currently supported. A more generalised, but slower definition would
 * be something like this - mbligh:
 * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) )
 */

;-)

Which I still don't think is correct, as there's a hole in the middle of
node 0 ... I'll make a new patch up somehow and give to Keith to test ;-)

Thanks,

M.
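
To see why a single upper bound is not enough once memory has holes, here is a
small user-space sketch. The layout is invented for illustration (it is not the
x445's real memory map), and pfn_valid_flat / pfn_in_ranges are hypothetical
names, not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open range of page frame numbers: [start, end). */
struct pfn_range {
	uint64_t start;
	uint64_t end;
};

/* The flat test from the thread: everything below one limit is "valid". */
static inline int pfn_valid_flat(uint64_t pfn, uint64_t num_physpages)
{
	return pfn < num_physpages;
}

/* Hole-aware test: the pfn must fall inside one of the populated ranges. */
static inline int pfn_in_ranges(uint64_t pfn, const struct pfn_range *r,
				size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pfn >= r[i].start && pfn < r[i].end)
			return 1;
	return 0;
}

/* Invented layout: memory below 3 GiB, a hole, then memory above 4 GiB. */
static const struct pfn_range example_layout[] = {
	{ 0x000000, 0x0c0000 },
	{ 0x100000, 0x140000 },
};
#define EXAMPLE_NUM_PHYSPAGES 0x140000
```

With this layout the flat test accepts pfn 0xf5200 because it is below the
limit, while the range test rejects it because it sits in the hole.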


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  0:12   ` Martin J. Bligh
@ 2004-02-05  0:36     ` Martin J. Bligh
  2004-02-05  0:43       ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-05  0:36 UTC (permalink / raw
  To: Linus Torvalds
  Cc: linux-kernel, linux-mm mailing list, kmannth, Andrew Morton


>> which seems to be the "PageReserved(pfn_to_page(pfn))" test.
>> 
>> This implies that you have either:
>>  - a buggy "pfn_valid()" macro (do you use CONFIG_DISCONTIGMEM?)
> 
> Yup.
> #define pfn_valid(pfn)          ((pfn) < num_physpages)
> 
> Which is wrong. There's even a comment above it that says:
> 
> /*
>  * pfn_valid should be made as fast as possible, and the current definition
>  * is valid for machines that are NUMA, but still contiguous, which is what
>  * is currently supported. A more generalised, but slower definition would
>  * be something like this - mbligh:
>  * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) )
>  */
> 
> ;-)
> 
> Which I still don't think is correct, as there's a hole in the middle of
> node 0 ... I'll make a new patch up somehow and give to Keith to test ;-)

Oh hell ... I remember what's wrong with this whole bit. pfn_valid is
used inconsistently in different places, IIRC. Linus / Andrew ... what
do you actually want it to mean? Some things seem to use it to say
"the memory here is valid accessible RAM", some things "there is a 
valid struct page for this pfn". I was aiming for the latter, but a
few other arches seemed to disagree.

Could I get a ruling on this? ;-)

Thanks,

M.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  0:36     ` Martin J. Bligh
@ 2004-02-05  0:43       ` Linus Torvalds
  2004-02-05  0:56         ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2004-02-05  0:43 UTC (permalink / raw
  To: Martin J. Bligh
  Cc: linux-kernel, linux-mm mailing list, kmannth, Andrew Morton



On Wed, 4 Feb 2004, Martin J. Bligh wrote:
> 
> Oh hell ... I remember what's wrong with this whole bit. pfn_valid is
> used inconsistently in different places, IIRC. Linus / Andrew ... what
> do you actually want it to mean? Some things seem to use it to say
> "the memory here is valid accessible RAM", some things "there is a 
> valid struct page for this pfn". I was aiming for the latter, but a
> few other arches seemed to disagree.
> 
> Could I get a ruling on this? ;-)

It _definitely_ means "there is a valid 'struct page' for this pfn". 

To test for "there is RAM" here, you need to first check that the pfn is
valid, and then you can check what the page type is (usually that would be
PageReserved(), but it could be a highmem check or something like that
too).

		Linus
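
The ordering Linus describes (validate the pfn first, only then look at the
page) can be sketched in user space as below. Everything prefixed sketch_ is a
stand-in for illustration; the 0x800 mask is the one visible in the
"test $0x800,%eax" instruction quoted earlier in the thread, and the four-entry
table is an assumption made to keep the example small.

```c
#include <stddef.h>

#define SKETCH_PG_RESERVED 0x800UL	/* mask seen in the quoted disassembly */
#define SKETCH_NR_PAGES    4		/* tiny stand-in for the page array */

struct sketch_page {
	unsigned long flags;
};

static struct sketch_page sketch_mem_map[SKETCH_NR_PAGES];

/* "Is there a struct page for this pfn?" and nothing more. */
static inline int sketch_pfn_valid(unsigned long pfn)
{
	return pfn < SKETCH_NR_PAGES;
}

/* "Is this ordinary RAM?": only dereference the page after validating. */
static inline int sketch_pfn_is_ram(unsigned long pfn)
{
	if (!sketch_pfn_valid(pfn))
		return 0;
	return !(sketch_mem_map[pfn].flags & SKETCH_PG_RESERVED);
}
```

The point of the split is that sketch_mem_map[] is never indexed with a pfn
that failed the first check, which is the access that faulted in the oops.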

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  0:43       ` Linus Torvalds
@ 2004-02-05  0:56         ` Andrew Morton
  2004-02-05  1:29           ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2004-02-05  0:56 UTC (permalink / raw
  To: Linus Torvalds; +Cc: mbligh, linux-kernel, linux-mm, kmannth

Linus Torvalds <torvalds@osdl.org> wrote:
>
> 
> 
> On Wed, 4 Feb 2004, Martin J. Bligh wrote:
> > 
> > Oh hell ... I remember what's wrong with this whole bit. pfn_valid is
> > used inconsistently in different places, IIRC. Linus / Andrew ... what
> > do you actually want it to mean? Some things seem to use it to say
> > "the memory here is valid accessible RAM", some things "there is a 
> > valid struct page for this pfn". I was aiming for the latter, but a
> > few other arches seemed to disagree.
> > 
> > Could I get a ruling on this? ;-)
> 
> It _definitely_ means "there is a valid 'struct page' for this pfn". 
> 
> To test for "there is RAM" here, you need to first check that the pfn is
> valid, and then you can check what the page type is (usually that would be
> PageReserved(), but it could be a highmem check or something like that
> too).

pfn_valid() could become quite expensive indeed, and it lies on super-duper
hotpaths.

An alternative which is less conceptually clean but should work in this
case is to mark all vma's which were created by /dev/mem mappings as VM_IO,
and test that in remap_page_range().

The marking of mmap_mem() vma's as VM_IO has been in -mm for four months. 
But I didn't changelog it at the time and I've forgotten why I wrote it
(really).  It's something to do with get_user_pages() against a mapping of
/dev/mem :(

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.2-rc3/2.6.2-rc3-mm1/broken-out/get_user_pages-handle-VM_IO.patch
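
Andrew's alternative moves the decision from the pfn to the mapping: the
/dev/mem mmap path tags the vma, and later code consults the tag instead of
calling pfn_valid(). A rough sketch of that idea follows. The struct, the
function names, and the flag value are placeholders for illustration and are
not taken from the patch linked above.

```c
#define SKETCH_VM_IO 0x4000UL	/* placeholder bit for "I/O-style mapping" */

struct sketch_vma {
	unsigned long vm_flags;
};

/* What a /dev/mem mmap handler would do under this scheme: tag the vma. */
static inline void sketch_mmap_mem(struct sketch_vma *vma)
{
	vma->vm_flags |= SKETCH_VM_IO;
}

/* Page-table code asks the vma, not the pfn, whether struct page is usable. */
static inline int sketch_may_use_struct_page(const struct sketch_vma *vma)
{
	return !(vma->vm_flags & SKETCH_VM_IO);
}
```

The trade-off is the one Andrew names: the check becomes a cheap flag test, at
the cost of every relevant code path having to honour the flag.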


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  0:56         ` Andrew Morton
@ 2004-02-05  1:29           ` Linus Torvalds
  2004-02-05  1:56             ` Keith Mannthey
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2004-02-05  1:29 UTC (permalink / raw
  To: Andrew Morton; +Cc: mbligh, linux-kernel, linux-mm, kmannth



On Wed, 4 Feb 2004, Andrew Morton wrote:
> 
> pfn_valid() could become quite expensive indeed, and it lies on super-duper
> hotpaths.

Yes. However, sometimes it is the only choice. 

So it does need to be fixed, and if it ends up being a noticeable
performance problem, then we can look at the hot-paths one by one and see
if we can avoid using it. We probably can, most of the time.

> An alternative which is less conceptually clean but should work in this
> case is to mark all vma's which were created by /dev/mem mappings as VM_IO,
> and test that in remap_page_range().

Hmm.. Grepping for "pfn_valid()", I'm starting to suspect that yes, with a
VM_IO approach and a fixed virt_addr_valid(), there really aren't any
other uses.

(virt_addr_valid() is useful for debugging and for validation of untrusted
pointers, but pfn_valid() just isn't very good for it. Never really was:  
it started out as an ugly hack, and it never got cleaned up. It should be
easily fixable with something _proper_).

			Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  1:29           ` Linus Torvalds
@ 2004-02-05  1:56             ` Keith Mannthey
  2004-02-05  2:04               ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Mannthey @ 2004-02-05  1:56 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Andrew Morton, Martin J. Bligh, linux-kernel@vger.kernel.org,
	linux-mm

On Wed, 2004-02-04 at 17:29, Linus Torvalds wrote:
> 
> So it does need to be fixed, and if it ends up being a noticeable
> performance problem, then we can look at the hot-paths one by one and see
> if we can avoid using it. We probably can, most of the time.
> 

Martin sent me a patch that fixed the X panics (NUMA and DISCONTIG
enabled).  (Thanks Martin!) I don't have the same X panics and issues I
had before. I don't know if this will work for the generic case. It
compiles with a simple memory situation just fine but I didn't boot it. 


diff -purN -X /home/mbligh/.diff.exclude virgin/include/asm-i386/mmzone.h pfn_valid/include/asm-i386/mmzone.h
--- virgin/include/asm-i386/mmzone.h    2003-10-01 11:48:22.000000000 -0700
+++ pfn_valid/include/asm-i386/mmzone.h 2004-02-04 16:39:12.000000000 -0800
@@ -84,14 +84,8 @@ extern struct pglist_data *node_data[];
                + __zone->zone_start_pfn;                               \
 })
 #define pmd_page(pmd)          (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
-/*
- * pfn_valid should be made as fast as possible, and the current definition 
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
- */ 
-#define pfn_valid(pfn)          ((pfn) < num_physpages)
+
+#define pfn_valid(pfn) ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
 
 /*
  * generic node memory support, the following assumptions apply:



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  1:56             ` Keith Mannthey
@ 2004-02-05  2:04               ` Linus Torvalds
  2004-02-05  2:33                 ` Keith Mannthey
  2004-02-06  7:17                 ` Martin J. Bligh
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2004-02-05  2:04 UTC (permalink / raw
  To: Keith Mannthey
  Cc: Andrew Morton, Martin J. Bligh, linux-kernel@vger.kernel.org,
	linux-mm



On Wed, 4 Feb 2004, Keith Mannthey wrote:
> 
> Martin sent me a patch that fixed the X panics (NUMA and DISCONTIG
> enabled).  (Thanks Martin!) I don't have the same X panics and issues I
> had before. I don't know if this will work for the generic case. It
> compiles with a simple memory situation just fine but I didn't boot it. 

Looks ok, but the thing should be made a function (possibly inline, 
depending on how big the code generated ends up being). As it is, it now 
uses its arguments several times, and while I don't see anything where 
that could screw up, it's just a tad scary.

Also, related to this whole mess, what the _heck_ is this in mm/rmap.c:

        if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
                return pte_chain;

that "pfn_valid(page_to_pfn(page))" just looks totally nonsensical. Can
somebody really pass in random page pointers to this thing, and if so, are
they guaranteed to be "not-random enough" to not cause bogus behaviour
when the "page_to_pfn()" happens to be valid..

If VM_IO gets rid of this, then we should immediately apply the patch.

			Linus
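
The worry about a macro that "uses its arguments several times" is easy to
reproduce outside the kernel. In this sketch the argument has a side effect (a
call counter); the macro evaluates it twice, the inline function once. All
names are made up for the example.

```c
static int sketch_calls;

/* An argument expression with a visible side effect. */
static inline unsigned long sketch_next_pfn(void)
{
	sketch_calls++;
	return 5;
}

/* Macro form: "x" appears twice, so it can be evaluated twice. */
#define SKETCH_IN_RANGE(x, lo, hi)	((x) >= (lo) && (x) < (hi))

/* Function form: the argument is evaluated once, before the call. */
static inline int sketch_in_range(unsigned long x, unsigned long lo,
				  unsigned long hi)
{
	return x >= lo && x < hi;
}
```

Passing sketch_next_pfn() to SKETCH_IN_RANGE() bumps the counter twice when the
first comparison succeeds; passing it to sketch_in_range() bumps it once.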

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  2:04               ` Linus Torvalds
@ 2004-02-05  2:33                 ` Keith Mannthey
  2004-02-05  2:47                   ` Linus Torvalds
  2004-02-06  7:17                 ` Martin J. Bligh
  1 sibling, 1 reply; 24+ messages in thread
From: Keith Mannthey @ 2004-02-05  2:33 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Andrew Morton, Martin J. Bligh, linux-kernel@vger.kernel.org,
	linux-mm

On Wed, 2004-02-04 at 18:04, Linus Torvalds wrote:

> If VM_IO gets rid of this, then we should immediately apply the patch.


I tried Andrew's VM_IO patch earlier today but it didn't fix the
problem.  

Keith 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  2:33                 ` Keith Mannthey
@ 2004-02-05  2:47                   ` Linus Torvalds
  0 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2004-02-05  2:47 UTC (permalink / raw
  To: Keith Mannthey
  Cc: Andrew Morton, Martin J. Bligh, linux-kernel@vger.kernel.org,
	linux-mm



On Wed, 4 Feb 2004, Keith Mannthey wrote:
> 
> I tried Andrew's VM_IO patch earlier today but it didn't fix the
> problem.  

Yeah, that patch is not actually converting the pfn_valid() users to only
trust VM_IO, it only does a few special cases (notably the follow_pages()  
thing, which wasn't the issue here).

So the patch would have to be expanded to cover _all_ of the page table
following functions. It probably isn't that much, just looking for code
that checks for PageReserved() will pinpoint the needed users pretty well.

So I think the VM_IO approach could fix this, but it would need to be 
fleshed out more. In the meantime, fixing pfn_valid() is definitely the 
right thing to do.

			Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-05  2:04               ` Linus Torvalds
  2004-02-05  2:33                 ` Keith Mannthey
@ 2004-02-06  7:17                 ` Martin J. Bligh
  2004-02-06  7:19                   ` Martin J. Bligh
  2004-02-06  9:57                   ` Dave Hansen
  1 sibling, 2 replies; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-06  7:17 UTC (permalink / raw
  To: Linus Torvalds, Keith Mannthey
  Cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm

>> Martin sent me a patch that fixed the X panics (NUMA and DISCONTIG
>> enabled).  (Thanks Martin!) I don't have the same X panics and issues I
>> had before. I don't know if this will work for the generic case. It
>> compiles with a simple memory situation just fine but I didn't boot it. 
> 
> Looks ok, but the thing should be made a function (possibly inline, 
> depending on how big the code generated ends up being). As it is, it now 
> uses its arguments several times, and while I don't see anything where 
> that could screw up, it's just a tad scary.

Yup, sorry about that. Unfortunately fixing that gets into a small problem
with the definition of pfn_to_nid. I've had a small patch pending for ages
to clean up that mess anyway, so now is probably the right time to push it. 

pfn_to_nid patch follows, and I'll send the (rejigged) original patch in
a follow-up email. Andrew - I'm pretty sure this works fine but could you
possibly test it in -mm for a bit? 

Thanks,

M.

--------------------------------

Makes sure pfn_to_nid is defined for all combinations of subarches,
and that it's defined before it's used so we don't run into implicit
declaration problems. 
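
For readers following the arithmetic in the patch: 16777216 pages split into
256 elements gives 65536 pages per element, i.e. 256MB with 4096-byte pages,
and the node lookup is one table index. The user-space sketch below mirrors
that lookup. The sketch_ names are not from the patch, and the bounds check is
an addition for the example.

```c
#include <stdint.h>

#define MAX_NR_PAGES      16777216UL
#define MAX_ELEMENTS      256
#define PAGES_PER_ELEMENT (MAX_NR_PAGES / MAX_ELEMENTS)

/* One node id per 256MB element, filled in by whoever parses the layout. */
static uint8_t sketch_physnode_map[MAX_ELEMENTS];

static inline int sketch_pfn_to_nid(unsigned long pfn)
{
	unsigned long element = pfn / PAGES_PER_ELEMENT;

	if (element >= MAX_ELEMENTS)	/* added here; not in the patch */
		return -1;
	return sketch_physnode_map[element];
}
```

For the pfn from the oops, 0xf5200 / 65536 is element 15, so the node id is
whatever the table holds at index 15.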

diff -aurpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/mmzone.h pfn_to_nid/include/asm-i386/mmzone.h
--- virgin/include/asm-i386/mmzone.h	Mon Nov 17 18:28:57 2003
+++ pfn_to_nid/include/asm-i386/mmzone.h	Thu Feb  5 20:58:00 2004
@@ -10,7 +10,49 @@
 
 #ifdef CONFIG_DISCONTIGMEM
 
+#ifdef CONFIG_NUMA
+	#ifdef CONFIG_X86_NUMAQ
+		#include <asm/numaq.h>
+	#else	/* summit or generic arch */
+		#include <asm/srat.h>
+	#endif
+#else /* !CONFIG_NUMA */
+	#define get_memcfg_numa get_memcfg_numa_flat
+	#define get_zholes_size(n) (0)
+#endif /* CONFIG_NUMA */
+
 extern struct pglist_data *node_data[];
+#define NODE_DATA(nid)		(node_data[nid])
+
+/*
+ * generic node memory support, the following assumptions apply:
+ *
+ * 1) memory comes in 256Mb contigious chunks which are either present or not
+ * 2) we will not have more than 64Gb in total
+ *
+ * for now assume that 64Gb is max amount of RAM for whole system
+ *    64Gb / 4096bytes/page = 16777216 pages
+ */
+#define MAX_NR_PAGES 16777216
+#define MAX_ELEMENTS 256
+#define PAGES_PER_ELEMENT (MAX_NR_PAGES/MAX_ELEMENTS)
+
+extern u8 physnode_map[];
+
+static inline int pfn_to_nid(unsigned long pfn)
+{
+#ifdef CONFIG_NUMA
+	return(physnode_map[(pfn) / PAGES_PER_ELEMENT]);
+#else
+	return 0;
+#endif
+}
+
+static inline struct pglist_data *pfn_to_pgdat(unsigned long pfn)
+{
+	return(NODE_DATA(pfn_to_nid(pfn)));
+}
+
 
 /*
  * Following are macros that are specific to this numa platform.
@@ -43,11 +85,6 @@ extern struct pglist_data *node_data[];
  */
 #define kvaddr_to_nid(kaddr)	pfn_to_nid(__pa(kaddr) >> PAGE_SHIFT)
 
-/*
- * Return a pointer to the node data for node n.
- */
-#define NODE_DATA(nid)		(node_data[nid])
-
 #define node_mem_map(nid)	(NODE_DATA(nid)->node_mem_map)
 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid)						\
@@ -92,40 +129,6 @@ extern struct pglist_data *node_data[];
  * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
  */ 
 #define pfn_valid(pfn)          ((pfn) < num_physpages)
-
-/*
- * generic node memory support, the following assumptions apply:
- *
- * 1) memory comes in 256Mb contigious chunks which are either present or not
- * 2) we will not have more than 64Gb in total
- *
- * for now assume that 64Gb is max amount of RAM for whole system
- *    64Gb / 4096bytes/page = 16777216 pages
- */
-#define MAX_NR_PAGES 16777216
-#define MAX_ELEMENTS 256
-#define PAGES_PER_ELEMENT (MAX_NR_PAGES/MAX_ELEMENTS)
-
-extern u8 physnode_map[];
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
-	return(physnode_map[(pfn) / PAGES_PER_ELEMENT]);
-}
-static inline struct pglist_data *pfn_to_pgdat(unsigned long pfn)
-{
-	return(NODE_DATA(pfn_to_nid(pfn)));
-}
-
-#ifdef CONFIG_X86_NUMAQ
-#include <asm/numaq.h>
-#elif CONFIG_ACPI_SRAT
-#include <asm/srat.h>
-#elif CONFIG_X86_PC
-#define get_zholes_size(n) (0)
-#else
-#define pfn_to_nid(pfn)		(0)
-#endif /* CONFIG_X86_NUMAQ */
 
 extern int get_memcfg_numa_flat(void );
 /*
diff -aurpN -X /home/fletch/.diff.exclude virgin/include/linux/mmzone.h pfn_to_nid/include/linux/mmzone.h
--- virgin/include/linux/mmzone.h	Wed Feb  4 23:03:38 2004
+++ pfn_to_nid/include/linux/mmzone.h	Thu Feb  5 21:01:05 2004
@@ -311,6 +311,7 @@ extern struct pglist_data contig_page_da
 #define NODE_DATA(nid)		(&contig_page_data)
 #define NODE_MEM_MAP(nid)	mem_map
 #define MAX_NODES_SHIFT		1
+#define pfn_to_nid(pfn)		(0)
 
 #else /* CONFIG_DISCONTIGMEM */
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06  7:17                 ` Martin J. Bligh
@ 2004-02-06  7:19                   ` Martin J. Bligh
  2004-02-06  9:57                   ` Dave Hansen
  1 sibling, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-06  7:19 UTC (permalink / raw
  To: Linus Torvalds, Keith Mannthey
  Cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm

Fix pfn_valid for architectures with discontiguous memory.
This only changes the NUMA definition, and it leaves the NUMA-Q
definition as was, because it's faster that way, it's in hotpaths,
and our memory is always contiguous.
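
A user-space model of the node-bounded check may help when reading the diff.
Two things in it are assumptions rather than facts from the patch: the element
table is signed so that -1 can mean "no node here" (with the u8 physnode_map
shown in the companion patch a negative node id cannot be stored, so a
"nid >= 0" test would always pass), and the layout helpers exist only to set
up the example.

```c
#include <stdint.h>

#define PAGES_PER_ELEMENT 65536UL
#define MAX_ELEMENTS      256
#define SKETCH_MAX_NODES  4	/* arbitrary size for the example */

static int8_t sketch_node_of_element[MAX_ELEMENTS];	/* -1: no node */
static unsigned long sketch_node_end[SKETCH_MAX_NODES];

/* Mark every element as belonging to no node. */
static inline void sketch_layout_reset(void)
{
	int i;

	for (i = 0; i < MAX_ELEMENTS; i++)
		sketch_node_of_element[i] = -1;
	for (i = 0; i < SKETCH_MAX_NODES; i++)
		sketch_node_end[i] = 0;
}

/* Assign the elements covering [start_pfn, end_pfn) to node nid. */
static inline void sketch_layout_add_node(int nid, unsigned long start_pfn,
					  unsigned long end_pfn)
{
	unsigned long e;

	if (nid < 0 || nid >= SKETCH_MAX_NODES)
		return;
	for (e = start_pfn / PAGES_PER_ELEMENT;
	     e < MAX_ELEMENTS && e * PAGES_PER_ELEMENT < end_pfn; e++)
		sketch_node_of_element[e] = (int8_t)nid;
	sketch_node_end[nid] = end_pfn;
}

/* Valid only if the pfn maps to a node and lies below that node's end. */
static inline int sketch_pfn_valid(unsigned long pfn)
{
	unsigned long element = pfn / PAGES_PER_ELEMENT;
	int nid;

	if (element >= MAX_ELEMENTS)
		return 0;
	nid = sketch_node_of_element[element];
	if (nid < 0)
		return 0;
	return pfn < sketch_node_end[nid];
}
```

Like the patch, this only compares against the end of the node, so a hole in
the middle of a node (the case Martin raised for node 0) would still pass.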

diff -aurpN -X /home/fletch/.diff.exclude pfn_to_nid/include/asm-i386/mmzone.h pfn_valid/include/asm-i386/mmzone.h
--- pfn_to_nid/include/asm-i386/mmzone.h	Thu Feb  5 20:58:00 2004
+++ pfn_valid/include/asm-i386/mmzone.h	Thu Feb  5 22:08:57 2004
@@ -121,14 +121,19 @@ static inline struct pglist_data *pfn_to
 		+ __zone->zone_start_pfn;				\
 })
 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
-/*
- * pfn_valid should be made as fast as possible, and the current definition 
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
- */ 
+
+#ifdef CONFIG_X86_NUMAQ            /* we have contiguous memory on NUMA-Q */
 #define pfn_valid(pfn)          ((pfn) < num_physpages)
+#else
+static inline int pfn_valid(unsigned long pfn)
+{
+	int nid = pfn_to_nid(pfn);
+
+	if (nid >= 0)
+		return (pfn < node_end_pfn(nid));
+	return 0;
+}
+#endif
 
 extern int get_memcfg_numa_flat(void );
 /*


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06  7:17                 ` Martin J. Bligh
  2004-02-06  7:19                   ` Martin J. Bligh
@ 2004-02-06  9:57                   ` Dave Hansen
  2004-02-06 15:49                     ` Martin J. Bligh
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2004-02-06  9:57 UTC (permalink / raw
  To: Martin J. Bligh
  Cc: Linus Torvalds, Keith Mannthey, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-mm

On Thu, 2004-02-05 at 23:17, Martin J. Bligh wrote:
> +#ifdef CONFIG_NUMA
> +	#ifdef CONFIG_X86_NUMAQ
> +		#include <asm/numaq.h>
> +	#else	/* summit or generic arch */
> +		#include <asm/srat.h>
> +	#endif
> +#else /* !CONFIG_NUMA */
> +	#define get_memcfg_numa get_memcfg_numa_flat
> +	#define get_zholes_size(n) (0)
> +#endif /* CONFIG_NUMA */

We ran into a bug with #ifdefs like this before.  It was fixed in some
of the code that you're trying to remove.

It's not safe to assume that NUMA && !NUMAQ means SUMMIT.  Remember the
linking errors we got when we turned CONFIG_NUMA on with the regular PC
config?  The generic arch wasn't a problem because it sets
CONFIG_X86_SUMMIT and compiles in the summit code, but the regular PC
code doesn't.  

Also, I don't think we need the #ifdef CONFIG_NUMA around the whole
block.  How about something like this?

#ifdef CONFIG_X86_NUMAQ
	#include <asm/numaq.h>
#elif defined(CONFIG_X86_SUMMIT)
	#include <asm/srat.h>
#else
	#define get_memcfg_numa get_memcfg_numa_flat
	#define get_zholes_size(n) (0)
#endif /* CONFIG_X86_NUMAQ */


> +static inline int pfn_to_nid(unsigned long pfn)
> +{
> +#ifdef CONFIG_NUMA
> +	return(physnode_map[(pfn) / PAGES_PER_ELEMENT]);
> +#else
> +	return 0;
> +#endif
> +}

Looks like somebody pasted that in from a macro. "(pfn)" :)

--dave
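
The shape Dave is proposing is a three-way choice with a safe fall-through:
only a configuration that positively asks for NUMA-Q or Summit gets the
matching header, and everything else lands on the flat defaults. A
self-contained sketch of that selection pattern follows, using made-up
SKETCH_ macros in place of the real CONFIG_ options and an enum in place of
the headers.

```c
enum sketch_memcfg {
	SKETCH_MEMCFG_FLAT,
	SKETCH_MEMCFG_NUMAQ,
	SKETCH_MEMCFG_SRAT
};

/* Pick a source by positive match; anything unrecognised falls to flat. */
#if defined(SKETCH_X86_NUMAQ)
#define SKETCH_MEMCFG_SOURCE SKETCH_MEMCFG_NUMAQ
#elif defined(SKETCH_X86_SUMMIT)
#define SKETCH_MEMCFG_SOURCE SKETCH_MEMCFG_SRAT
#else
#define SKETCH_MEMCFG_SOURCE SKETCH_MEMCFG_FLAT
#endif

static inline enum sketch_memcfg sketch_memcfg_source(void)
{
	return SKETCH_MEMCFG_SOURCE;
}
```

Built with neither macro defined (the "regular PC" case), the function returns
SKETCH_MEMCFG_FLAT rather than pulling in the SRAT path by default.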


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06  9:57                   ` Dave Hansen
@ 2004-02-06 15:49                     ` Martin J. Bligh
  2004-02-06 17:22                       ` Dave Hansen
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-06 15:49 UTC (permalink / raw
  To: Dave Hansen
  Cc: Linus Torvalds, Keith Mannthey, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-mm

>> +#ifdef CONFIG_NUMA
>> +	#ifdef CONFIG_X86_NUMAQ
>> +		#include <asm/numaq.h>
>> +	#else	/* summit or generic arch */
>> +		#include <asm/srat.h>
>> +	#endif
>> +#else /* !CONFIG_NUMA */
>> +	#define get_memcfg_numa get_memcfg_numa_flat
>> +	#define get_zholes_size(n) (0)
>> +#endif /* CONFIG_NUMA */
> 
> We ran into a bug with #ifdefs like this before.  It was fixed in some
> of the code that you're trying to remove.

What bug?
 
> It's not safe to assume that NUMA && !NUMAQ means SUMMIT.  Remember the
> linking errors we got when we turned CONFIG_NUMA on with the regular PC
> config?  The generic arch wasn't a problem because it sets
> CONFIG_X86_SUMMIT and compiles in the summit code, but the regular PC
> code doesn't.  
> 
> Also, I don't think we need the #ifdef CONFIG_NUMA around the whole
> block.  How about something like this?

If you want to go change it, and test the crap out of it for 3 months on
a variety of platforms, then go for it. What's here works, and is well
tested - I'm sticking with it, unless you can point out a specific case
where it's wrong.

M.



* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06 15:49                     ` Martin J. Bligh
@ 2004-02-06 17:22                       ` Dave Hansen
  2004-02-06 19:59                         ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2004-02-06 17:22 UTC (permalink / raw
  To: Martin J. Bligh
  Cc: Linus Torvalds, Keith Mannthey, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-mm

On Fri, 2004-02-06 at 07:49, Martin J. Bligh wrote:
> >> +#ifdef CONFIG_NUMA
> >> +	#ifdef CONFIG_X86_NUMAQ
> >> +		#include <asm/numaq.h>
> >> +	#else	/* summit or generic arch */
> >> +		#include <asm/srat.h>
> >> +	#endif
> >> +#else /* !CONFIG_NUMA */
> >> +	#define get_memcfg_numa get_memcfg_numa_flat
> >> +	#define get_zholes_size(n) (0)
> >> +#endif /* CONFIG_NUMA */
> > 
> > We ran into a bug with #ifdefs like this before.  It was fixed in some
> > of the code that you're trying to remove.
> 
> What bug?

With a regular PC config, plus CONFIG_NUMA turned on:
  CC      arch/i386/kernel/process.o
In file included from include/asm/mmzone.h:17,
                 from include/linux/mmzone.h:318,
                 from include/linux/gfp.h:4,
                 from include/linux/slab.h:15,
                 from include/linux/percpu.h:4,
                 from include/linux/sched.h:31,
                 from include/linux/module.h:10,
                 from init/do_mounts.c:1:
include/asm/srat.h:31: #error CONFIG_ACPI_SRAT not defined, and srat.h header has been included
In file included from include/asm/mmzone.h:17,
                 from include/linux/mmzone.h:318,
                 from include/linux/gfp.h:4,
                 from include/linux/slab.h:15,
                 from include/linux/percpu.h:4,
                 from include/linux/rcupdate.h:42,
                 from include/linux/dcache.h:10,
                 from include/linux/fs.h:17,
                 from init/do_mounts_initrd.c:3:

I can post the config if you like.  You were the one who made me go fix
it in the first place.  That's why I added that #error. :)
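
A rough standalone sketch of the idea behind that #error: name every
supported combination explicitly and fail the build for anything else,
rather than letting "NUMA && !NUMAQ" fall through to the Summit header.
This is not the code from either patch. MEMCFG_BACKEND and
memcfg_backend() are hypothetical names, and the two CONFIG_ defines at
the top stand in for what Kconfig would normally provide.

```c
/* Assumption: stand-ins for Kconfig output; here a Summit NUMA build. */
#define CONFIG_NUMA 1
#define CONFIG_X86_SUMMIT 1

#if defined(CONFIG_NUMA)
# if defined(CONFIG_X86_NUMAQ)
#  define MEMCFG_BACKEND "numaq"
# elif defined(CONFIG_X86_SUMMIT)
#  define MEMCFG_BACKEND "srat"
# else
   /* e.g. the plain PC subarch with CONFIG_NUMA switched on */
#  error "CONFIG_NUMA set without NUMAQ or SUMMIT: no memory-config backend"
# endif
#else /* !CONFIG_NUMA */
# define MEMCFG_BACKEND "flat"
#endif

/* Hypothetical helper: reports which backend the preprocessor picked. */
static const char *memcfg_backend(void)
{
	return MEMCFG_BACKEND;
}
```

Removing the CONFIG_X86_SUMMIT define while leaving CONFIG_NUMA set
stops the build at the #error line instead of at link time.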

--dave



* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06 17:22                       ` Dave Hansen
@ 2004-02-06 19:59                         ` Martin J. Bligh
  2004-02-06 20:16                           ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-06 19:59 UTC (permalink / raw
  To: Dave Hansen
  Cc: Linus Torvalds, Keith Mannthey, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-mm, Andi Kleen

--On Friday, February 06, 2004 09:22:49 -0800 Dave Hansen <haveblue@us.ibm.com> wrote:

> On Fri, 2004-02-06 at 07:49, Martin J. Bligh wrote:
>> >> +#ifdef CONFIG_NUMA
>> >> +	#ifdef CONFIG_X86_NUMAQ
>> >> +		#include <asm/numaq.h>
>> >> +	#else	/* summit or generic arch */
>> >> +		#include <asm/srat.h>
>> >> +	#endif
>> >> +#else /* !CONFIG_NUMA */
>> >> +	#define get_memcfg_numa get_memcfg_numa_flat
>> >> +	#define get_zholes_size(n) (0)
>> >> +#endif /* CONFIG_NUMA */
>> > 
>> > We ran into a bug with #ifdefs like this before.  It was fixed in some
>> > of the code that you're trying to remove.
>> 
>> What bug?
> 
> With a regular PC config, plus CONFIG_NUMA turned on:

Ah ... that's the problem. That's not a valid config - the correct way
to do that is with generic arch, not the PC one. Somehow we ended up
leaving that as allowable ... I think that was just a communication
breakdown somewhere between you, Andi, and myself (or quite possibly
between myself and myself ;-)).

So ... I still think my original patch is correct (there's some stylistic
stuff we could debate, but it's not a functional problem). Here's an
additional patch that stops people from turning on NUMA for the PC
subarch, which it wasn't designed to work with.

Thanks,

M.

-------------------------------------------------------------

Disallow NUMA on the i386 PC subarch (it doesn't work, nor was it intended to).

diff -purN -X /home/mbligh/.diff.exclude pfn_to_nid/arch/i386/Kconfig pc_numa/arch/i386/Kconfig
--- pfn_to_nid/arch/i386/Kconfig	2004-02-04 16:23:49.000000000 -0800
+++ pc_numa/arch/i386/Kconfig	2004-02-06 11:16:19.000000000 -0800
@@ -701,7 +701,7 @@ config X86_PAE
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation Support"
-	depends on SMP && HIGHMEM64G && (X86_PC || X86_NUMAQ || X86_GENERICARCH || (X86_SUMMIT && ACPI && !ACPI_HT_ONLY))
+	depends on SMP && HIGHMEM64G && (X86_NUMAQ || X86_GENERICARCH || (X86_SUMMIT && ACPI && !ACPI_HT_ONLY))
 	default n if X86_PC
 	default y if (X86_NUMAQ || X86_SUMMIT)
 



* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06 19:59                         ` Martin J. Bligh
@ 2004-02-06 20:16                           ` Linus Torvalds
  2004-02-06 21:18                             ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2004-02-06 20:16 UTC (permalink / raw
  To: Martin J. Bligh
  Cc: Dave Hansen, Keith Mannthey, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-mm, Andi Kleen



On Fri, 6 Feb 2004, Martin J. Bligh wrote:
> 
> Ah ... that's the problem. That's not a valid config

It really _should_ be a valid config, though. Otherwise, nobody can ever 
test it in any reasonable way on a regular PC.

So why not allow a NUMA config for a PC (and it should end up as being 
just one node, of course)?

		Linus


* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-06 20:16                           ` Linus Torvalds
@ 2004-02-06 21:18                             ` Martin J. Bligh
  0 siblings, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-06 21:18 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Dave Hansen, Keith Mannthey, Andrew Morton,
	linux-kernel@vger.kernel.org, linux-mm, Andi Kleen

> On Fri, 6 Feb 2004, Martin J. Bligh wrote:
>> 
>> Ah ... that's the problem. That's not a valid config
> 
> It really _should_ be a valid config, though. Otherwise, nobody can ever 
> test it in any reasonable way on a regular PC.
> 
> So why not allow a NUMA config for a PC (and it should end up as being 
> just one node, of course)?

We have that - it's what the generic arch is. It's also good for distros, 
as it'll enable them to build one binary kernel and run it on flat SMP 
boxes and the Summit/x440 boxes.

If we really want to do good testing, we should make a fake NUMA config
that can run a 4x SMP box as fake NUMA, with half the memory in each
"node" and half the processors ... but I never got around to coding that ;-)
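
Something like the following is roughly what that split would boil down
to; a sketch only, nothing that exists in the tree. FAKE_NODES,
fake_cpu_to_node() and fake_pfn_to_nid() are hypothetical names, and
contiguous blocks of cpus and pages per fake node are an assumption.

```c
/* Hypothetical: number of fake nodes to carve the SMP box into. */
#define FAKE_NODES 2

/*
 * Hand out cpus in contiguous blocks: with 4 cpus, cpus 0-1 land on
 * fake node 0 and cpus 2-3 on fake node 1.
 */
static int fake_cpu_to_node(unsigned int cpu, unsigned int nr_cpus)
{
	unsigned int per_node = (nr_cpus + FAKE_NODES - 1) / FAKE_NODES;

	if (per_node == 0)	/* no cpus reported: avoid dividing by zero */
		return 0;
	return (int)(cpu / per_node);
}

/*
 * Split physical memory the same way: the lower half of the page
 * frames belongs to fake node 0, the upper half to fake node 1.
 */
static int fake_pfn_to_nid(unsigned long pfn, unsigned long max_pfn)
{
	unsigned long per_node = (max_pfn + FAKE_NODES - 1) / FAKE_NODES;

	if (per_node == 0)	/* no memory reported: avoid dividing by zero */
		return 0;
	return (int)(pfn / per_node);
}
```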

M.



* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
       [not found]                             ` <220850000.1076102320@flay.suse.lists.linux.kernel>
@ 2004-02-07  3:54                               ` Andi Kleen
  2004-02-07  4:49                                 ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2004-02-07  3:54 UTC (permalink / raw
  To: Martin J. Bligh; +Cc: linux-kernel

"Martin J. Bligh" <mbligh@aracnet.com> writes:
 
> If we really want to do good testing, we should make a fake NUMA config
> that can run a 4x SMP box as fake NUMA, with half the memory in each
> "node" and half the processors ... but I never got around to coding that ;-)

I have such a patch for x86-64 if anybody is interested in that.

x86-64 low level NUMA is quite different from IA32 NUMA though so it 
would be a bit difficult to port.

-Andi


* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-07  3:54                               ` Andi Kleen
@ 2004-02-07  4:49                                 ` Martin J. Bligh
  2004-02-07  5:21                                   ` Andi Kleen
  2004-02-07  6:37                                   ` Nick Piggin
  0 siblings, 2 replies; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-07  4:49 UTC (permalink / raw
  To: Andi Kleen, Nick Piggin; +Cc: linux-kernel

--Andi Kleen <ak@suse.de> wrote (on Saturday, February 07, 2004 04:54:03 +0100):

> "Martin J. Bligh" <mbligh@aracnet.com> writes:
>  
>> If we really want to do good testing, we should make a fake NUMA config
>> that can run a 4x SMP box as fake NUMA, with half the memory in each
>> "node" and half the processors ... but I never got around to coding that ;-)
> 
> I have such a patch for x86-64 if anybody is interested in that.
> 
> x86-64 low level NUMA is quite different from IA32 NUMA though so it 
> would be a bit difficult to port.

Not quite sure what you mean ... I was driving at pretending an SMP box
was NUMA ... but the x86_64 is already NUMA ... are you grouping nodes
together into single nodes with 2 cpus each?

What might be intriguing is to use Nick's domains stuff to create a hierarchy
for the scheduler where we have 1 cpu nodes and 2 cpu nodes above that, but
still keep the normal NUMA stuff flat for mem allocation. What might also be
interesting is a hierarchy where, if these are the HT links between the cpus:

1 --- 2
|     |
|     | 
|     |
3 --- 4

then domains of (1,2,3) (2,3,4) (1,3,4) (1,2,4), with a view to restricting
the "double hop" traffic as much as possible. But I'm not sure the domains
code copes with multiple overlapping domains - Nick?
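
To make the overlap concrete, here is a small standalone sketch that
builds each of those domains as "a cpu plus its two direct neighbors" on
the ring drawn above. It is illustrative only and has nothing to do with
how the domains code represents things. ring_neighbors, cpu_bit() and
one_hop_domain() are hypothetical names, and cpus are numbered 1-4 as in
the diagram, with bit N of the mask standing for cpu N.

```c
/* Links from the diagram: 1-2, 2-4, 4-3, 3-1. Index 0 is unused. */
static const int ring_neighbors[5][2] = {
	{ 0, 0 },
	{ 2, 3 },	/* cpu 1 */
	{ 1, 4 },	/* cpu 2 */
	{ 1, 4 },	/* cpu 3 */
	{ 2, 3 },	/* cpu 4 */
};

/* Bit N of a domain mask stands for cpu N. */
static unsigned int cpu_bit(int cpu)
{
	return 1u << cpu;
}

/*
 * Domain seen from one cpu: itself plus everything one hop away.
 * The cpu diagonally opposite (two hops away) is left out.
 */
static unsigned int one_hop_domain(int cpu)
{
	return cpu_bit(cpu) |
	       cpu_bit(ring_neighbors[cpu][0]) |
	       cpu_bit(ring_neighbors[cpu][1]);
}
```

Every cpu sits in three of the four masks, which is the overlap in
question.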

Andi, do you already set up the mem allocation fallback zonelists like that?

M.



* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-07  4:49                                 ` Martin J. Bligh
@ 2004-02-07  5:21                                   ` Andi Kleen
  2004-02-07  6:37                                   ` Nick Piggin
  1 sibling, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2004-02-07  5:21 UTC (permalink / raw
  To: Martin J. Bligh; +Cc: piggin, linux-kernel

On Fri, 06 Feb 2004 20:49:40 -0800
"Martin J. Bligh" <mbligh@aracnet.com> wrote:

> Not quite sure what you mean ... I was driving at pretending an SMP box
> was NUMA ... but the x86_64 is already NUMA ... are you grouping nodes
> together into single nodes with 2 cpus each?

There are Opteron boxes which are not NUMA. Or rather they are NUMA, but only
have a single node. Some of the cheaper boards only connect the DIMM
slots to a single CPU, which gives you only a single node even with
two CPUs. One of the test machines I have here is of this type. 

It's also useful for testing on simulators.

> Andi, do you already set up the mem allocation fallback zonelists like that?

I don't do anything special, it's all generic page_alloc.c logic.
 
-Andi


* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-07  4:49                                 ` Martin J. Bligh
  2004-02-07  5:21                                   ` Andi Kleen
@ 2004-02-07  6:37                                   ` Nick Piggin
  2004-02-07  7:31                                     ` Martin J. Bligh
  1 sibling, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2004-02-07  6:37 UTC (permalink / raw
  To: Martin J. Bligh; +Cc: Andi Kleen, linux-kernel



Martin J. Bligh wrote:

>--Andi Kleen <ak@suse.de> wrote (on Saturday, February 07, 2004 04:54:03 +0100):
>
>
>>"Martin J. Bligh" <mbligh@aracnet.com> writes:
>> 
>>
>>>If we really want to do good testing, we should make a fake NUMA config
>>>that can run a 4x SMP box as fake NUMA, with half the memory in each
>>>"node" and half the processors ... but I never got around to coding that ;-)
>>>
>>I have such a patch for x86-64 if anybody is interested in that.
>>
>>x86-64 low level NUMA is quite different from IA32 NUMA though so it 
>>would be a bit difficult to port.
>>
>
>Not quite sure what you mean ... I was driving at pretending an SMP box
>was NUMA ... but the x86_64 is already NUMA ... are you grouping nodes
>together into single nodes with 2 cpus each?
>
>What might be intriguing is to use Nick's domains stuff to create a hierarchy
>for the scheduler where we have 1 cpu nodes and 2 cpu nodes above that, but
>still keep the normal NUMA stuff flat for mem allocation. What might also be
>interesting is a hierarchy where, if these are the HT links between the cpus:
>
>1 --- 2
>|     |
>|     | 
>|     |
>3 --- 4
>
>then domains of (1,2,3) (2,3,4) (1,3,4) (1,2,4), with a view to restricting
>the "double hop" traffic as much as possible. But I'm not sure the domains
>code copes with multiple overlapping domains - Nick?
>
>

Yes it can do ring topologies like this. I'm pretty sure it can do
just about any sort of topology although this is one that I sat
down and drew when designing it.

You can technically restrict a double hop, but after you move, say,
clockwise once, you might just as easily be moved clockwise again.
The only way to restrict this is with some kind of home domain thing.
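
As a rough illustration of what a "home domain" check could look like
(a sketch under my own assumptions, not anything in the current code):
remember the cpu a task started on and only allow moves that stay
within one hop of it. ring_neighbors, ring_hops() and
migration_allowed() are hypothetical names, with cpus numbered 1-4 as
in the diagram.

```c
#include <stdbool.h>

/* Links from the diagram: 1-2, 2-4, 4-3, 3-1. Index 0 is unused. */
static const int ring_neighbors[5][2] = {
	{ 0, 0 },
	{ 2, 3 },	/* cpu 1 */
	{ 1, 4 },	/* cpu 2 */
	{ 1, 4 },	/* cpu 3 */
	{ 2, 3 },	/* cpu 4 */
};

/* Hop count between two cpus on the 4-cpu ring: 0, 1 or 2. */
static int ring_hops(int a, int b)
{
	if (a == b)
		return 0;
	if (ring_neighbors[a][0] == b || ring_neighbors[a][1] == b)
		return 1;
	return 2;
}

/*
 * Hypothetical policy: a move is fine as long as the destination is
 * at most one hop from the task's home cpu, wherever it runs now.
 */
static bool migration_allowed(int home, int dest)
{
	return ring_hops(home, dest) <= 1;
}
```

Without the home cpu, the moves 1 -> 2 and then 2 -> 4 each look like a
single hop; checked against home cpu 1, the second move is refused
because cpu 4 is two hops away.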



* Re: [Bugme-new] [Bug 2019] New: Bug from the mm subsystem involving X  (fwd)
  2004-02-07  6:37                                   ` Nick Piggin
@ 2004-02-07  7:31                                     ` Martin J. Bligh
  0 siblings, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2004-02-07  7:31 UTC (permalink / raw
  To: Nick Piggin; +Cc: Andi Kleen, linux-kernel

>> 1 --- 2
>> |     |
>> |     |
>> |     |
>> 3 --- 4
>> 
>> then domains of (1,2,3) (2,3,4) (1,3,4) (1,2,4), with a view to restricting
>> the "double hop" traffic as much as possible. But I'm not sure the domains
>> code copes with multiple overlapping domains - Nick?
>> 
>> 
> 
> Yes it can do ring topologies like this. I'm pretty sure it can do
> just about any sort of topology although this is one that I sat
> down and drew when designing it.
> 
> You can technically restrict a double hop, but after you move, say,
> clockwise once, you might just as easily be moved clockwise again.
> The only way to restrict this is with some kind of home domain thing.

Well, this doesn't ban it, but requiring the double migrate will curtail
it somewhat, which is better than nothing.

M.


