Linux-mm Archive mirror
* de-luxe zone allocator, design 2
@ 1998-07-24 21:03 Rik van Riel
  1998-07-27 11:12 ` Stephen C. Tweedie
  0 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 1998-07-24 21:03 UTC (permalink / raw
  To: Linux MM

[-- Attachment #1: Type: TEXT/PLAIN, Size: 961 bytes --]

Hi,

here is the new design for the de-luxe zone allocator.

With the reports of Linux booting in 3 MB it's probably
time for some low-mem adjustments, but in general this
scheme should be somewhat better designed overall.

The biggest change is the fact that the zone queues
have been dropped for user pages in favor of page
queues with multi-level LRU reclamation with lazy
reclaim.

We also want to do page table repacking, or even
swapping. This _is_ feasible, as long as we don't
touch the page directories...
Maybe we want to put the page directories with the
SLAB and stack stuff and allocate page tables together
with user pages (we _can_ safely repack page tables).

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

[-- Attachment #2: Zone allocator, design 2 --]
[-- Type: TEXT/PLAIN, Size: 12301 bytes --]

<HTML>
<HEAD>
   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
   <META NAME="Author" CONTENT="Rik van Riel">
   <META NAME="GENERATOR" CONTENT="Mozilla/4.05 [en] (X11; I; Linux 2.1.108 i586) [Netscape]">
   <META NAME="Description" CONTENT="This is the white paper for the new, zone-based Linux memory allocator.">
   <META NAME="Keywords" CONTENT="memory, management, mm, linux, zone, allocator, zone-based, memory management, kernel, white paper, source, Rik van Riel">
   <TITLE>Design for the Linux zone based memory allocator</TITLE>
</HEAD>
<BODY>
&nbsp;
<CENTER><TABLE>
<TR>
<TD>[<A HREF="http://www.phys.uu.nl/~riel/mm-patch/">Linux MM home page</A>]</TD>

<TD>[<A HREF="http://www.linuxhq.com/">Linux Headquarters</A>]</TD>

<TD>[<A HREF="http://www.phys.uu.nl/~riel/">My home page</A>]</TD>
</TR>
</TABLE></CENTER>

<HR NOSHADE WIDTH="100%">
<CENTER>
<H1>
Linux MM: design for a zone based memory allocator</H1></CENTER>

<CENTER>
<ADDRESS>
Rik van Riel, July 1998</ADDRESS></CENTER>


<P>&nbsp;
<BR>One of the biggest problems currently facing the Linux memory management
subsystem is memory fragmentation. This is the result of several developments
in other parts of the Linux kernel, most importantly the growth of each
process's kernel stack to 8 kB and the dynamic allocation of DMA and networking
buffers. These factors, together with a general speedup of both peripheral
hardware and the device drivers, have led to a situation where the currently
used buddy allocator just can't cut it anymore. This white paper is divided
into 3 pieces: <A HREF="#problem">the problem</A>, <A HREF="#solution">the
solution</A> and some <A HREF="#code">actual code</A>. I need a lot of
comments and hints for possible improvement, so feel free to <A HREF="mailto:H.H.vanRiel@phys.uu.nl">email
them to me</A>...
<H2>
<A NAME="problem"></A>The problem</H2>
The problem is caused by the fact that memory is allocated in chunks of
different sizes. For most types of usage we just allocate memory one page
(4 kB on most machines) at a time, but sometimes we hand out larger pieces
of memory (2, 4, 8, 16 or 32 pages at once). Because most
UNIX (and Linux) machines run with memory completely full (free memory is
wasted memory), it is next to impossible to free larger areas, and the
best we can do is be very careful not to hand out those large areas when
we only need a small one.

<P>There have been (and there are) several workarounds for this fragmentation
issue; one of them (PTE chaining) even involves a physical-to-logical translation,
an almost reverse-page-table-like solution. With that project, we can swap
out pages based on their physical address, thus force-freeing the one
page that blocks an entire 128 kB area. This would solve most of our problems,
except when that last page is unswappable, for example a page table or
a program's kernel stack. In that case, we're screwed regardless of what
deallocation scheme we're using.

<P>Because our inability to hand out larger chunks of memory has an impact
on system functionality, and could even have an impact on system stability,
it seems warranted to sacrifice a little bit of speed (the buddy system
is fast!) in order to solve most of the above problems. The main problem
with the current system is that it doesn't differentiate between swappable
and unswappable memory, leading to a system where page tables and other
cruft are scattered all over the place, making it impossible to free up
one large contiguous area.

<P>This problem is made even worse by the fact that on some architectures
we can only do DMA to addresses under 16 MB, and it will undoubtedly show
up again in some years when we all have 16 GB of memory and try to do DMA
to those oldie 32 bit PCI cards that don't support dual address cycle mode
:-)
<H2>
<A NAME="solution"></A>The solution</H2>
The solution is to hand out free zones of 128 kB each, and to use each
zone for one type of usage only. Then we can be sure that no page tables
interfere with the freeing of a zone of user memory, and we can always
just free an area of memory.

<P>In the current Linux kernel, we have the following uses for memory:
<UL>
<LI>
<B>reserved memory, kernel code</B> and <B>statically allocated kernel
structures</B>: after system boot we never mess with the layout of this
memory, so it's a non-issue wrt. the allocator</LI>

<LI>
<B>user memory</B>: this memory can be swapped out and/or relocated at
will; it is allocated one page at a time and gives us no trouble, apart
from the fact that we always need more than we have physically available;
no special requirements</LI>

<LI>
<B>kernel stack</B>: we allocate 8 kB (2 pages) of unswappable kernel stack
for each process; each of those stacks needs to be physically contiguous
and it needs to be in fast memory (not in uncached memory)</LI>

<LI>
<B>page tables</B>: page directories are unswappable, page tables and (on
some machines) page middle directories can be moved/swapped with great
caution; the memory for these is given out one page at a time; we only
look up the page tables every once in a while so speed is not very critical;
when we have uncached memory, we'd rather use it for page tables than for
user pages</LI>

<LI>
<B>small SLAB</B>: SLAB memory is used for dynamic kernel data; it is allocated
and freed at will, unfortunately this will is not ours but that of the
(device) driver that requested the memory; speed is critical</LI>

<LI>
<B>large SLAB</B>: the same as small SLAB, but sometimes the kernel wants
large chunks (> 2 pages); we make the distinction between the two because
we don't want to face hopeless fragmentation inside the SLAB zones...</LI>

<LI>
<B>DMA buffers</B>: this memory needs to be physically below a certain
boundary (16 MB for ISA DMA) and is often allocated in chunks of 32, 64
or 128 kB</LI>
</UL>
For small (&lt; 16 MB) machines, the above scheme is overkill and we treat
several types of usage as one. We can, for instance, treat large SLAB and
DMA the same, and small SLAB, kernel stack and page tables can be allocated
in the same zones too. Small SLAB and kernel stack will be treated the
same on every machine; the distinction is made only because I want the
documentation to be complete.

<P>In addition to this, we can differentiate between 3 different kinds
of memory:
<UL>
<LI>
<B>DMA memory</B>: this memory is located under the 16 MB limit and is
cached by the L1 and L2 caches</LI>

<LI>
<B>'normal' memory</B>: this memory is located above the DMA limit and
is cached by the L1 and L2 caches, it can not be used for DMA buffers</LI>

<LI>
<B>slow memory</B>: this memory is either not cached or sits on an add-on
board; it can not be used for DMA buffers, and using it for time-critical
kernel stacks and SLAB would be disastrous for performance; we also don't
want to use it for CPU-intensive user applications</LI>
</UL>
Since we don't want to waste whatever slow memory we might have, we can use
it for page tables and user memory that isn't used very often. If we
have user memory in slow memory and it turns out that it is used very often,
we can always use the swap code to relocate it to fast memory. DMA memory
is scarce, so we only want to allocate it when we specifically need it or
when we don't have any other memory left.

<P>This leads to the following zone allocation orders:
<BR>&nbsp;
<TABLE BORDER COLS=4 NOSAVE >
<TR NOSAVE>
<TD NOSAVE>SLAB and kernel stack</TD>

<TD>user memory</TD>

<TD>page tables</TD>

<TD>DMA buffers</TD>
</TR>

<TR ALIGN=LEFT VALIGN=TOP NOSAVE>
<TD>
<LI>
normal memory</LI>

<LI>
DMA memory</LI>

<LI>
slow memory</LI>
</TD>

<TD>
<LI>
normal memory</LI>

<LI>
slow memory</LI>

<LI>
DMA memory</LI>
</TD>

<TD ALIGN=LEFT VALIGN=TOP NOSAVE>
<LI>
slow memory</LI>

<LI>
normal memory</LI>

<LI>
DMA memory</LI>
</TD>

<TD>
<LI>
DMA memory</LI>
</TD>
</TR>
</TABLE>


<P>This means that when, for instance, we run out of user memory and there
is enough free memory available, we first try to grab a zone of 'normal'
memory; if that fails, we look for a free area of slow memory, and DMA memory
is tried last.
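<P>As a sketch only (all identifiers here are hypothetical, made up for this
example), the allocation orders in the table above can be encoded as per-usage
fallback lists that a zone grabber walks in order:

```c
/* Hypothetical encoding of the zone allocation orders from the table
 * above; all names are illustrative, not from any kernel source. */
enum mem_type { MEM_NORMAL, MEM_DMA, MEM_SLOW, MEM_NONE };
enum zone_use { USE_SLAB_STK, USE_USER, USE_PTBL, USE_DMABUF };

/* Per-usage fallback order, terminated by MEM_NONE. */
static const enum mem_type fallback[4][4] = {
	{ MEM_NORMAL, MEM_DMA,    MEM_SLOW, MEM_NONE }, /* SLAB and kernel stack */
	{ MEM_NORMAL, MEM_SLOW,   MEM_DMA,  MEM_NONE }, /* user memory */
	{ MEM_SLOW,   MEM_NORMAL, MEM_DMA,  MEM_NONE }, /* page tables */
	{ MEM_DMA,    MEM_NONE,   MEM_NONE, MEM_NONE }, /* DMA buffers */
};

/* Return the first memory type in the fallback list for `use` that
 * has a free zone; have[] is indexed by enum mem_type. */
enum mem_type pick_zone_type(enum zone_use use, const int have[3])
{
	int i;

	for (i = 0; i < 4 && fallback[use][i] != MEM_NONE; i++)
		if (have[fallback[use][i]])
			return fallback[use][i];
	return MEM_NONE;
}
```

For example, a user-memory request with normal memory exhausted falls through
to slow memory, while a DMA-buffer request never falls back at all.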
<H3>
Page allocation</H3>
For SLAB, page table and DMA memory we always try to allocate from the
fullest zone available and we grab a free zone when we're out of our own
memory. In order to grab the fullest zone, we keep these zones in a (partially?)
sorted order. For large SLAB/DMA&nbsp;areas we will also want to keep in
mind the sizes of the memory chunks previously allocated in this zone.

<P>User pages are kept on a number of linked lists: active, inactive, clean
and free. New pages enter the inactive queue; allocations are satisfied
from the free queue first, moving on to the clean queue when we're out of
free pages. Inactive pages either get promoted to the active queue (when
they're in heavy use) or demoted to the clean queue (when they're dirty,
we have to clean them first). Pages in the clean queue are also unmapped
from the page table and thus already 'halfway swapped out'. Pages only
enter the free list when a program free()s pages or when we add a new zone
to the user area.
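<P>The queue transitions described above can be sketched as follows (an
illustrative model only; the function and queue names are invented for this
example):

```c
/* Illustrative model of the multi-level LRU queues described above;
 * all names are hypothetical. */
enum page_queue { PQ_FREE, PQ_CLEAN, PQ_INACTIVE, PQ_ACTIVE };

/* Allocations come from the free queue first, then from the clean
 * queue (clean pages are already unmapped, so reuse is cheap). */
enum page_queue alloc_source(unsigned long nr_free)
{
	return nr_free ? PQ_FREE : PQ_CLEAN;
}

/* Scanning an inactive page: heavy use promotes it to the active
 * queue; otherwise it is demoted to the clean queue, but a dirty
 * page must be written out first and stays inactive until then. */
enum page_queue scan_inactive(int referenced, int dirty)
{
	if (referenced)
		return PQ_ACTIVE;
	return dirty ? PQ_INACTIVE : PQ_CLEAN;
}
```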

<P>In order to be able to free new zones (for when SLAB gets overly active),
we need to be able to mark a relatively free zone as force-freeable. Upon
scanning such a page, kswapd will free it and make sure it isn't allocated
again. When the PTE chaining system gets integrated into the kernel, we
can just force-free a user zone with relatively few active pages when the
system runs out of free zones. Until then, we'll need to keep two free zones
and walk the page tables to find and free the pages.
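<P>Picking the force-free candidate could look something like this sketch
(hypothetical names; it simply selects the reclaimable user zone with the
fewest in-use pages):

```c
/* Hypothetical victim selection: among reclaimable user zones, pick
 * the one with the fewest in-use pages; returns -1 if none qualify. */
struct zone_info {
	unsigned long in_use;	/* pages still in use in this zone */
	int reclaimable;	/* zone may be force-freed */
};

int pick_victim(const struct zone_info *z, int n)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if (!z[i].reclaimable)
			continue;
		if (best < 0 || z[i].in_use < z[best].in_use)
			best = i;
	}
	return best;
}
```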
<H2>
<A NAME="code"></A>Actual code</H2>
There's not much actual code yet, but all the administrative details
are ready. ALPHA status reached: the .h file is done :)
<PRE>/*
 * The struct mem_zone is used to describe a 32 page memory area.
 */

struct mem_zone {
	struct mem_zone *prev, *next; /* The previous and next zone on this list */
	unsigned long used; /* Used-pages bitmap for SLAB, etc.; a use count for user zones */
	unsigned long flags;
};

/*
 * Flags for mem_zone->flags
 */

#define ZONE_DMA    0x00000001 /* DMA memory */
#define ZONE_SLOW   0x00000002 /* uncached/slow memory */
#define ZONE_USER   0x00000004 /* usermode pages; these defines are for paranoia only */
#define ZONE_SLAB   0x00000008 /* large SLAB */
#define ZONE_STK    0x00000010 /* kernel stack and order-1 SLAB (and order-0 SLAB if there is slow memory) */
#define ZONE_PTBL   0x00000020 /* page tables and one-page SLAB (except when there is slow memory) */
#define ZONE_DMABUF 0x00000040 /* DMA buffers */
#define ZONE_RECL   0x00000080 /* We are reclaiming this zone */
#define ZONE_0      0x00000100 /* loose pages allocated */
#define ZONE_1      0x00000200 /* order-1 (2^1 = 2 page) chunks allocated */
#define ZONE_2      0x00000400 /* etc... In order to help in buddy-like allocation for */
#define ZONE_3      0x00000800 /* large SLAB zones on small memory machines. */
#define ZONE_4      0x00001000
#define ZONE_5      0x00002000

/*
 * Memory statistics
 */

typedef struct {
	unsigned long used;
	unsigned long free;
} zone_stats_t;

struct memstats {
	zone_stats_t ptbl;
	zone_stats_t stk;
	zone_stats_t slab;
	zone_stats_t dma;
	/* Slightly different structs for these */
	struct {
		unsigned long active;
		unsigned long inactive;
		unsigned long clean;	/* we do lazy reclamation */
		unsigned long free;
	} user;
	struct {
		unsigned long dma;	/* different memory types */
		unsigned long normal;
		unsigned long slow;
	} free;
	struct {
		unsigned long num_physpages;
		unsigned long reserved;	/* reserved pages */
		unsigned long kernel;	/* taken by static kernel stuff */
	} misc;
};

/* This is where we find the different zones */

struct memzones {
	struct {
		struct mem_zone dma;
		struct mem_zone normal;
		struct mem_zone slow;
	} free;
	struct mem_zone dma;
	struct mem_zone user;
	struct mem_zone slab;
	struct mem_zone stk;
	struct mem_zone ptbl;
};

</PRE>
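<P>To illustrate how the 32-bit <TT>used</TT> bitmap in <TT>struct mem_zone</TT>
might hand out single pages (a sketch only, not part of the design above):
find the first clear bit, set it, and return the page's index inside the zone.

```c
/* Sketch of single-page allocation within one 32-page zone, using
 * the zone's `used` field as a bitmap; returns the page index inside
 * the zone, or -1 when the zone is full. */
int zone_alloc_page(unsigned long *used)
{
	int i;

	for (i = 0; i < 32; i++) {
		if (!(*used & (1UL << i))) {
			*used |= 1UL << i;
			return i;
		}
	}
	return -1;
}

/* Clear the bit again when the page is given back. */
void zone_free_page(unsigned long *used, int page)
{
	*used &= ~(1UL << page);
}
```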
</BODY>
</HTML>


* Re: de-luxe zone allocator, design 2
  1998-07-24 21:03 de-luxe zone allocator, design 2 Rik van Riel
@ 1998-07-27 11:12 ` Stephen C. Tweedie
  1998-07-28 16:15   ` Rik van Riel
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen C. Tweedie @ 1998-07-27 11:12 UTC (permalink / raw
  To: Rik van Riel; +Cc: Linux MM

Hi,

On Fri, 24 Jul 1998 23:03:18 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> With the reports of Linux booting in 3 MB it's probably
> time for some low-mem adjustments, but in general this
> scheme should be somewhat better designed overall.

In 3MB, zoning in 128k chunks is crazy --- you want 8k chunks, max!
The extra cost incurred by less efficient packing will also cripple
you on 3MB with zones.  The problem definition is different on such
small machines.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org


* Re: de-luxe zone allocator, design 2
  1998-07-27 11:12 ` Stephen C. Tweedie
@ 1998-07-28 16:15   ` Rik van Riel
  1998-07-29 11:04     ` Stephen C. Tweedie
  0 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 1998-07-28 16:15 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Linux MM

On Mon, 27 Jul 1998, Stephen C. Tweedie wrote:
> On Fri, 24 Jul 1998 23:03:18 +0200 (CEST), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > With the reports of Linux booting in 3 MB it's probably
> > time for some low-mem adjustments, but in general this
> > scheme should be somewhat better designed overall.
> 
> In 3MB, zoning in 128k chunks is crazy --- you want 8k chunks, max!

Agreed, although 16k chunks would probably be better for
3 and 4 MB machines (allows DMA and network buffers).

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+



* Re: de-luxe zone allocator, design 2
  1998-07-28 16:15   ` Rik van Riel
@ 1998-07-29 11:04     ` Stephen C. Tweedie
  1998-07-29 18:00       ` Rik van Riel
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen C. Tweedie @ 1998-07-29 11:04 UTC (permalink / raw
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Linux MM

Hi,

On Tue, 28 Jul 1998 18:15:41 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Mon, 27 Jul 1998, Stephen C. Tweedie wrote:
>> On Fri, 24 Jul 1998 23:03:18 +0200 (CEST), Rik van Riel
>> <H.H.vanRiel@phys.uu.nl> said:
>> 
>> > With the reports of Linux booting in 3 MB it's probably
>> > time for some low-mem adjustments, but in general this
>> > scheme should be somewhat better designed overall.
>> 
>> In 3MB, zoning in 128k chunks is crazy --- you want 8k chunks, max!

> Agreed, although 16k chunks would probably be better for
> 3 and 4 MB machines (allows DMA and network buffers).

In 2.1.112 we've now chopped down the default slab size for large
objects from 16k to 8k, so 8k should be OK.  (8k NFS still needs 16k
chunks, but you really want to be using 4k or smaller blocks on such
low memory machines anyway, for precisely this reason.)

--Stephen


* Re: de-luxe zone allocator, design 2
  1998-07-29 11:04     ` Stephen C. Tweedie
@ 1998-07-29 18:00       ` Rik van Riel
  1998-07-30 23:35         ` Stephen C. Tweedie
  0 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 1998-07-29 18:00 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Linux MM

On Wed, 29 Jul 1998, Stephen C. Tweedie wrote:
> On Tue, 28 Jul 1998 18:15:41 +0200 (CEST), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
> 
> > Agreed, although 16k chunks would probably be better for
> > 3 and 4 MB machines (allows DMA and network buffers).
> 
> In 2.1.112 we've now chopped down the default slab size for large
> objects from 16k to 8k, so 8k should be OK.  (8k NFS still needs 16k
> chunks, but you really want to be using 4k or smaller blocks on such
> low memory machines anyway, for precisely this reason.)

Isn't packing locality _very_ important on machines where you
have no memory to spare? In that case 16k would probably be
better; PLUS 16k will allow for real DMA buffers and such.

When we have lazy page reclamation, we won't have to keep as
many pages free as we do now, so 16k areas should be OK...
(we'll only need to keep 2 of those free on most small
machines)

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+



* Re: de-luxe zone allocator, design 2
  1998-07-29 18:00       ` Rik van Riel
@ 1998-07-30 23:35         ` Stephen C. Tweedie
  0 siblings, 0 replies; 6+ messages in thread
From: Stephen C. Tweedie @ 1998-07-30 23:35 UTC (permalink / raw
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Linux MM

Hi,

On Wed, 29 Jul 1998 20:00:14 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> Isn't packing locality _very_ important on machines where you
> have no memory to spare? In that case 16k would probably be
> better; PLUS 16k will allow for real DMA buffers and stuff.

Not for short-lived packets, where you don't want to reserve a whole 16k
or more just for a single instance of an NFS packet.

--Stephen

