* NUMA API
@ 2004-04-30  7:35 Ulrich Drepper
  2004-04-30  8:30 ` William Lee Irwin III
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ulrich Drepper @ 2004-04-30  7:35 UTC (permalink / raw
  To: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 4077 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In the last weeks I have been working on designing a new API for a NUMA
support library.  I am aware of the code in libnuma by ak but this code
has many shortcomings:

~ inadequate topology discovery
~ fixed cpu set size
~ no clear separation of memory nodes
~ no inclusion of SMT/multicore in the cpu hierarchy
~ awkward (at best) memory allocation interface
~ etc etc

and last but not least

~ a completely unacceptable library interface (e.g., global variables as
part of the API, WTF?)

At the end of the attached document is a comparison of the two APIs.


I'm posting about this only now because I wanted to get some sanity
checks on the API first.  Some of our (i.e., Red Hat's) partners
provided those checks.  They might identify themselves, or not.  This
is not because other parties are meant to be excluded.


The API described here is meant to be minimal so that it can be wrapped
for use in any kind of higher-level language (or even in another C
library using the interface).  For this reason the CPU and memory node
sets are not handled by an abstract data type but instead as bitmaps.
Using an abstract data type (in C) would restrict the way wrapper
libraries can be designed.  In a C++ wrapper, for instance, the bit sets
certainly should be abstract.  A later version of the attached document
might try to provide higher-level interfaces.


The text of the API proposal is not yet polished.  In fact, most
descriptions are fairly short.  I want to get some more assurance that
the API is received well before spending significantly more time on it.

As specified, the implementation of the interface is designed with only
the requirements of a program on NUMA hardware in mind.  I have paid no
attention to the currently proposed kernel extensions.  If the latter do
not really allow implementing the functionality programmers need then
they are wasted effort.

For instance, I think the way memory is allocated in interleaved fashion
is not "ideal".  Interleaved allocation is a property of a specific
allocation.  Global states for processes (or threads) are a terrible way
to handle this and other properties since they require the programmer to
constantly switch the mode back and forth: any part of the runtime
might be NUMA aware and reset the mode.

Also, the concept of hard/soft sets for CPUs is useful.  Likewise
"spilling" over to other memory nodes.  Usually using NUMA means hinting
the desired configuration to the system.  It'll be used whenever
possible.  If it is not possible (for instance, if a given processor is
not available) it is usually not a good idea to fail the execution
completely.  Instead a less optimal resource should be used.  For memory
it is hard to know how much memory on which node is in use etc.

Another missing feature in libnuma and the current kernel design is
support for changes in the configuration.  CPUs might be added or
removed, likewise memory.  Additional interconnects between NUMA blocks
might be added etc.


Overall I think the proposed API provides an architecture-independent,
future-safe NUMA API.  If no program uses the kernel functionality
directly (which is possible with the API) the kernel interface can be
changed and adapted for each architecture or even specific machine
without the program noticing it.


The selection of names for the functions is by no means fixed.  These
are proposals.  I'm open to constructive criticism.  In case you find
interfaces to be missing, wrong, or not optimal, please let me know as
well.  Once the API is regarded as useful we can start thinking about
the kernel interface, so keep these two things separate.


Please direct comments to me.  In case there is interest I can set up a
separate mailing list since lkml is probably not the best venue.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAkgG92ijCOnn/RHQRAqgUAJ9bJ83LxSZ43TW5+5I1VhXV+zRPNACgnjmQ
SnFjDhA7v+5CGaZO5/jOxhw=
=93mp
-----END PGP SIGNATURE-----

[-- Attachment #2: numa-if --]
[-- Type: text/plain, Size: 24459 bytes --]

		      Thoughts about a NUMA API

Ulrich Drepper
Red Hat, Inc.
Time-stamp: <2004-04-29 01:18:32 drepper>

*** Very early draft.  I'll clean up the interface when I get some positive
*** feedback.


The technology used in NUMA machines is still evolving, which means
that any proposed interface will fall short over time.  We cannot
think about every possibility and nuance the hardware designers come
up with.  The following is a list of assumptions made for this
document.  Some assumptions will be too general for some
implementations, which allows simplification.  But the interfaces
should cover more designs.

1.  Non-uniform resources are processors and memory

2.  The address spaces of processors overlap

3.  Possible measure: distance of processors

    The distance is measured by the minimal difference of cost of
    accessing memory.

    ~ SMT and multi-core (MC) processors share some processor cache;

    ~ processors on the same SMP node (which might just be one
      processor in size) have the same distance, which is larger
      than the SMT/MC distance;

    ~ processors on different NUMA nodes increase in distance with each
      interconnect which has to be used.

4.  The machine's architecture can change over time.

    ~ hotplug CPUs/RAM

    ~ dynamically enabling/disabling parts of the machine based on
      resource requirements


Example
=======

  +----------------------------------------+
  | +--------------+      +--------------+ |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T1|    |T1| |      | |T1|    |T1| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T2|    |T2| |      | |T2|    |T2| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | |  C1      C2  |      |  C1      C2  | |
  | +------P1------+      +------P2------+ |
  |                                        |
  | +------------------------------------+ |
  | |                 M1                 | |
  | +------------------------------------+ |
  +-------------------N1-------------------+

  +----------------------------------------+
  | +--------------+      +--------------+ |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T1|    |T1| |      | |T1|    |T1| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T2|    |T2| |      | |T2|    |T2| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | |  C1      C2  |      |  C1      C2  | |
  | +------P1------+      +------P2------+ |
  |                                        |
  | +------------------------------------+ |
  | |                 M2                 | |
  | +------------------------------------+ |
  +-------------------N2-------------------+

  +----------------------------------------+
  | +--------------+      +--------------+ |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T1|    |T1| |      | |T1|    |T1| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T2|    |T2| |      | |T2|    |T2| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | |  C1      C2  |      |  C1      C2  | |
  | +------P1------+      +------P2------+ |
  |                                        |
  | +------------------------------------+ |
  | |                 M3                 | |
  | +------------------------------------+ |
  +-------------------N3-------------------+

These are three NUMA blocks, each consisting of two SMP processors,
each of which has two cores which themselves have two threads each.  We
use the notation T1:C2:P2:N3 for the first thread, in the second core, in
the second processor, on the third node.  The main memory in the nodes
is represented by M1, M2, M3.

A simplistic measure in this case could be: each additional level of
memory which has to be accessed doubles the cost.  So we might have the
following costs:

  T1:C1:P1:N1 <-> T2:C1:P1:N1   ==  1
  T1:C1:P1:N1 <-> T1:C2:P1:N1   ==  2
  T1:C1:P1:N1 <-> T1:C1:P2:N1   ==  4
  T1:C1:P1:N1 <-> T1:C1:P1:N2   ==  8
  T1:C1:P1:N1 <-> T1:C1:P1:N3   ==  16 (i.e., 2 * 8 since two interconnects
                                        are used)

It might be better to compute the distance based on real memory access
costs.  The above is just an example.

The above costs automatically take into account when the main memory
of a node has to be used or when data is shared in caches.

A second cost measure does not take the sharing of data between
processors into account but instead measures access to data stored in a
specific memory node.

  T1:C1:P1:N1 ->  M1            == 4
  T1:C1:P1:N1 ->  M2            == 8
  T1:C1:P1:N1 ->  M3            == 16

This cost can be derived from the more detailed CPU-to-CPU cost.  But
since there can be memory nodes without CPUs, and since often it is not
sharing data between CPUs but access to stored memory which matters,
this simplified cost is useful, too.


Interfaces
==========

The interfaces can be grouped:

1. Topology.  Programs need to know about the machine's layout.

2. Placement/affinity

   ~ of execution
   ~ of memory allocation

3. Realignment: adjust placement/affinity to new situation

4. Temporal changes



Topology Interfaces
-------------------

Two different types of information must be accessible:

1.  enumeration of the memory hierarchies

    This includes SMT/MC

2.  distance


The fundamental data type is a bitset with each bit representing a
processor.  glibc defines cpu_set_t.  The size is arbitrarily large.
We might introduce interfaces to dynamically allocate them.  For now,
cpu_set_t is a fixed-size type.

CPU_SETSIZE                        number of processors in cpu_set_t

CPU_SET_S(cpu, setsize, cpuset)    set bit corresponding to CPU in CPUSET
CPU_CLR_S(cpu, setsize, cpuset)    clear bit corresponding to CPU in CPUSET
CPU_ISSET_S(cpu, setsize, cpuset)  check whether bit corresponding to CPU is set
CPU_ZERO_S(setsize, cpuset)        clear set

CPU_EQUAL_S(setsize1, cpuset1, setsize2, cpuset2)
                                   Check whether the set bits in the two
                                   sets match.
CPU_EQUAL(cpuset1, cpuset2)  CPU_EQUAL_S(sizeof(cpu_set_t), cpuset1,
                                         sizeof(cpu_set_t), cpuset2)


CPU_SET(cpu, cpuset)         CPU_SET_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_CLR(cpu, cpuset)         CPU_CLR_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_ISSET(cpu, cpuset)       CPU_ISSET_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_ZERO(cpuset)             CPU_ZERO_S(sizeof(cpu_set_t), cpuset)


We probably need the following:

CPU_AND_S(destsize, destset, set1size, srcset1, set2size, srcset2)

   logical AND of srcset1 and srcset2; place the result in destset, which
   might be one of the source sets

CPU_OR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical OR of srcset1 and srcset2; place the result in destset, which
   might be one of the source sets

CPU_XOR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical XOR of srcset1 and srcset2; place the result in destset, which
   might be one of the source sets

For dynamic allocation:

__cpu_mask             type for array element in cpu_set_t

CPU_ALLOC_SIZE(count)  number of bytes needed for a cpu_set_t which can
                       represent at least CPU number COUNT

CPU_ALLOC(count)       allocate a cpu_set_t which can represent at least
                       CPU number COUNT

CPU_FREE(cpuset)       free a CPU set previously allocated with CPU_ALLOC()


Maybe interfaces to iterate over a set are useful (C++ interface).
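
To illustrate the intended use of the dynamically allocated sets, here is
a minimal sketch.  It assumes the allocation macros behave as described
above (CPU_ALLOC() returning a pointer to a sufficiently large set,
CPU_ALLOC_SIZE() returning its size in bytes); the exact calling
conventions and error handling are of course still open.

  /* Sketch: build a set large enough for 4096 CPUs and mark every
     even-numbered CPU in it.  */
  size_t setsize = CPU_ALLOC_SIZE(4096);
  cpu_set_t *set = CPU_ALLOC(4096);
  if (set != NULL)
    {
      CPU_ZERO_S(setsize, set);
      for (int cpu = 0; cpu < 4096; cpu += 2)
        CPU_SET_S(cpu, setsize, set);

      /* ... use the set, e.g., pass it to the topology functions ... */

      CPU_FREE(set);
    }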


A similar type is defined for the representation of memory nodes.
Each processor is associated with one memory node and each memory node
can have zero or more processors associated with it.

memnode_set_t        basic type

MEMNODE_SET_S(node, memnodesize, memnodeset)
                    set bit corresponding to NODE in MEMNODESET
MEMNODE_CLR_S(node, memnodesize, memnodeset)
                    clear bit corresponding to NODE in MEMNODESET
MEMNODE_ISSET_S(node, memnodesize, memnodeset)
                    check whether bit corresponding to NODE is set
MEMNODE_ZERO_S(memnodesize, memnodeset)        clear set

MEMNODE_EQUAL_S(setsize1, memnodeset1, setsize2, memnodeset2)
                                   Check whether the set bits in the two
                                   sets match.
We probably need the following:

MEMNODE_AND_S(destsize, destset, set1size, srcset1, set2size, srcset2)

   logical AND of srcset1 and srcset2; place the result in destset, which
   might be one of the source sets

MEMNODE_OR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical OR of srcset1 and srcset2; place the result in destset, which
   might be one of the source sets

MEMNODE_XOR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical XOR of srcset1 and srcset2; place the result in destset, which
   might be one of the source sets

For dynamic allocation:

__memnode_mask         type for array element in memnode_set_t

MEMNODE_ALLOC_SIZE(count)  see CPU_ALLOC_SIZE()

MEMNODE_ALLOC(count)       see CPU_ALLOC()

MEMNODE_FREE(memnodeset)   see CPU_FREE()


To determine the topology:


int NUMA_cpu_count(unsigned int *countp)

  Return in *COUNTP the number of online CPUs.  The
  sysconf(_SC_NPROCESSORS_ONLN) information might be sufficient, though.
  Returning an error can signal that the NUMA support is not present.


int NUMA_cpu_all(size_t destsize, cpu_set_t dest)

  Set bits for all (currently) available CPUs.


int NUMA_cpu_self(size_t destsize, cpu_set_t dest)

  Set bit for current processor

int NUMA_cpu_self_idx(void)

  Return index in cpu_set_t for current processor.


int NUMA_cpu_at_level(size_t destsize, cpu_set_t dest, size_t srcsize,
                      const cpu_set_t src, int level)

  Fill DEST with the bitmap which has all the bits corresponding to
  processors which are (currently) LEVEL or less levels away from any
  processor in SRC.

  In the simplest case one bit is set in SRC.  Level 1 might be used
  to find out all SMT siblings.  If more than one bit is set, the
  search is started from all of them.

  NB: this is "level", not "distance".  Since the distance could be
  relative to the access cost there need not be sequential values which
  can be used in iteration.  With this interface we can go on incrementing
  LEVEL until no further processor is found, which could be signalled
  by a special return value.
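
As an illustration, such a level-walking loop might look like the
following sketch.  It follows the calling conventions used in this
document (sets passed directly, sizes in bytes); error handling and the
exact way the end of the hierarchy is signalled are left open here.

  /* Sketch: walk outward from the current processor until increasing
     LEVEL no longer adds new processors.  */
  cpu_set_t self, prev, cur;
  CPU_ZERO(self);
  NUMA_cpu_self(sizeof(self), self);

  prev = self;
  for (int level = 1; ; ++level)
    {
      CPU_ZERO(cur);
      NUMA_cpu_at_level(sizeof(cur), cur, sizeof(self), self, level);
      if (CPU_EQUAL(cur, prev))
        break;          /* No new processors at this level.  */
      prev = cur;
    }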


int NUMA_cpu_distance(int *minp, int *maxp, size_t setsize,
                      const cpu_set_t set)

  Determine the minimum and maximum distance between the processors in SET.

  This is the distance which is a measure for the cost of sharing
  memory.

  Usually two bits are set.  If more bits are set the spread between min
  and max is useful.


int NUMA_mem_main_level(int cpuidx, int *levelp)

  Return in *LEVELP the level where the local main memory for processor
  CPUIDX is.


int NUMA_memnode_count(unsigned int *countp)

  Return in *COUNTP the number of online memnodes.


int NUMA_memnode_all(size_t destsize, memnode_set_t dest)

  Set bits for all (currently) available memnodes.


int NUMA_cpu_to_memnode(size_t cpusetsize, const cpu_set_t cpuset,
                        size_t memnodesize, memnode_set_t memnodeset)

  Set bits in MEMNODESET which correspond to memory nodes which are local
  to any of the CPUs represented by bits set in CPUSET.


int NUMA_memnode_to_cpu(size_t memnodesize, const memnode_set_t memnodeset,
                        size_t cpusetsize, cpu_set_t cpuset)

  Set bits in CPUSET which correspond to CPUs which are local
  to any of the memory nodes represented by bits set in MEMNODESET.
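
A short sketch of how these two functions might be combined, using the
conventions of this document (error handling omitted):

  /* Determine the memory nodes local to the processor the calling
     thread currently runs on.  */
  cpu_set_t me;
  memnode_set_t local;

  CPU_ZERO(me);
  NUMA_cpu_self(sizeof(me), me);

  MEMNODE_ZERO_S(sizeof(local), local);
  NUMA_cpu_to_memnode(sizeof(me), me, sizeof(local), local);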


int NUMA_mem_distance(int *minp, int *maxp, void *ptr, size_t setsize,
                      const cpu_set_t set)

  Determine the minimum and maximum level difference to the memory pointed
  to by PTR from any of the CPUs in SET.

  Usually one bit is set in SET.

  If the difference between *MINP and the value returned from
  NUMA_mem_main_level() is zero, the memory is local to at least one CPU
  in SET.  If the difference between *MAXP and the NUMA_mem_main_level()
  value is zero, the memory is local to all CPUs in SET.


int NUMA_cpu_mem_cost(int *minp, int *maxp, size_t cpusetsize,
                      const cpu_set_t cpuset, size_t memsetsize,
                      const memnode_set_t memnodeset)

  Compute minimum and maximum access costs of processors in CPUSET to
  any of the memory nodes in MEMNODESET.


Example: Determine CPUs on neighbor "nodes"

  cpu_set_t level0;
  CPU_ZERO(level0);
  CPU_SET(the_cpu, level0);

  cpu_set_t levelN;
  NUMA_cpu_at_level(sizeof(levelN), levelN, sizeof(level0), level0, N);

  cpu_set_t levelNp1;
  NUMA_cpu_at_level(sizeof(levelNp1), levelNp1, sizeof(level0), level0,
                    N + 1);

  CPU_XOR_S(sizeof(levelNp1), levelNp1, sizeof(levelNp1), levelNp1,
            sizeof(levelN), levelN);

Given a CPU index THE_CPU, the CPUs at level N are determined, then those
at level N + 1.  The difference (XOR) is the set of processors which are
exactly at level N + 1 from the given CPU.



Memory Information
------------------


It is necessary to know something about the memory at a given level.
For instance, level 1 might be "level 1 CPU cache", level 4 might be
"main memory".

int NUMA_mem_info_size(int level, int cpuidx, NUMA_size_t *size)
int NUMA_mem_info_associativity(int level, int cpuidx, NUMA_size_t *size)
int NUMA_mem_info_linesize(int level, int cpuidx, NUMA_size_t *size)

*_size applies to all kinds of memory.  _associativity and _linesize
mainly apply to caches.  Maybe they are useful for main memory, too.  If
not, an error could be returned.


int NUMA_mem_total(int memnodeidx, NUMA_size_t *size)
int NUMA_mem_avail(int memnodeidx, NUMA_size_t *size)

The total memory and available memory on memory node MEMNODEIDX.


Placement/Affinity Interfaces
-----------------------------

int NUMA_mem_set_home(pid_t pid, size_t setsize, memnode_set_t set)

  Install SET as the mask of preferred nodes for memory allocation for
  process PID.  This applies only to directly attached memory
  (NUMA_mem_main_level()).  If more than one bit is set in SET the memory
  allocation can be spread across all the local memory for the CPUs in
  the set.

int NUMA_mem_get_home(pid_t pid, size_t setsize, memnode_set_t set)

  Return the currently installed preferred node set.


int NUMA_mem_set_home_thread(pthread_t th, size_t setsize, memnode_set_t set)

  Similar, but limited to the given thread.

int NUMA_mem_get_home_thread(pthread_t th, size_t setsize, memnode_set_t set)

  Likewise to retrieve the information.


int NUMA_aff_set_cpu(pid_t pid, size_t setsize, cpu_set_t set, int hard)

  Set the affinity mask for process PID to the processors in SET.  There
  are two masks: the hard one and the soft one.  No processor not in the
  hard mask can ever be used.  The soft mask is a recommendation.

int NUMA_aff_get_cpu(pid_t pid, size_t setsize, cpu_set_t set, int hard)

  The corresponding interface to get the data.

int NUMA_aff_set_cpu_thread(pthread_t th, size_t setsize, cpu_set_t set,
                            int hard)

  Similar to NUMA_aff_set_cpu() but for the given thread.

int NUMA_aff_get_cpu_thread(pthread_t th, size_t setsize, cpu_set_t set,
                            int hard)

  Get the data.


The "hard" variants are basically the existing sched_setaffinity and
pthread_setaffinity.  The soft and hard masks are maintained separately.
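
A sketch of how the two masks could be used together.  The 0/1 encoding
of the HARD argument and the calling conventions are assumptions; error
handling is omitted.

  /* Restrict the process hard to the CPUs of its current memory node,
     but express a soft preference for the processor it is running on.  */
  cpu_set_t self, node_cpus;
  memnode_set_t node;

  CPU_ZERO(self);
  NUMA_cpu_self(sizeof(self), self);

  MEMNODE_ZERO_S(sizeof(node), node);
  NUMA_cpu_to_memnode(sizeof(self), self, sizeof(node), node);

  CPU_ZERO(node_cpus);
  NUMA_memnode_to_cpu(sizeof(node), node, sizeof(node_cpus), node_cpus);

  NUMA_aff_set_cpu(getpid(), sizeof(node_cpus), node_cpus, 1);  /* hard */
  NUMA_aff_set_cpu(getpid(), sizeof(self), self, 0);            /* soft */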


void *NUMA_mem_alloc_local(NUMA_size_t size, int spill, int interleave)

  Allocate SIZE bytes local to the current process, regardless of the
  registered preferred memory node mask.  Unless SPILL is nonzero the
  allocation fails if no memory is available locally.  If SPILL is nonzero
  memory at greater distances is considered.  This is a convenience
  interface; it could be implemented using NUMA_mem_alloc() below.

  Possible extension: SPILL could specify how far away the memory can
  be spilled.  For instance, the value 1 could mean one NUMA node away,
  2 up to two NUMA nodes away, etc.

  If INTERLEAVE is nonzero the memory is allocated in interleaved form
  from all the nodes specified.  Otherwise all memory comes from one node.


void *NUMA_mem_alloc_preferred(NUMA_size_t size, int spill, int interleave)

  Allocate memory according to the mask registered with NUMA_mem_set_home
  or NUMA_mem_set_home_thread.  SPILL and INTERLEAVE are handled as in
  NUMA_mem_alloc_local.


void *NUMA_mem_alloc(NUMA_size_t size, size_t setsize, memnode_set_t set,
                     int spill, int interleave)

  Allocate memory on any of the nodes in SET


What chunks of memory can be allocated is debatable.  It might make sense
to restrict all sizes to page size granularity.  Or at least to round all
values up.

??? Should the granularity be configurable ???


void NUMA_mem_free(void *)

  Obviously, free the memory.
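
To show how the allocation functions are meant to fit together, here is
a sketch.  The sizes and node numbers are made up and the SPILL and
INTERLEAVE encodings are as described above; error handling is omitted.

  /* Allocate a 1 MB scratch buffer locally, falling back to more
     distant memory if the local node is full, and a large shared table
     interleaved across a given set of memory nodes.  */
  void *scratch = NUMA_mem_alloc_local(1024 * 1024, 1 /* spill */,
                                       0 /* no interleave */);

  memnode_set_t nodes;
  MEMNODE_ZERO_S(sizeof(nodes), nodes);
  MEMNODE_SET_S(0, sizeof(nodes), nodes);
  MEMNODE_SET_S(1, sizeof(nodes), nodes);
  void *table = NUMA_mem_alloc(32 * 1024 * 1024, sizeof(nodes), nodes,
                               0 /* no spill */, 1 /* interleave */);

  /* ... use the memory ... */

  NUMA_mem_free(table);
  NUMA_mem_free(scratch);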


int NUMA_mem_get_nodes(void *addr, size_t destsize, memnode_set_t dest)

  The function will set the bits in DEST which represent the memory nodes
  which are local to the memory pointed to by ADDR.


int NUMA_mem_bind(void *addr, size_t size, size_t setsize, memnode_set_t set,
                  int spill)

  The memory in the range [addr,addr+size) in the current process is bound
  to one of the nodes represented in SET.  Unless SPILL is nonzero the
  call will fail if no memory is available on the nodes.


int NUMA_mem_get_nodes(void *addr, size_t size,
                       size_t destsize, memnode_set_t dest)

  The function returns information about the nodes on which the memory
  in the range [addr,addr+size) is allocated.  If the memory is not
  contiguously allocated (or in the case of multi-threaded or multi-core
  processors) this can mean more than one bit is set in the result set.
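
A sketch of binding an existing memory range and checking where it ended
up.  BUF and BUFLEN stand for some already existing mapping; the calling
conventions are assumed as elsewhere in this document and error handling
is omitted.

  /* Bind BUF to the memory nodes local to the calling thread, spilling
     to other nodes if necessary, then ask where the pages really are.  */
  cpu_set_t me;
  memnode_set_t local, where;

  CPU_ZERO(me);
  NUMA_cpu_self(sizeof(me), me);
  MEMNODE_ZERO_S(sizeof(local), local);
  NUMA_cpu_to_memnode(sizeof(me), me, sizeof(local), local);

  NUMA_mem_bind(buf, buflen, sizeof(local), local, 1 /* spill */);

  MEMNODE_ZERO_S(sizeof(where), where);
  NUMA_mem_get_nodes(buf, buflen, sizeof(where), where);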


Realignment
-----------

CPU sets can be realigned at any time using NUMA_aff_set_cpu() etc.


void *NUMA_mem_relocate(void *ptr, size_t setsize, memnode_set_t set)

  Relocate the content of the memory pointed to by PTR to a node in SET.
  Return the new address.


Temporal Changes
----------------

The machine configuration can change over time.  New processors can
come online, others go offline, memory banks are switched on or off.
The above interfaces return information about the currently active
configuration.  There is the danger that data sets from different
configurations are used together.

One solution would be to require an open()-like function which
retrieves all the information in one step; all the interfaces
mentioned above would then use that cached data.  The problem with this
is that, if the configuration changes, decisions made using the cached
data are outdated.  Second, the amount of data which is needed can be
large or, more likely, expensive to obtain even though only part of the
information is used.

A different possibility would be to provide a simple function which
returns a unique ID for each configuration.  Any use of the topology
interfaces would then start and end with a call to this function to get
the ID.  If the two values differ, the collected data is inconsistent.
This would eliminate the second problem mentioned above, but not the
first.
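
A sketch of this second possibility.  NUMA_config_id() is a hypothetical
name, not part of the proposal above; it stands for whatever interface
ends up returning the configuration ID.

  unsigned long id_before, id_after;
  cpu_set_t all;

  do
    {
      id_before = NUMA_config_id();      /* hypothetical */
      CPU_ZERO(all);
      NUMA_cpu_all(sizeof(all), all);
      /* ... read whatever other topology data is needed ... */
      id_after = NUMA_config_id();       /* hypothetical */
    }
  while (id_before != id_after);  /* retry if the configuration changed */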

A third possibility is to register a signal handler with the kernel so
that the kernel can send a signal whenever the configuration changes.
Alternatively, a /proc file or netlink socket could be used to signal
interested parties (who then could send a signal if necessary).  Using
d-bus is possible, too.  This notification could not only be used to
notice changes while reading the topology, it could also prompt a process
at any time to reconsider the current decisions and reorganize the
processor/memory usage.

Of these possibilities the d-bus route seems to be the most
appealing since d-bus already receives this kind of information from
the kernel and any number of processes can receive it.


Comparison with libnuma
=======================

nodemask_t:

  Unlike nodemask_t, cpu_set_t is already in use in glibc.  The affinity
  interfaces use it so there is no need to duplicate the functionality
  and no need to define other versions of the affinity interfaces.

  Furthermore, the nodemask_t type is of fixed size.  The cpu_set_t type
  has a fixed-size convenience version but can also be of arbitrary
  size.  This is important, as a bit of math shows:

    Assume a processor with four cores and four threads each.

    Four such processors sit on a single NUMA node.

    That's a total of 64 virtual processors for one node.  With 16 such
    nodes the 1024 processors of cpu_set_t would be filled.  And we do
    not want to mention the total of 64 supported processors in libnuma's
    nodemask_t.  To be future-safe the bitset size must be variable.


  In addition there is the type memnode_set_t which represents memory
  nodes.  It is possible to have memory nodes without processors, so a
  cpu_set_t alone is not sufficient.


nodemask_zero()  -->  CPU_ZERO() which is already in glibc
nodemask_set()   -->  CPU_SET() ditto
nodemask_clr()   -->  CPU_CLR() ditto
nodemask_isset() -->  CPU_ISSET() ditto

nodemask_equal() -->  CPU_EQUAL()

Plus the appropriate macros to handle memnode_set_t.


numa_available() -->  NUMA_cpu_count()  for instance

numa_max_node()  -->  either NUMA_cpu_count()
                      or NUMA_cpu_all()

numa_homenode()  -->  NUMA_mem_get_home() or NUMA_aff_get_cpu()
                      or NUMA_aff_get_cpu_thread() or NUMA_cpu_self()

  The concept of a never-changing home node strikes me as odd, especially
  with hot-swap CPUs.  Declaring one or more CPUs the home nodes is fine.
  The default can be the CPU the thread started on.


numa_node_size() --> NUMA_mem_avail()

  The main memory is at level NUMA_mem_main_level()

numa_pagesize()  -->  nothing yet since useless

  It is not clear to me what this really should do.  I.e., the interface
  of numa_pagesize() seems useless.  With no argument, the page size which
  can be determined is the page size of the system.  When hugepages etc.
  come into play it is necessary to provide a pointer to a memory address
  so it can be determined which kind of memory it is.

??? Should we add NUMA_size_t NUMA_pagesize(void *addr) ???


numa_all_nodes --> global variables are *EVIL*

  Use NUMA_cpu_all()

numa_no_nodes  --> global variables are *EVIL*

  cpu_set_t s;
  CPU_ZERO(s);

numa_bind()    --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()
                   or NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()

  numa_bind() misses A LOT of flexibility.  First, memory and CPU bindings
  are forced to use the same nodes.  Second, thread handling is missing.
  Third, hard versus soft requirements are not handled for CPU usage.


numa_set_interleave_mask() --> see comment
numa_get_interleave_mask() --> see comment
numa_get_interleave_node() --> see comment
numa_alloc_interleaved_subset() --> see comment
numa_alloc_interleaved() --> see comment
numa_interleave_memory() --> see comment

  I do not think that interleaving should be a completely separate mechanism
  next to normal memory allocation.  Instead it is a logical extension of
  memory allocation.  Interleaving is a parameter for the memory allocation
  functions like NUMA_mem_alloc().


numa_set_homenode()  -->  NUMA_mem_set_home() or NUMA_aff_set_cpu()
                          or NUMA_aff_set_cpu_thread() or NUMA_cpu_self()

numa_set_localalloc() --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()


numa_set_membind()   --> NUMA_mem_bind()

numa_get_membind()   --> NUMA_mem_get_nodes()


numa_alloc_onnode()  -->  NUMA_mem_alloc()

numa_alloc_local()   -->  NUMA_mem_alloc_local()

numa_alloc()         -->  NUMA_mem_alloc_preferred()

numa_free()          -->  NUMA_mem_free()

numa_tonode_memory() -->  NUMA_mem_relocate()

numa_setlocal_memory() -->  NUMA_mem_relocate()

numa_police_memory()  -->  nothing yet

  I don't see why this is necessary.  Yes, address space allocation and
  the actual allocation of memory are two steps.  But this should be
  taken care of by the allocation functions (if necessary).  To support
  memory allocation with other interfaces than those described here and
  magically treat them in the "NUMA way" seems dumb.


numa_run_on_node_mask() --> NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()

numa_run_on_node() --> NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()


numa_set_bind_policy() --> too coarse grained

  This cannot be a process property.  And it must be possible to change
  it from another thread, so the interface is completely broken.  Besides,
  it seems much more useful to differentiate between hard and soft masks
  since this allows spilling over to other nodes if necessary.  The
  NUMA_aff_set_cpu() and NUMA_aff_set_cpu_thread() interfaces allow
  specifying two masks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-04-30  7:35 NUMA API Ulrich Drepper
@ 2004-04-30  8:30 ` William Lee Irwin III
  2004-05-03 18:37   ` Ulrich Drepper
  2004-04-30  8:49 ` Paul Jackson
  2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
  2 siblings, 1 reply; 12+ messages in thread
From: William Lee Irwin III @ 2004-04-30  8:30 UTC (permalink / raw
  To: Ulrich Drepper; +Cc: Linux Kernel

On Fri, Apr 30, 2004 at 12:35:26AM -0700, Ulrich Drepper wrote:
> In the last weeks I have been working on designing a new API for a NUMA
> support library.  I am aware of the code in libnuma by ak but this code
> has many shortcomings:
> ~ inadequate topology discovery
> ~ fixed cpu set size
> ~ no clear separation of memory nodes
> ~ no inclusion of SMT/multicore in the cpu hierarchy
> ~ awkward (at best) memory allocation interface
> ~ etc etc
> and last but not least
> ~ a completely unacceptable library interface (e.g., global variables as
> part of the API, WTF?)

Regardless of issues addressed, Andi's been working with everyone for
something on the order of 12+ months and this is out of the blue. I very
very strongly suggest that you take up each of these issues with him so
that they can be addressed as individual incremental improvements to the
API everyone's been working with for all that time as opposed to screwing
the world (esp. now that commodity NUMA boxen are becoming more prevalent)
with transparent and deliberate distro-competition motivated API skew.

-- wli

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-04-30  7:35 NUMA API Ulrich Drepper
  2004-04-30  8:30 ` William Lee Irwin III
@ 2004-04-30  8:49 ` Paul Jackson
  2004-04-30  9:50   ` William Lee Irwin III
  2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
  2 siblings, 1 reply; 12+ messages in thread
From: Paul Jackson @ 2004-04-30  8:49 UTC (permalink / raw
  To: Ulrich Drepper; +Cc: linux-kernel

> Please direct comments to me.  In case there is interest I can set up a
> separate mailing list since lkml is probably not the best venue.

Thanks for posting this.

If not the kernel mailing list, then could you specify some other
existing public list?  Sometimes useful feedback comes from the
interaction of several people responding, which is less likely to
happen if it is all funneled through one person.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-04-30  8:49 ` Paul Jackson
@ 2004-04-30  9:50   ` William Lee Irwin III
  0 siblings, 0 replies; 12+ messages in thread
From: William Lee Irwin III @ 2004-04-30  9:50 UTC (permalink / raw
  To: Paul Jackson; +Cc: Ulrich Drepper, linux-kernel

At some point in the past, Uli wrote:
>> Please direct comments to me.  In case there is interest I can set up a
>> separate mailing list since lkml is probably not the best venue.

On Fri, Apr 30, 2004 at 01:49:33AM -0700, Paul Jackson wrote:
> Thanks for posting this.
> If not the kernel mailing list, then could you specify some other
> existing public list?  Sometimes useful feedback comes from the
> interaction of several people responding, which is less likely to
> happen if it is all funneled through one person.

The real problem with this is that Andi's been maintaining and bugfixing
and handling feature requests for people who are actually going to
depend on this stuff for a rather long time, and the total rewrite here
throws away all that work done for those who rely on the stuff, in
addition to creating a brand new vendor skew problem from scratch.

Uli likely has legitimate points in need of addressing. From-scratch
rewrites are not the proper way to address them, especially not when
such a very strong precedent and various 3rd-parties' reliance on the
preexisting API's and codebases are established.

The proper method for addressing these issues is to incrementally
improve Andi's codebase, fixing the bugs and/or limitations discussed
(e.g. hotplugging vs. NUMA API issues). What Uli has expressed is not a
sound basis for a ground-up, from-scratch API implementation. The issues
Uli wants to address are bugfixes and extensions, and should be
required to go through the same procedures and review as such, and these
in turn require working with the preexisting codebase, not wild from-
scratch rewrites of the known universe. Especially not with the
extremely transparent ulterior motives for incompatible API's proposed
on the day of SuSE's freeze.

I'm all in favor of the best. As the deficiencies are pointed out, I
won't rest until these are fixed and the implementation is the best.
But this proposed API divergence is not how it should be made so. Being
the best means having a coherent story, and bickering and contrived
incompatibilities between distros is not how coherent stories and
customer satisfaction happen.


-- wli

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
       [not found] <1QAMU-4gf-15@gated-at.bofh.it>
@ 2004-04-30 20:01 ` Andi Kleen
  2004-05-01  5:15   ` Martin J. Bligh
  2004-05-03 18:34   ` Ulrich Drepper
  2004-04-30 20:39 ` Andi Kleen
  1 sibling, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2004-04-30 20:01 UTC (permalink / raw
  To: Ulrich Drepper; +Cc: linux-kernel

Ulrich Drepper <drepper@redhat.com> writes:

>In the last weeks I have been working on designing a new API for a NUMA
>support library.  I am aware of the code in libnuma by ak but this code
>has many shortcomings:

> ~ a completely unacceptable library interface (e.g., global variables as
> part of the API, WTF?)

You mean numa_no_nodes et.al. ? 

This is essentially static data that never changes (like in6addr_any).
numa_all_nodes could maybe in future change with node hotplug support,
but even then it will be a global property.

Everything else is thread local.

> ~ inadequate topology discovery

I believe it is good enough for current machines, at least 
until there is enough experience to really figure out what
node discovery is needed.  I have seen some proposals
for complex graph based descriptions, but so far I have seen
nothing that could really take advantage of something so fancy.
If it should be really needed it can be added later.

IMHO we just do not know enough right now to design a good topology
discovery interface. Until that is fixed it is best to err on the
side of simplicity.

> ~ fixed cpu set size

That is wrong. The latest version does not have fixed cpu set size.

> ~ no inclusion of SMT/multicore in the cpu hierarchy

Not sure why you would care about that. NUMA is only about "what CPUs
belong to which memory block". While multicore can affect the number
of CPUs in a node the actually shared packages only have cache
effects, but not "sticky memory" effects.  To handle cache effects all
you need to do is to change scheduling, not NUMA policy. Supporting
cache policy in the NUMA policy would result in a quite complex
optimization problem on how to tune the scheduler. But the whole point
why at least I started libnuma initially was to avoid this complex
problem, and just use simple hints. For this reason putting cache
policy into the memory policy is imho quite misguided.

> As specified, the implementation of the interface is designed with only
> the requirements of a program on NUMA hardware in mind.  I have paid no
> attention to the currently proposed kernel extensions.  If the latter do
> not really allow implementing the functionality programmers need then it
> is wasted efforts.

Well, I spent a lot of time talking to various users, and IMHO
it matches the needs of a lot of them. I did not add all the features
everybody wanted, but that was simply not possible while still coming
up with a reasonable design.

> For instance, I think the way memory allocated in interleaved fashion is
> not "ideal".  Interleaved allocation is a property of a specific
> allocation.  Global states for processes (or threads) are a terrible way
> to handle this and other properties since it requires the programmer to
> constantly switch the mode back and forth since any part of the runtime
> might be NUMA aware and reset the mode.

If you do not want per process state just use the allocation functions
in libnuma instead. They use mbind() and have no per thread state,
only per VMA state.

The per process state is needed for numactl though.

I kept the support for this visible in libnuma to make it easier to convert
old code to this (just wrap some code with a policy).  For programs designed
from scratch it is probably better to use the allocation functions
with mbind directly.
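
A minimal sketch of that path (the size and node number are made up,
error handling omitted):

	#include <numa.h>

	/* Allocate 16 MB directly on node 1; the policy lives in the VMA
	   set up by mbind(), not in any process or thread wide state.  */
	void *p = NULL;
	if (numa_available() >= 0)
		p = numa_alloc_onnode(16UL << 20, 1);
	/* ... use the memory ... */
	if (p != NULL)
		numa_free(p, 16UL << 20);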

> Also, the concept of hard/soft sets for CPUs is useful.  Likewise
> "spilling" over to other memory nodes.  Usually using NUMA means hinting
> the desired configuration to the system.  It'll be used whenever
> possible.  If it is not possible (for instance, if a given processor is
> not available) it is mostly no good idea to completely fail the

Agreed. That is why preferred and bind are different policies
and you can switch between them in libnuma.

> execution.  Instead a less optimal resource should be used.  For memory
> it is hard to know how much memory on which node is in use etc.

numa_node_size()

> Another missing feature in libnuma and the current kernel design is
> support for changes in the configuration.  CPUs might be added or
> removed, likewise memory.  Additional interconnects between NUMA blocks
> might be added etc.

It is version 1.0. So far all the CPU hotplug code seems to be
still too much in flux to really do something good about it. I expect
once all this settles down libnuma will also grow some support
for dynamic reconfiguration.

Comments on some (not all of your criticisms):

>
> Comparison with libnuma
> =======================
>
> nodemask_t:
>
>   Unlike nodemask_t, cpu_set_t is already in use in glibc.  The affinity
>   interfaces use it so there is not need to duplicate the functionality
>   and no need to define other versions of the affinity interfaces.
>
>   Furthermore, the nodemask_t type is of fixed size.  The cpu_set_t
>   has a convenience version which is of fixed size but can be of
>   arbitrary size.  This is important has a bit of math shows:
>
>     Assume a processor with four cores and 4 threads each
>
>     Four such processors on a single NUMA node
>
>     That's a total of 64 virtual processors for one node.  With 32 such
>     nodes the 1024 processors of cpu_set_t would be filled.  And we do
>     not want to mention the total of 64 supported processors in libnuma's
>     nodemask_t.  To be future safe the bitset size must be variable.

nodemask_t has nothing to do with virtual CPUs, only with nodes
(= memory controllers) 

There is no fixed size in the current version for CPUs.
There was in some earlier version, but I quickly dropped that because
it was indeed a bad idea.

There is a fixed size nodemask type though, although its upper limit
is extremely high (4096 nodes on IA64).  I traded this limit
for simplicity of use.


> numa_bind()    --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()
>                   or NUMA_aff_set_cpu(() or NUMA_aff_set_cpu_thread()
>
>  numa_bind() misses A LOT of flexibility.  First, memory and CPU need
>  node be the same nodes. Second, thread handling is missing.  Third,
>  hard versus soft requirements are not handled for CPU usage.

Correct. That is why lower level functions exist too. numa_bind is
merely a comfortable high level utility function to make libnuma more
pleasant to use for many (but not all) users. It trades some
flexibility to cater to the common case.

> numa_police_memory()  -->  nothing yet
> 
 > I don't see why this is necessary.  Yes, address space allocation and
 > the actual allocation of memory are two steps.  But this should be
 > taken case of by the allocation functions (if necessary).  To support
 > memory allocation with other interfaces then those described here and
 > magically treat them in the "NUMA-way" seems dumb.

You need the process policy to support policy given on the command line
(numactl). To make converting old programs easier I opted to expose it
in libnuma too. For new programs I agree it is better to just use the
allocator functions.

> numa_set_bind_policy() --> too coarse grained
>
>  This cannot be a process property.  And it must be possible to change

It is a per thread property.

-Andi


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
       [not found] <1QAMU-4gf-15@gated-at.bofh.it>
  2004-04-30 20:01 ` NUMA API Andi Kleen
@ 2004-04-30 20:39 ` Andi Kleen
  1 sibling, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2004-04-30 20:39 UTC (permalink / raw
  To: Ulrich Drepper; +Cc: linux-kernel

Ulrich Drepper <drepper@redhat.com> writes:

[my apologies if this turns up twice. I have had some problems with
the mailer]

>In the last weeks I have been working on designing a new API for a NUMA
>support library.  I am aware of the code in libnuma by ak but this code
>has many shortcomings:

> ~ a completely unacceptable library interface (e.g., global variables as
> part of the API, WTF?)

You mean numa_no_nodes et.al. ? 

This is essentially static data that never changes (like in6addr_any).
numa_all_nodes could maybe in future change with node hotplug support,
but even then it will be a global property.

Everything else is thread local.

> ~ inadequate topology discovery

I believe it is good enough for current machines, at least 
until there is enough experience to really figure out what
node discovery is needed.  I have seen some proposals
for complex graph based descriptions, but so far I have seen
nothing that could really take advantage of something so fancy.
If it should be really needed it can be added later.

IMHO we just do not know enough right now to design a good topology
discovery interface. Until that is fixed it is best to err on the
side of simplicity.

> ~ fixed cpu set size

That is wrong. The latest version does not have fixed cpu set size.

> ~ no inclusion of SMT/multicore in the cpu hierarchy

Not sure why you would care about that. NUMA is only about "what CPUs
belong to which memory block". While multicore can affect the number
of CPUs in a node the actually shared packages only have cache
effects, but not "sticky memory" effects.  To handle cache effects all
you need to do is to change scheduling, not NUMA policy. Supporting
cache policy in the NUMA policy would result in a quite complex
optimization problem on how to tune the scheduler. But the whole point
why at least I started libnuma initially was to avoid this complex
problem, and just use simple hints. For this reason putting cache
policy into the memory policy is imho quite misguided.

> As specified, the implementation of the interface is designed with only
> the requirements of a program on NUMA hardware in mind.  I have paid no
> attention to the currently proposed kernel extensions.  If the latter do
> not really allow implementing the functionality programmers need then it
> is wasted efforts.

Well, I spent a lot of time talking to various users, and IMHO
it matches the needs of a lot of them. I did not add all the features
everybody wanted, but that was simply not possible while still coming
up with a reasonable design.

> For instance, I think the way memory allocated in interleaved fashion is
> not "ideal".  Interleaved allocation is a property of a specific
> allocation.  Global states for processes (or threads) are a terrible way
> to handle this and other properties since it requires the programmer to
> constantly switch the mode back and forth since any part of the runtime
> might be NUMA aware and reset the mode.

If you do not want per process state just use the allocation functions
in libnuma instead. They use mbind() and have no per thread state,
only per VMA state.

The per process state is needed for numactl though.

I kept the support for this visible in libnuma to make it easier to convert
old code to this (just wrap some code with a policy).  For programs designed
from scratch it is probably better to use the allocation functions
with mbind directly.

> Also, the concept of hard/soft sets for CPUs is useful.  Likewise
> "spilling" over to other memory nodes.  Usually using NUMA means hinting
> the desired configuration to the system.  It'll be used whenever
> possible.  If it is not possible (for instance, if a given processor is
> not available) it is mostly no good idea to completely fail the

Agreed. That is why preferred and bind are different policies
and you can switch between them in libnuma.

> execution.  Instead a less optimal resource should be used.  For memory
> it is hard to know how much memory on which node is in use etc.

numa_node_size()

> Another missing feature in libnuma and the current kernel design is
> support for changes in the configuration.  CPUs might be added or
> removed, likewise memory.  Additional interconnects between NUMA blocks
> might be added etc.

It is version 1.0. So far all the CPU hotplug code seems to be
still too much in flux to really do something good about it. I expect
once all this settles down libnuma will also grow some support
for dynamic reconfiguration.

Comments on some (not all of your criticisms):

>
> Comparison with libnuma
> =======================
>
> nodemask_t:
>
>   Unlike nodemask_t, cpu_set_t is already in use in glibc.  The affinity
>   interfaces use it so there is not need to duplicate the functionality
>   and no need to define other versions of the affinity interfaces.
>
>   Furthermore, the nodemask_t type is of fixed size.  The cpu_set_t
>   has a convenience version which is of fixed size but can be of
>   arbitrary size.  This is important has a bit of math shows:
>
>     Assume a processor with four cores and 4 threads each
>
>     Four such processors on a single NUMA node
>
>     That's a total of 64 virtual processors for one node.  With 32 such
>     nodes the 1024 processors of cpu_set_t would be filled.  And we do
>     not want to mention the total of 64 supported processors in libnuma's
>     nodemask_t.  To be future safe the bitset size must be variable.

nodemask_t has nothing to do with virtual CPUs, only with nodes
(= memory controllers) 

There is no fixed size in the current version for CPUs.
There was in some earlier version, but I quickly dropped that because
it was indeed a bad idea.

There is a fixed size nodemask type though, although its upper limit
is extremely high (4096 nodes on IA64).  I traded this limit
for simplicity of use.


> numa_bind()    --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()
>                   or NUMA_aff_set_cpu(() or NUMA_aff_set_cpu_thread()
>
>  numa_bind() misses A LOT of flexibility.  First, memory and CPU need
>  node be the same nodes. Second, thread handling is missing.  Third,
>  hard versus soft requirements are not handled for CPU usage.

Correct. That is why lower level functions exist too. numa_bind is
merely a comfortable high level utility function to make libnuma more
pleasant to use for many (but not all) users. It trades some
flexibility to cater to the common case.

> numa_police_memory()  -->  nothing yet
> 
 > I don't see why this is necessary.  Yes, address space allocation and
 > the actual allocation of memory are two steps.  But this should be
 > taken case of by the allocation functions (if necessary).  To support
 > memory allocation with other interfaces then those described here and
 > magically treat them in the "NUMA-way" seems dumb.

You need the process policy to support policy given on the command line
(numactl). To make converting old programs easier I opted to expose it
in libnuma too. For new programs I agree it is better to just use the
allocator functions.

> numa_set_bind_policy() --> too coarse grained
>
>  This cannot be a process property.  And it must be possible to change

It is a per thread property.

-Andi


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-04-30 20:01 ` NUMA API Andi Kleen
@ 2004-05-01  5:15   ` Martin J. Bligh
  2004-05-03 18:34   ` Ulrich Drepper
  1 sibling, 0 replies; 12+ messages in thread
From: Martin J. Bligh @ 2004-05-01  5:15 UTC (permalink / raw
  To: Andi Kleen, Ulrich Drepper; +Cc: linux-kernel

>> As specified, the implementation of the interface is designed with only
>> the requirements of a program on NUMA hardware in mind.  I have paid no
>> attention to the currently proposed kernel extensions.  If the latter do
>> not really allow implementing the functionality programmers need then it
>> is wasted efforts.
> 
> Well, I spent a lot of time talking to various users; and IMHO
> it matches the needs of a lot of them. 

As have I, and the rest of IBM ... and what Andi has done (and the design
was discussed extensively with other people during the process) fulfills
the needs that we see out there.

> I did not add all the features everybody wanted, but that was simply 
> not possible and still comming up with a reasonable design.

Exactly ... this needs to be simple.

M.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API - wish list
  2004-04-30  7:35 NUMA API Ulrich Drepper
  2004-04-30  8:30 ` William Lee Irwin III
  2004-04-30  8:49 ` Paul Jackson
@ 2004-05-03 12:48 ` Zoltan Menyhart
  2004-05-03 17:57   ` Paul Jackson
  2 siblings, 1 reply; 12+ messages in thread
From: Zoltan Menyhart @ 2004-05-03 12:48 UTC (permalink / raw
  To: Ulrich Drepper, linux-kernel

Can you remember back to the "good old days" when there were no open(),
read(), lseek(), write(), mmap(), etc., and one had to tell explicitly
(with job control punched cards) that s/he needed the sectors 123... 145 on
the disk on channel 6, unit 7?
Or, somewhat more recently, when one had to manage the memory and the
overlays by hand.
Now we are going to manage (from applications) the topology and the CPU or
memory binding.  Moreover, we are going to have the applications resolve
resource management / dependency problems / conflicts among them...

The operating systems should provide for abstractions of the actual
HW platform: file system, virtual memory, shared CPUs, etc.

Why should an application care for the actual physical characteristics ?
Including counting nanoseconds of some HW resource access time ? We'll
end up with some completely un-portable applications.

I think an application should describe what it needs for its optimal run,
e.g.:
	- I need 3 * N (where N = 1, 2, 3,...) CPUs "very close"
	  together and 2.5 Gbytes / N real memory (working set size) for
	  each CPUs "very very close to" their respective CPUs
	- Should not it fit into a "domain", the CPUs have to be
	  "very very close" to each other 3 by 3
	- If no resources for even N == 1, do not start it at all
	- Use "gang scheduling" for them, otherwise I'll busy wait :-)
	- In addition, I need M CPUs + X Gbytes of memory
	  "where my previous group is" and I need a disk I/O path of
	  the capacity of 200 Mbytes / sec "more or less close to" my
	  memory
	- I need "some more" CPUs "somewhere" with some 100 Mbytes of
	  memory "preferably close to" the CPUs and 10 Mbytes / sec
	  TCP/IP bandwidth "close to" my memory 

	- I need 70 % of the CPU time on my CPUs (the scheduler can
	  select others for the 30 % of the time left)

	- O.K. should my request be too much, here is my minimal,
	  "degraded" configuration:...

The OS reserves the resources for the application (exec time assignment)
and reports to the application which of its needs have been granted.

When the application allocates some memory, it'll say: you know, this
is for the memory pool I've described in the 5th criterion.
When it creates threads, it'll say they are in the 2nd group of threads
mentioned in the 1st line.

The work load manager / load balancer can negotiate other resource
assignment at any time with the application.
The work load manager / load balancer is free to move a collection of
resources from some NUMA domains to others, provided the application's
requirements are still met. (No hard binding.)

Billing is done accordingly :-)

As you do not need to know anything about SCSI LUNs, sector IDs, physical
memory maps or the other applications when you compile your kernel,
why should an application care for HW NUMA details?

Thanks,

Zoltán Menyhárt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API - wish list
  2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
@ 2004-05-03 17:57   ` Paul Jackson
  0 siblings, 0 replies; 12+ messages in thread
From: Paul Jackson @ 2004-05-03 17:57 UTC (permalink / raw
  To: Zoltan.Menyhart; +Cc: drepper, linux-kernel

> The operating systems should provide for abstractions of the actual ...

True ... so long as you don't confuse "operating system" with "kernel".

Most of what you describe can and should be in user space, as what I
call "system software", constructed of libraries, daemons, utilities
and specific language support.

Having the kernel support the abstraction of "file", to hide details of
sectors, channels and devices has been a great success.  But the kernel
doesn't need to support every such abstraction, such as in this case
"abstract computers" with certain amounts of compute, memory and i/o
resources.

Rather the kernel only needs to provide the essential primitives, such
as cpu and memory placement, jobs (as related set of tasks), and access
to primitive topology and hardware attributes.

(Your spam encoded from address "Zoltan.Menyhart_AT_bull.net@nospam.org"
is a minor annoyance ...).

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-04-30 20:01 ` NUMA API Andi Kleen
  2004-05-01  5:15   ` Martin J. Bligh
@ 2004-05-03 18:34   ` Ulrich Drepper
  1 sibling, 0 replies; 12+ messages in thread
From: Ulrich Drepper @ 2004-05-03 18:34 UTC (permalink / raw
  To: Andi Kleen; +Cc: linux-kernel

Andi Kleen wrote:

> You mean numa_no_nodes et.al. ? 
> 
> This is essentially static data that never changes (like in6addr_any).
> numa_all_nodes could maybe in future change with node hotplug support,
> but even then it will be a global property.

And you don't see a problem with this?  You are hardcoding variables of
a certain size and layout.  This is against every good design principle.

There are other problems, like not using a protected namespace.


> Everything else is thread local.

This, too, is not adequate.  It requires working with the 1-on-1 threading
model.  Using user-level contexts (setcontext etc.) is made very hard and
expensive.  There is not one state to change.  Using any of the code (or
functionality using the state) in signal handlers, possibly recursively,
will throw things into disorder.

There is no reason not to make the state explicit.


> I believe it is good enough for current machines, at least 
> until there is enough experience to really figure out what
> node discovery is needed..

That's the point.  We cannot start using an inadequate API now since one
will _never_ be able to get rid of it again.  We have accumulated
several examples of this in the past years.

The API design should be general enough to work for all the
architectures which are currently envisioned and must be extensible
for future architectures.  Your API does not allow writing
adequate code even on some/many of today's architectures.



> I have seen some proposals
> for complex graph based descriptions, but so far I have seen
> nothing that could really take advantage of something so fancy.

I am proposing no graph-based descriptions.  And this is something where
you miss the point.  This is only the lowest-level interface.  It
provides enough functionality to describe the machine architecture.  If
some fancy alternative representations are needed, this is something for
a higher-level interface.

What must be avoided at all costs is programs peeking into /sys and
/proc to determine the topology.  First of all, this makes programs
architecture- and even machine-specific.  Second, the /sys and /proc
format will change over time.  All these accesses must be hidden and the
NUMA library is just the place for that.

Saying

> If it should be really needed it can be added later.

just means we will get programs today which hardcode today's existing
and most probably inadequate representation of the topology in /sys and
/proc.


>>~ no inclusion of SMT/multicore in the cpu hierarchy
> 
> 
> Not sure why you would care about that.

These are two sides of the same coin.  Today we already have problems
with programs running on machines with SMT processors.  How can those
use pthread_setaffinity() to create the optimal number of threads and
place them accordingly?  It requires magic /proc parsing for each and
every architecture.  The problem is exactly the same as with NUMA, and
the interface extensions to cover MC/SMT as well are minimal.


> Well, I spent a lot of time talking to various users; and IMHO
> it matches the needs of a lot of them. I did not add all the features
> everybody wanted, but that was simply not possible and still comming
> up with a reasonable design.

And this means it should not be done?


> The per process state is needed for numactl though.
> 
> I kept the support for this visible in libnuma to make it easier to convert
> old code to this (just wrap some code with a policy) For designed from 
> scratch programs it is probably better to use the allocation functions
> with mbind directly.

The NUMA library interface should not be cluttered because of
considerations for legacy apps which need to be converted.  These are
separate issues; the design of the API must not be influenced by this.
The problem has always been that in such cases the correct interfaces
are not used but instead the "easier to use" legacy interfaces are
used.


>>Also, the concept of hard/soft sets for CPUs is useful.  Likewise
>>"spilling" over to other memory nodes.  Usually using NUMA means hinting
>>the desired configuration to the system.  It'll be used whenever
>>possible.  If it is not possible (for instance, if a given processor is
>>not available) it is mostly no good idea to completely fail the
> 
> 
> Agreed. That is why prefered and bind are different policies
> and you can switch between them in libnuma. 

That is inadequate.  Any process/thread state like this increases the
program cost since it means that the program at all times must remember
the current state and switch if necessary.  Combine this with 3rd party
libraries using the functionality as well and you'll get explicit
switching before *every* memory allocation because one cannot assume
anything about the state.  Even if the NUMA library keeps track of the
state internally, there is always the possibility that more than one
instance of the library is used at any one time (e.g., statically linked
into a DSO).

I repeat myself: global or thread-local states are bad.  Always have
been, always will be.


-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-04-30  8:30 ` William Lee Irwin III
@ 2004-05-03 18:37   ` Ulrich Drepper
  2004-05-04 10:01     ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Ulrich Drepper @ 2004-05-03 18:37 UTC (permalink / raw
  To: William Lee Irwin III; +Cc: Linux Kernel

William Lee Irwin III wrote:
> I very
> very strongly suggest that you take up each of these issues with him

And what exactly do you think this is about?


> so
> that they can be addressed as individual incremental improvements

That's not a possibility.  The interface is simply inadequate.

I do not claim to be the expert when it comes to all the fancy NUMA
functionality.  But I surely can recognize a broken library interface.
*That's* my concern.  I do not yet care too much about the kernel interface.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NUMA API
  2004-05-03 18:37   ` Ulrich Drepper
@ 2004-05-04 10:01     ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2004-05-04 10:01 UTC (permalink / raw
  To: Ulrich Drepper; +Cc: William Lee Irwin III, Linux Kernel

On Mon, May 03, 2004 at 11:37:41AM -0700, Ulrich Drepper wrote:
> I do not claim to be the expert when it comes to all the fancy NUMA
> functionality.  But I surely can recognize a broken library interface.
> *That's* my concern.  I do not yet care too much about the kernel interface.

Then it's rather off-topic for this list.  If you need additions and/or changes
to the kernel interface for your library we'd love to hear about that as early
as possible, though.


^ permalink raw reply	[flat|nested] 12+ messages in thread

Thread overview: 12+ messages
2004-04-30  7:35 NUMA API Ulrich Drepper
2004-04-30  8:30 ` William Lee Irwin III
2004-05-03 18:37   ` Ulrich Drepper
2004-05-04 10:01     ` Christoph Hellwig
2004-04-30  8:49 ` Paul Jackson
2004-04-30  9:50   ` William Lee Irwin III
2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
2004-05-03 17:57   ` Paul Jackson
     [not found] <1QAMU-4gf-15@gated-at.bofh.it>
2004-04-30 20:01 ` NUMA API Andi Kleen
2004-05-01  5:15   ` Martin J. Bligh
2004-05-03 18:34   ` Ulrich Drepper
2004-04-30 20:39 ` Andi Kleen
