From: r...@conectiva.com.br (Rik van Riel)
Subject: RFC: design for new VM
Date: 2000/08/02
Message-ID: <Pine.LNX.4.21.0008021212030.16377-100000@duckman.distro.conectiva>
X-Deja-AN: 653785019
Sender: owner-linux-ker...@vger.rutgers.edu
X-Sender: r...@duckman.distro.conectiva
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

[Linus: I'd really like to hear some comments from you on this idea]

Hi,

here is a (rough) draft of the design for the new VM, as
discussed at UKUUG and OLS. The design is heavily based
on the FreeBSD VM subsystem - a proven design - with some
tweaks where we think things can be improved. Some of the
ideas in this design are not fully developed, but none of
those "new" ideas are essential to the basic design.

The design is based around the following ideas:
- center-balanced page aging, using
    - multiple lists to balance the aging
    - a dynamic inactive target to adjust
      the balance to memory pressure
- physical page based aging, to avoid the "artifacts"
  of virtual page scanning
- separated page aging and dirty page flushing
    - kupdate flushing "old" data
    - kflushd syncing out dirty inactive pages
    - as long as there are enough (dirty) inactive pages,
      never mess up aging by searching for clean active
      pages ... even if we have to wait for disk IO to
      finish
- very light background aging under all circumstances, to
  avoid half-hour old referenced bits hanging around



		Center-balanced page aging:

- goals
    - always know which pages to replace next
    - don't spend too much overhead aging pages
    - do the right thing when the working set is
      big but swapping is very very light (or none)
    - always keep the working set in memory in
      favour of use-once cache

- page aging almost like in 2.0, only on a physical page basis
  (a rough sketch of the aging step follows below)
    - page->age starts at PAGE_AGE_START for new pages
    - if (referenced(page)) page->age += PAGE_AGE_ADV;
    - else page->age is made smaller (linear or exponential?)
    - if page->age == 0, move the page to the inactive list
    - NEW IDEA: age pages with a lower page age more often than
      pages we know to be in the working set
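
A minimal sketch of the aging step above (illustrative only: page->age,
PAGE_AGE_START and PAGE_AGE_ADV are the names used in this draft; the
exponential decay, the PAGE_AGE_MAX cap and the move_to_inactive() helper
are assumptions, not decided parts of the design):

	#define PAGE_AGE_START	2	/* age of a freshly allocated page */
	#define PAGE_AGE_ADV	3	/* bonus for a referenced page */
	#define PAGE_AGE_MAX	64

	/* called for each page we look at while scanning the active list */
	static void age_page(struct page *page)
	{
		if (test_and_clear_bit(PG_referenced, &page->flags)) {
			page->age += PAGE_AGE_ADV;
			if (page->age > PAGE_AGE_MAX)
				page->age = PAGE_AGE_MAX;
		} else {
			page->age >>= 1;		/* exponential decay */
		}

		if (page->age == 0)
			move_to_inactive(page);		/* inactive_dirty or _clean */
	}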

- data structures (page lists)
    - active list
        - per node/pgdat
        - contains pages with page->age > 0
        - pages may be mapped into processes
        - scanned and aged whenever we are short
          on free + inactive pages
        - maybe multiple lists for different ages,
          to be more resistant against streaming IO
          (and for lower overhead)
    - inactive_dirty list
        - per zone
        - contains dirty, old pages (page->age == 0)
        - pages are not mapped in any process
    - inactive_clean list
        - per zone
        - contains clean, old pages
        - can be reused by __alloc_pages, like free pages
        - pages are not mapped in any process
    - free list
        - per zone
        - contains pages with no useful data
        - we want to keep a few (dozen) of these around for
          recursive allocations

- other data structures
    - int memory_pressure
        - on page allocation or reclaim, memory_pressure++
        - on page freeing, memory_pressure--  (keep it >= 0, though)
        - decayed on a regular basis (eg. every second x -= x>>6)
        - used to determine inactive_target
    - inactive_target == one (two?) second(s) worth of memory_pressure,
      i.e. the number of page reclaims we expect to do in one second;
      per zone this translates into the targets:
        - free + inactive_clean >= zone->pages_high
        - free + inactive_clean + inactive_dirty >= zone->pages_high \
                + one_second_of_memory_pressure * (zone_size / memory_size)
    - inactive_target will be limited to some sane maximum
      (like num_physpages / 4)
      (a rough sketch of this bookkeeping follows below)
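
A rough sketch of this bookkeeping (illustrative only: memory_pressure,
inactive_target, num_physpages and zone->pages_high appear above; the
">> 6" reading of "one second worth", the zone field names and the
function names are assumptions):

	int memory_pressure;	/* ++ on alloc/reclaim, -- on free, decayed */
	int inactive_target;

	/* run once per second, e.g. by kswapd */
	void update_inactive_target(void)
	{
		/* slow exponential decay of the pressure average */
		memory_pressure -= memory_pressure >> 6;

		/* with the decay above the accumulator holds roughly 64
		   seconds worth of page reclaims, so >> 6 is one way to
		   read "one second worth of memory_pressure" */
		inactive_target = memory_pressure >> 6;

		/* ... limited to some sane maximum */
		if (inactive_target > num_physpages / 4)
			inactive_target = num_physpages / 4;
	}

	/* per-zone test: do we have enough (almost) free pages? */
	int zone_has_enough_inactive(zone_t *zone)
	{
		unsigned long idle = zone->free_pages + zone->inactive_clean_pages;

		if (idle < zone->pages_high)
			return 0;
		return idle + zone->inactive_dirty_pages >=
			zone->pages_high + inactive_target * zone->size / num_physpages;
	}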

The idea is that when we have enough old (inactive + free)
pages, we will NEVER move pages from the active list to the
inactive lists. We do that because we'd rather wait for some
IO completion than evict the wrong page.

Kflushd / bdflush will have the honourable task of syncing
the pages in the inactive_dirty list to disk before they
become an issue. We'll run balance_dirty over the set of
free + inactive_clean + inactive_dirty AND we'll try to
keep free+inactive_clean > pages_high .. failing either of
these conditions will cause bdflush to kick into action and
sync some pages to disk.

If memory_pressure is high and we're doing a lot of dirty
disk writes, the bdflush percentage will kick in and we'll
be doing extra-aggressive cleaning. In that case bdflush
will automatically become more aggressive the more page
replacement is going on, which is a good thing.



		Physical page based page aging

In the new VM we'll need to do physical page based page aging
for a number of reasons. Ben LaHaise said he already has code
to do this and it's "dead easy", so I take it this part of the
code won't be much of a problem.

The reasons we need to do aging on a physical page are:
    - avoid the virtual address based aging "artifacts"
    - more efficient, since we'll only scan what we need
      to scan  (especially when we'll test the idea of
      aging pages with a low age more often than pages
      we know to be in the working set)
    - more direct feedback loop, so less chance of
      screwing up the page aging balance



		IO clustering

IO clustering is not done by the VM code, but nicely abstracted
away into a page->mapping->flush(page) callback (sketched below).
This means that:
- each filesystem (and swap) can implement its own, isolated
  IO clustering scheme
- (in 2.5) we'll no longer have the buffer head list, but a list
  of pages to be written back to disk; this means doing stuff like
  delayed allocation (allocate on flush) or kiobuf based extents
  is fairly trivial to do
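
For illustration, the hook could look roughly like this (hypothetical: the
draft above only specifies page->mapping->flush(page); whether the pointer
lives in struct address_space itself or in its operations vector, and what
the return value means, is left open):

	struct address_space {
		...					/* existing fields */
		int (*flush)(struct page *page);	/* cluster + start writeback */
	};

	/* the only thing the VM itself ever does with a dirty page: */
	if (page->mapping && page->mapping->flush)
		page->mapping->flush(page);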



		Misc

Page aging and flushing are completely separated in this
scheme. We'll never end up aging and freeing a "wrong" clean
page because we're waiting for IO completion of old and
to-be-freed pages.

Write throttling comes quite naturally in this scheme. If we
have too many dirty inactive pages we'll write throttle. We
don't have to take dirty active pages into account since those
are not candidates for freeing anyway. Under light write loads
we will never write throttle (good) and under heavy write
loads the inactive_target will be bigger and write throttling
is more likely to kick in.

Some background page aging will always be done by the system.
We need to do this to clear away referenced bits every once in
a while. If we don't do this we can end up in the situation where,
once memory pressure kicks in, pages which haven't been referenced
in half an hour still have their referenced bit set and we have no
way of distinguishing between newly referenced pages and ancient
pages we really want to free.   (I believe this is one of the causes
of the "freeze" we can sometimes see in current kernels)



Over the next weeks (months?) I'll be working on implementing the
new VM subsystem for Linux, together with various other people
(Andrea Arcangeli??, Ben LaHaise, Juan Quintela, Stephen Tweedie).
I hope to have it ready in time for 2.5.0, but if the code turns
out to be significantly more stable under load than the current
2.4 code I won't hesitate to submit it for 2.4.bignum...

regards,

Rik
--
"What you're running that piece of s*** Gnome?!?!"
         -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: RFC: design for new VM
Date: 2000/08/03
Message-ID: <Pine.LNX.4.10.10008031020440.6384-100000@penguin.transmeta.com>
X-Deja-AN: 654115657
Sender: owner-linux-ker...@vger.rutgers.edu
References: <Pine.LNX.4.21.0008021212030.16377-100000@duckman.distro.conectiva>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu



On Wed, 2 Aug 2000, Rik van Riel wrote:
>
> [Linus: I'd really like to hear some comments from you on this idea]

I am completely and utterly baffled on why you think that the multi-list
approach would help balancing.

Every single indication we have ever had is that balancing gets _harder_
when you have multiple sources of pages, not easier.

As far as I can tell, the only advantage of multiple lists compared to the
current one is to avoid overhead in walking extra pages, no?

And yet you claim that you see no way to fix the current VM behaviour.

This is illogical, and sounds like complete crap to me.

Why don't you just do it with the current scheme (the only thing needed to
be added to the current scheme being the aging, which we've had before),
and prove that the _balancing_ works. If you can prove that the balancing
works but that we spend unnecessary time in scanning the pages, then
you've proven that the basic VM stuff is right, and then the multiple
queues becomes a performance optimization.

Yet you seem to sell the "multiple queues" idea as some fundamental
change. I don't see that. Please explain what makes your ideas so
radically different?

> The design is based around the following ideas:
> - center-balanced page aging, using
>     - multiple lists to balance the aging
>     - a dynamic inactive target to adjust
>       the balance to memory pressure
> - physical page based aging, to avoid the "artifacts"
>   of virtual page scanning
> - separated page aging and dirty page flushing
>     - kupdate flushing "old" data
>     - kflushd syncing out dirty inactive pages
>     - as long as there are enough (dirty) inactive pages,
>       never mess up aging by searching for clean active
>       pages ... even if we have to wait for disk IO to
>       finish
> - very light background aging under all circumstances, to
>   avoid half-hour old referenced bits hanging around

As far as I can tell, the above is _exactly_ equivalent to having one
single list, and multiple "scan-points" on that list. 

A "scan-point" is actually very easy to implement: anybody at all who
needs to scan the list can just include his own "anchor-page": a "struct
page_struct" that is purely local to that particular scanner, and that
nobody else will touch because it has an artificially elevated usage count
(and because there is actually no real page associated with that virtual
"struct page" the page count will obviously never decrease ;).

Then, each scanner just advances its own anchor-page around the list, and
does whatever it is that the scanner is designed to do on the page it
advances over. So "bdflush" would do

	..
	lock_list();
	struct page *page = advance(&bdflush_entry);
	if (page->buffers) {		/* something to write back? */
		get_page(page);		/* pin it before dropping the list lock */
		unlock_list();
		flush_page(page);
		continue;
	}
	unlock_list();
	..

while the page ager would do

	lock_list();
	struct page *page = advance(&ager_entry);	/* the ager's own anchor page */
	page->age = page->age >> 1;
	if (PageReferenced(page))
		page->age += PAGE_AGE_REF;
	unlock_list();

etc.. Basically, you can have any number of virtual "clocks" on a single
list.
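
The advance() helper assumed above could be as simple as this (a sketch: it
assumes the pages sit on a struct list_head called "lru" in struct page,
that the caller already holds the list lock, and it ignores the detail of
skipping the list head and other scanners' anchor pages):

	static struct page *advance(struct page *anchor)
	{
		/* the page right behind our anchor is the next one to look at */
		struct page *page = list_entry(anchor->lru.next, struct page, lru);

		/* step the anchor over it, so the next call sees the page after it */
		list_del(&anchor->lru);
		list_add(&anchor->lru, &page->lru);

		return page;
	}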

No radical changes necessary. This is something we can easily add to
2.4.x.

The reason I'm unconvinced about multiple lists is basically:

 - they are inflexible. Each list has a meaning, and a page cannot easily
   be on more than one list. It's really hard to implement overlapping
   meanings: you get exponential expansion of combinations, and everybody
   has to be aware of them.

   For example, imagine that the definition of "dirty" might be different
   for different filesystems.  Imagine that you have a filesystem with its
   own specific "walk the pages to flush out stuff", with special logic
   that is unique to that filesystem ("you cannot write out this page
   until you've done 'Y' or whatever). This is hard to do with your
   approach. It is trivial to do with the single-list approach above.

   More realistic (?) example: starting write-back of pages is very
   different from waiting on locked pages. We may want to have a "dirty
   but not yet started" list, and a "write-out started but not completed"
   locked list. Right now we use the same "clock" for them (the head of
   the LRU queue with some ugly heuristic to decide whether we want to
   wait on anything).

   But we potentially really want to have separate logic for this: we want
   to have a background "start writeout" that goes on all the time, and
   then we want to have a separate "start waiting" clock that uses
   different principles on which point in the list to _wait_ on stuff.

   This is what we used to have in the old buffer.c code (the 2.0 code
   that Alan likes). And it was _horrible_ to have separate lists, because
   in fact pages can be both dirty and locked and they really should have
   been on both lists etc..

 - in contrast, scan-points (without LRU, but instead working on the basis
   of the age of the page - which is logically equivalent) offer the
   potential for specialized scanners. You could have "statistics
   gathering robots" that you add dynamically. Or you could have
   per-device flush daemons.

   For example, imagine a common problem with floppies: we have a timeout
   for the floppy motor because it's costly to start them up again. And
   they are removable. A perfect floppy driver would notice when it is
   idle, and instead of turning off the motor it might decide to scan for
   dirty pages for the floppy on the (correct) assumption that it would be
   nice to have them all written back instead of turning off the motor and
   making the floppy look idle.

   With a per-device "dirty list" (which you can test out with a page
   scanner implementation to see if it ends up really improving floppy
   behaviour) you could essentially have a guarantee: whenever the floppy
   motor is turned off, the filesystem on that floppy is synced.
   Test implementation: floppy daemon that walks the list and turns off
   the engine only after having walked it without having seen any dirty
   blocks.

   In the end, maybe you realize that you _really_ don't want a dirty list
   at all. You want _multiple_ dirty lists, one per device.

   And that's really my point. I think you're too eager to rewrite things,
   and not interested enough in verifying that it's the right thing. Which
   I think you can do with the current one-list thing easily enough.

 - In the end, even if you don't need the extra flexibility of multiple
   clocks, splitting them up into separate lists doesn't change behaviour,
   it's "only" a CPU time optimization.

   Which may well be worth it, don't get me wrong. But I don't see why you
   tout this as being something radically needed in order to get better VM
   behaviour. Sure, multiple lists avoids the unnecessary walking over
   pages that we don't care about for some particular clock. And they may
   well end up being worth it for that reason. But it's not a very good
   way of doing prototyping of the actual _behaviour_ of the lists.

To make a long story short, I'd rather see a proof-of-concept thing. And I
distrust your notion that "we can't do it with the current setup, we'll
have to implement something radically different". 

Basically, IF you think that your newly designed VM should work, then you
should be able to prototype and prove it easily enough with the current
one. 

I'm personally of the opinion that people see that page aging etc is hard,
so they try to explain the current failures by claiming that it needs a
completely different approach. And in the end, I don't see what's so
radically different about it - it's just a re-organization. And as far as
I can see it is pretty much logically equivalent to just minor tweaks of
the current one.

(The _big_ change is actually the addition of a proper "age" field. THAT
is conceptually a very different approach to the matter. I agree 100% with
that, and the reason I don't get all that excited about it is just that we
_have_ done page aging before, and we dropped it for probably bad reasons,
and adding it back should not be that big of a deal. Probably less than 50
lines of diff).

Read Dilbert about the effectiveness of (and reasons for) re-organizations.

		Linus



From: Chris Wedgwood <c...@f00f.org>
Subject: Re: RFC: design for new VM
Date: 2000/08/03
Message-ID: <linux.kernel.20000803191906.B562@metastasis.f00f.org>#1/1
X-Deja-AN: 653924407
Approved: n...@nntp-server.caltech.edu
X-To: Rik van Riel <r...@conectiva.com.br>
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0
X-Cc: linux...@kvack.org, linux-ker...@vger.rutgers.edu, Linus Torvalds <torva...@transmeta.com>
Newsgroups: mlist.linux.kernel

On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:

    here is a (rough) draft of the design for the new VM, as
    discussed at UKUUG and OLS. The design is heavily based
    on the FreeBSD VM subsystem - a proven design - with some
    tweaks where we think things can be improved. 

Can the differences between your system and what FreeBSD has be
isolated or contained -- I ask this because the FreeBSD VM works
_very_ well compared to recent linux kernels; if/when the new system
is implemented it would be nice to know if performance differences are
tuning related or because of 'tweaks'.



  --cw


From: "Theodore Y. Ts'o" <ty...@MIT.EDU>
Subject: Re: RFC: design for new VM
Date: 2000/08/05
Message-ID: <linux.kernel.200008052248.SAA00643@tsx-prime.MIT.EDU>#1/1
X-Deja-AN: 654919808
Approved: n...@nntp-server.caltech.edu
X-To: Rik van Riel <r...@conectiva.com.br>
X-CC: Chris Wedgwood <c...@f00f.org>, linux...@kvack.org, 
linux-ker...@vger.rutgers.edu, Matthew Dillon <dil...@apollo.backplane.com>
Newsgroups: mlist.linux.kernel

   Date:   Thu, 3 Aug 2000 13:01:56 -0300 (BRST)
   From: Rik van Riel <r...@conectiva.com.br>

   You're right, the differences between FreeBSD VM and the new
   Linux VM should be clearly indicated.

   > I ask this because the FreeBSD VM works _very_ well compared to
   > recent linux kernels; if/when the new system is implement it
   > would nice to know if performance differences are tuning related
   > or because of 'tweaks'.

   Indeed. The amount of documentation (books? nah..) on VM
   is so sparse that it would be good to have both systems
   properly documented. That would fill a void in CS theory
   and documentation that was painfully there while I was
   trying to find useful information to help with the design
   of the new Linux VM...

... and you know, once written, it would make a *wonderful* paper to
present at Freenix or for ALS.... (speaking as someone who has been on
program committees for both conferences :-)

						- Ted


From: dil...@apollo.backplane.com (Matthew Dillon)
Subject: Re: RFC: design for new VM
Date: 2000/08/04
Message-ID: <200008041541.IAA88364@apollo.backplane.com>#1/1
X-Deja-AN: 654461672
Sender: owner-linux-ker...@vger.rutgers.edu
References: <Pine.LNX.4.21.0008031243070.24022-100000@duckman.distro.conectiva>
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

:>     here is a (rough) draft of the design for the new VM, as
:>     discussed at UKUUG and OLS. The design is heavily based
:>     on the FreeBSD VM subsystem - a proven design - with some
:>     tweaks where we think things can be improved. 
:> 
:> Can the differences between your system and what FreeBSD has be
:> isolated or contained
:
:You're right, the differences between FreeBSD VM and the new
:Linux VM should be clearly indicated.
:
:> I ask this because the FreeBSD VM works _very_ well compared to
:> recent linux kernels; if/when the new system is implement it
:> would nice to know if performance differences are tuning related
:> or because of 'tweaks'.
:
:Indeed. The amount of documentation (books? nah..) on VM
:is so sparse that it would be good to have both systems
:properly documented. That would fill a void in CS theory
:and documentation that was painfully there while I was
:trying to find useful information to help with the design
:of the new Linux VM...
:
:regards,
:
:Rik

    Three or four times in the last year I've gotten emails from 
    people looking for 'VM documentation' or 'books they could read'.
    I couldn't find a blessed thing!  Oh, sure, there are papers strewn
    about, but most are very focused on single aspects of a VM design.
    I have yet to find anything that covers the whole thing.  I've written
    up an occasional 'summary piece' for FreeBSD, e.g. the Jan 2000 Daemon
    News article, but that really isn't adequate.

    The new Linux VM design looks exciting!  I will be paying close 
    attention to your progress with an eye towards reworking some of
    FreeBSD's code.  Except for one or two eyesores (1) the FreeBSD code is
    algorithmically sound, but pieces of the implementation are rather
    messy from years of patching.  When I first started working on it
    the existing crew had a big bent towards patching rather than
    rewriting and I had to really push to get some of my rewrites
    through.  The patching had reached the limits of the original 
    code-base's flexibility.

    note(1) - the one that came up just last week was the O(N) nature
    of the FreeBSD VM maps (linux uses an AVL tree here).  These work
    fine for 95% of the apps out there but turn into a sludgepile for
    things like malloc debuggers and distributed shared memory systems
    which want to mprotect() on a page-by-page basis.   The second eyesore
    is the lack of physically shared page table segments for 'standard'
    processes.  At the moment, it's an all (rfork/RFMEM/clone) or nothing
    (fork) deal.  Physical segment sharing outside of clone is something
    Linux could use too; I don't think it does it either.  It's not easy to
    do right.

					-Matt
					Matthew Dillon 
					<dil...@backplane.com>


From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: RFC: design for new VM
Date: 2000/08/04
Message-ID: <Pine.LNX.4.10.10008041033230.813-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 654504126
Sender: owner-linux-ker...@vger.rutgers.edu
References: <200008041541.IAA88364@apollo.backplane.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu



On Fri, 4 Aug 2000, Matthew Dillon wrote:
> 
>   							  The second eyesore
>     is the lack of physically shared page table segments for 'standard'
>     processes.  At the moment, it's an all (rfork/RFMEM/clone) or nothing
>     (fork) deal.  Physical segment sharing outside of clone is something
>     Linux could use to, I don't think it does it either.  It's not easy to
>     do right.

It's probably impossible to do right. Basically, if you do it, you do it
wrong.

As far as I can tell, you basically screw yourself on the TLB and locking
if you ever try to implement this. And frankly I don't see how you could
avoid getting screwed.

There are architecture-specific special cases, of course. On ia64, the
page table is not really one page table, it's a number of pretty much
independent page tables, and it would be possible to extend the notion of
fork vs clone to be a per-page-table thing (ie the single-bit thing would
become a multi-bit thing, and the single "struct mm_struct" would become
an array of independent mm's).

You could do similar tricks on x86 by virtually splitting up the page
directory into independent (fixed-size) pieces - this is similar to what
the PAE stuff does in hardware, after all. So you could have (for example)
each process be quartered up into four address spaces with the top two
address bits being the address space sub-ID.

Quite frankly, it tends to be a nightmare to do that. It's also
unportable: it works on architectures that either support it natively
(like the ia64 that has the split page tables because of how it covers
large VM areas) or by "faking" the split on regular page tables. But it
does _not_ work very well at all on CPU's where the native page table is
actually a hash (old sparc, ppc, and the "other mode" in IA64). Unless the
hash happens to have some of the high bits map into a VM ID (which is
common, but not really something you can depend on).

And even when it "works" by emulation, you can't share the TLB contents
anyway. Again, it can be possible on a per-architecture basis (if the
different regions can have different ASI's - ia64 again does this, and I
think it originally comes from the 64-bit PA-RISC VM stuff). But it's one
of those bad ideas that if people start depending on it, it simply won't
work that well on some architectures. And one of the beauties of UNIX is
that it truly is fairly architecture-neutral.

And that's just the page table handling. The SMP locking for all this
looks even worse - you can't share a per-mm lock like with the clone()
thing, so you have to create some other locking mechanism. 

I'd be interested to hear if you have some great idea (ie "oh, if you look
at it _this_ way all your concerns go away"), but I suspect you have only
looked at it from 10,000 feet and thought "that would be a cool thing".
And I suspect it ends up being anything _but_ cool once actually
implemented.

			Linus



From: dil...@apollo.backplane.com (Matthew Dillon)
Subject: Re: RFC: design for new VM
Date: 2000/08/04
Message-ID: <200008042351.QAA89101@apollo.backplane.com>#1/1
X-Deja-AN: 654613804
Sender: owner-linux-ker...@vger.rutgers.edu
References: <Pine.LNX.4.10.10008041033230.813-100000@penguin.transmeta.com>
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu


:>     (fork) deal.  Physical segment sharing outside of clone is something
:>     Linux could use to, I don't think it does it either.  It's not easy to
:>     do right.
:
:It's probably impossible to do right. Basically, if you do it, you do it
:wrong.
:
:As far as I can tell, you basically screw yourself on the TLB and locking
:if you ever try to implement this. And frankly I don't see how you could
:avoid getting screwed.
:
:There are architecture-specific special cases, of course. On ia64, the
:..

    I spent a weekend a few months ago trying to implement page table 
    sharing in FreeBSD -- and gave up, but it left me with the feeling
    that it should be possible to do without polluting the general VM
    architecture.

    For IA32, what it comes down to is that the page table generated by
    any segment-aligned mmap() (segment == 4MB) made by two processes 
    should be shareable, simply by sharing the page directory entry (and thus
    the physical page representing 4MB worth of mappings).  This would be
    restricted to MAP_SHARED mappings with the same protections, but the two
    processes would not have to map the segments at the same VM address, they
    need only be segment-aligned.

    This would be a transparent optimization wholly invisible to the process,
    something that would be optionally implemented in the machine-dependent
    part of the VM code (with general support in the machine-independent
    part for the concept).  If the process did anything to create a mapping
    mismatch, such as call mprotect(), the shared page table would be split.
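
    To make the conditions concrete, the sharing test might look something
    like this (purely illustrative: struct mapping_desc and its fields are
    invented here, and a real version would live in the machine-dependent
    pmap code):

	#define PDE_SPAN	(4UL << 20)	/* one pde covers 4MB of mappings */

	struct mapping_desc {			/* invented for this sketch */
		void *object;			/* backing object (file, shm, ...) */
		unsigned long vaddr, offset, length;
		int prot;
		int shared;			/* MAP_SHARED? */
	};

	static int can_share_pde(struct mapping_desc *a, struct mapping_desc *b)
	{
		return a->object == b->object &&
		       a->shared && b->shared &&
		       a->prot == b->prot &&
		       a->offset == b->offset &&
		       !(a->vaddr & (PDE_SPAN - 1)) &&
		       !(b->vaddr & (PDE_SPAN - 1)) &&
		       !(a->offset & (PDE_SPAN - 1)) &&
		       a->length >= PDE_SPAN && b->length >= PDE_SPAN;
	}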

    The problem being solved for FreeBSD is actually quite serious -- due to
    FreeBSD's tracking of individual page table entries, being able to share
    a page table would radically reduce the amount of tracking information
    required for any large shared areas (shared libraries, large shared file
    mappings, large sysv shared memory mappings).  For linux the problem is
    relatively minor - linux would save considerable page table memory.
    Linux is still reasonably scalable without the optimization while 
    FreeBSD currently falls on its face for truly huge shared mappings
    (e.g. 300 processes all mapping a shared 1GB memory area, aka Oracle 8i).
    (Linux falls on its face for other reasons, mainly the fact that it
    maps all of physical memory into KVM in order to manage it).

    I think the loss of MP locking for this situation is outweighed by the
    benefit of a huge reduction in page faults -- rather than see 300 
    processes each take a page fault on the same page, only the first process
    would and the pte would already be in place when the others got to it.
    When it comes right down to it, page faults on shared data sets are not
    really an issue for MP scalability.

    In any case, this is a 'dream' for me for FreeBSD right now.  It's a very 
    difficult problem to solve.

						-Matt




From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: RFC: design for new VM
Date: 2000/08/05
Message-ID: <Pine.LNX.4.10.10008041655420.11340-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 654617173
Sender: owner-linux-ker...@vger.rutgers.edu
References: <200008042351.QAA89101@apollo.backplane.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu



On Fri, 4 Aug 2000, Matthew Dillon wrote:
> :
> :There are architecture-specific special cases, of course. On ia64, the
> :..
> 
>     I spent a weekend a few months ago trying to implement page table 
>     sharing in FreeBSD -- and gave up, but it left me with the feeling
>     that it should be possible to do without polluting the general VM
>     architecture.
> 
>     For IA32, what it comes down to is that the page table generated by
>     any segment-aligned mmap() (segment == 4MB) made by two processes 
>     should be shareable, simply be sharing the page directory entry (and thus
>     the physical page representing 4MB worth of mappings).  This would be
>     restricted to MAP_SHARED mappings with the same protections, but the two
>     processes would not have to map the segments at the same VM address, they
>     need only be segment-aligned.

I agree that from a page table standpoint you should be correct. 

I don't think that the other issues are as easily resolved, though.
Especially with address space ID's on other architectures it can get
_really_ interesting to do TLB invalidates correctly to other CPU's etc
(you need to keep track of who shares parts of your page tables etc).

>     This would be a transparent optimization wholely invisible to the process,
>     something that would be optionally implemented in the machine-dependant
>     part of the VM code (with general support in the machine-independant
>     part for the concept).  If the process did anything to create a mapping
>     mismatch, such as call mprotect(), the shared page table would be split.

Right. But what about the TLB?

It's not a problem on the x86, because the x86 doesn't have ASN's anyway.
But for it to be a valid notion, I feel that it should be able to be
portable too.

You have to have some page table locking mechanism for SMP eventually: I
think you miss some of the problems because the current FreeBSD SMP stuff
is mostly still "big kernel lock" (outdated info?), and you'll end up
kicking yourself in a big way when you have the 300 processes sharing the
same lock for that region..

(Not that I think you'd necessarily have much contention on the lock - the
problem tends to be more in the logistics of keeping track of the locks of
partial VM regions etc).

>     (Linux falls on its face for other reasons, mainly the fact that it
>     maps all of physical memory into KVM in order to manage it).

Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)

>     I think the loss of MP locking for this situation is outweighed by the
>     benefit of a huge reduction in page faults -- rather then see 300 
>     processes each take a page fault on the same page, only the first process
>     would and the pte would already be in place when the others got to it.
>     When it comes right down to it, page faults on shared data sets are not
>     really an issue for MP scaleability.

I think you'll find that there are all these small details that just
cannot be solved cleanly. Do you want to be stuck with a x86-only
solution?

That said, I cannot honestly say that I have tried very hard to come up
with solutions. I just have this feeling that it's a dark ugly hole that I
wouldn't want to go down..

			Linus



From: dil...@apollo.backplane.com (Matthew Dillon)
Subject: Re: RFC: design for new VM
Date: 2000/08/05
Message-ID: <200008050152.SAA89298@apollo.backplane.com>#1/1
X-Deja-AN: 654639522
Sender: owner-linux-ker...@vger.rutgers.edu
References: <Pine.LNX.4.10.10008041655420.11340-100000@penguin.transmeta.com>
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu


:I agree that from a page table standpoint you should be correct. 
:
:I don't think that the other issues are as easily resolved, though.
:Especially with address space ID's on other architectures it can get
:_really_ interesting to do TLB invalidates correctly to other CPU's etc
:(you need to keep track of who shares parts of your page tables etc).
:
:...
:>     mismatch, such as call mprotect(), the shared page table would be split.
:
:Right. But what about the TLB?

    I'm not advocating trying to share TLB entries, that would be 
    a disaster.  I'm contemplating just the physical page table structure.
    e.g. if you mmap() a 1GB file shared (or private read-only) into 300
    independent processes, it should be possible to share all the meta-data
    required to support that mapping except for the TLB entries themselves.
    ASNs shouldn't make a difference... presumably the tags on the TLB
    entries are added on after the metadata lookup.  I'm also not advocating
    attempting to share intermediate 'partial' in-memory TLB caches (hash
    tables or other structures).  Those are typically fixed in size,
    per-cpu, and would not be impacted by scale.

:You have to have some page table locking mechanism for SMP eventually: I
:think you miss some of the problems because the current FreeBSD SMP stuff
:is mostly still "big kernel lock" (outdated info?), and you'll end up
:kicking yourself in a big way when you have the 300 processes sharing the
:same lock for that region..

    If it were a long-held lock I'd worry, but if it's a lock on a pte
    I don't think it can hurt.  After all, even with separate page tables
    if 300 processes fault on the same backing file offset you are going
    to hit a bottleneck with MP locking anyway, just at a deeper level
    (the filesystem rather than the VM system).  The BSDI folks did a lot
    of testing with their fine-grained MP implementation and found that
    putting a global lock around the entire VM system had absolutely no 
    impact on MP performance.

:>     (Linux falls on its face for other reasons, mainly the fact that it
:>     maps all of physical memory into KVM in order to manage it).
:
:Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)

    Oh, that's cool!  I don't think anyone in FreeBSDland has bothered with
    large-memory (> 4GB) memory configurations, there doesn't seem to be 
    much demand for such a thing on IA32.

:>     I think the loss of MP locking for this situation is outweighed by the
:>     benefit of a huge reduction in page faults -- rather then see 300 
:>     processes each take a page fault on the same page, only the first process
:>     would and the pte would already be in place when the others got to it.
:>     When it comes right down to it, page faults on shared data sets are not
:>     really an issue for MP scaleability.
:
:I think you'll find that there are all these small details that just
:cannot be solved cleanly. Do you want to be stuck with a x86-only
:solution?
:
:That said, I cannot honestly say that I have tried very hard to come up
:with solutions. I just have this feeling that it's a dark ugly hole that I
:wouldn't want to go down..
:
:			Linus

    Well, I don't think this is x86-specific.  Or, that is, I don't think it
    would pollute the machine-independent code.  FreeBSD has virtually no
    notion of 'page tables' outside the i386-specific VM files... it doesn't
    use page tables (or two-level page-like tables... is Linux still using
    those?) to store meta information at all in the higher levels of the
    kernel.  It uses architecture-independent VM objects and vm_map_entry
    structures for that.  Physical page tables on FreeBSD are 
    throw-away-at-any-time entities.  The actual implementation of the
    'page table' in the IA32 sense occurs entirely in the machine-dependent
    subdirectory for IA32.  

    A page-table sharing mechanism would have to implement the knowledge --
    the 'potential' for sharing at a higher level (the vm_map_entry 
    structure), but it would be up to the machine-dependent VM code to
    implement any actual sharing given that knowledge.  So while the specific
    implementation for IA32 is definitely machine-specific, it would have
    no effect on other OS ports (of course, we have only one other
    working port at the moment, to the alpha, but you get the idea).

					-Matt
					Matthew Dillon 
					<dil...@backplane.com>



From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: RFC: design for new VM
Date: 2000/08/05
Message-ID: <Pine.LNX.4.10.10008041854240.1727-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 654642040
Sender: owner-linux-ker...@vger.rutgers.edu
References: <200008050152.SAA89298@apollo.backplane.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu



On Fri, 4 Aug 2000, Matthew Dillon wrote:
> :
> :Right. But what about the TLB?
> 
>     I'm not advocating trying to share TLB entries, that would be 
>     a disaster.

You might have to, if the machine has a virtually mapped cache.. 

Ugh. That gets too ugly to even contemplate, actually. Just forget the
idea.

>     If it were a long-held lock I'd worry, but if it's a lock on a pte
>     I don't think it can hurt.  After all, even with separate page tables
>     if 300 processes fault on the same backing file offset you are going
>     to hit a bottleneck with MP locking anyway, just at a deeper level
>     (the filesystem rather then the VM system).  The BSDI folks did a lot
>     of testing with their fine-grained MP implementation and found that
>     putting a global lock around the entire VM system had absolutely no 
>     impact on MP performance.

Hmm.. That may be load-dependent, but I know it wasn't true for Linux. The
kernel lock for things like brk() were some of the worst offenders, and
people worked hard on making mmap() and friends not need the BKL exactly
because it showed up very clearly in the lock profiles.

> :>     (Linux falls on its face for other reasons, mainly the fact that it
> :>     maps all of physical memory into KVM in order to manage it).
> :
> :Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)
> 
>     Oh, that's cool!  I don't think anyone in FreeBSDland has bothered with
>     large-memory (> 4GB) memory configurations, there doesn't seem to be 
>     much demand for such a thing on IA32.

Not normally no. Linux didn't start seeing the requirement until last year
or so, when running big databases and big benchmarks just required it
because the working set was so big. "dbench" with a lot of clients etc.

Now, whether such a working set is realistic or not is another issue, of
course. 64GB isn't as much memory as it used to be, though, and we
couldn't have beaten the mindcraft NT numbers without large memory
support.

>     Well, I don't think this is x86-specific.  Or, that is, I don't think it
>     would pollute the machine-independant code.  FreeBSD has virtually no
>     notion of 'page tables' outside the i386-specific VM files... it doesn't
>     use page tables (or two-level page-like tables... is Linux still using
>     those?) to store meta information at all in the higher levels of the
>     kernel.  It uses architecture-independant VM objects and vm_map_entry
>     structures for that.  Physical page tables on FreeBSD are 
>     throw-away-at-any-time entities.  The actual implementation of the
>     'page table' in the IA32 sense occurs entirely in the machine-dependant
>     subdirectory for IA32.  

It's not the page tables themselves I worry about, but all the meta-data
synchronization requirements. But hey. Go wild, prove me wrong.

		Linus



From: l...@veszprog.hu (Gabor Lenart)
Subject: Re: RFC: design for new VM
Date: 2000/08/07
Message-ID: <20000807121145.D2872@veszprog.hu>#1/1
X-Deja-AN: 655365329
X-Operating-System: galaxy Linux 2.2.16 i686
Content-Transfer-Encoding: QUOTED-PRINTABLE
Sender: owner-linux-ker...@vger.rutgers.edu
References: <Pine.LNX.4.21.0008021212030.16377-100000@duckman.distro.conectiva> 
<20000803191906.B562@metastasis.f00f.org>
Content-Type: text/plain; charset=iso-8859-2
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, Aug 03, 2000 at 07:19:06PM +1200, Chris Wedgwood wrote:
> On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:
>
>     here is a (rough) draft of the design for the new VM, as
>     discussed at UKUUG and OLS. The design is heavily based
>     on the FreeBSD VM subsystem - a proven design - with some
>     tweaks where we think things can be improved.
>
> Can the differences between your system and what FreeBSD has be
> isolated or contained -- I ask this because the FreeBSD VM works
> _very_ well compared to recent linux kernels; if/when the new system
> is implement it would nice to know if performance differences are
> tuning related or because of 'tweaks'.

A little question. AFAIK Linux needs less memory than FreeBSD. Will the
new FreeBSD-like VM mean Linux no longer works on the little machines at
our Univ that run Linux because they're too weak to run FreeBSD?  (The
previous sysadmin ran FreeBSD everywhere, and only those machines couldn't
run FreeBSD as fast as Linux.)

--
 +-[ Lénárt Gábor ]----[ http://lgb.supervisor.hu/ ]------[+36 30 2270823 ]--+
 |--UNIX--OpenSource-->  The future is in our hands.          <--LME--Linux--|
 +-----[ Veszprog Kft ]------[ Supervisor BT ]-------[ Expertus Kft ]--------+


From: a...@lxorguk.ukuu.org.uk (Alan Cox)
Subject: Re: RFC: design for new VM
Date: 2000/08/07
Message-ID: <E13Lld0-0003XX-00@the-village.bc.nu>#1/1
X-Deja-AN: 655392425
Content-Transfer-Encoding: 7bit
Sender: owner-linux-ker...@vger.rutgers.edu
References: <20000807121145.D2872@veszprog.hu>
Content-Type: text/plain; charset=us-ascii
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

> A little question. AFAIK Linux needs less memory than FreeBSD. The new

It depends what you are doing. Especially with newer BSD

> FreeBSD like VM will casue Linux won't work on little machines which
> uses Linux at our Univ because they're too powerless for running FreeBSD
> (the pervious sysadm ran FreeBSD everywhere and only that machines couldn't
> run FreeBSD as fast as Linux).

The VM changes will make the small boxes run faster if done right. At least
page aging worked right on 2.0 !



From: Gerrit.Huize...@us.ibm.com
Subject: Re: RFC: design for new VM
Date: 2000/08/07
Message-ID: <200008071740.KAA25895@eng2.sequent.com>
X-Deja-AN: 655502734
Sender: owner-linux-ker...@vger.rutgers.edu
References: <8725692F.0079E22B.00@d53mta03h.boulder.ibm.com>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Reply-To: Gerrit.Huize...@us.ibm.com
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

Hi Rik,

I have a few comments on your RFC for VM.  Some are simply
observational, some are based on our experience locally with the
development, deployment and maintenance of a VM subsystem here at IBM
NUMA-Q (formerly Sequent Computer Systems, Inc.).  As you may remember,
our VM subsystem was initially designed in ~1982-1984 to operate on 30
processor SMP machines, and in roughly 1993-1995 it was updated to
support NUMA systems up to 64 processors.  Our machines started with ~1
GB of physical memory, and today support up to 64 GB of physical memory
on a 32-64 processor machine.  These machines run a single operating
system (DYNIX/ptx) which is derived originally from BSD 4.2, although
the VM subsystem has been completely rewritten over the years.

Along the way, we learned many things about memory latency, large
memory support, SMP & NUMA issues, some of which may be useful to
you in your current design effort.

First, and perhaps foremost, I believe your design deals almost
exclusively with page aging & page replacement algorithms, rather
than being a complete VM redesign, although feel free to correct
me if I have misconstrued that.  For instance, I don't believe you
are planning to redo the 3 or 4 tier page table layering as part
of your effort, nor are you changing memory allocation routines in
any kernel-visible way.  I also don't see any modifications to kernel
pools, general memory management of free pages (e.g. AVL trees vs. 
linked lists), any changes to the PAE mechanism currently in use,
no reference to alternate page sizes (e.g. Intel PSE), buffer/page
cache organization, etc.  I also see nothing in the design which
reduces the needs for global TLB flushes across this system, which
is one area where I believe Linux is starting to suffer as CPU counts
increase.  I believe a full VM redesign would tend to address all of
these issues, even if it did so in a completely modular fashion.

I also note that you intend to draw heavily from the FreeBSD
implementation.  Two areas in which to be very careful here have
already been mentioned, but they are worth restating:  FreeBSD
has little to no SMP experience (e.g. kernel big lock) and little
to no large memory experience.  I believe Linux is actually slightly
more advanced in both of these areas, and a good redesign should
preserve and/or improve on those capabilities.

I believe that your current proposed aging mechanism, while perhaps
a positive refinement of what currently exists, still suffers from
a fundamental problem in that you are globally managing page aging.
In both large memory systems and in SMP systems, scalability is
greatly enhanced if major capabilities like page aging can in some
way be localized.  One mechanism might be to use something like
per-CPU zones from which private pages are typically allocated from
and freed to.  This, in conjunction with good scheduler affinity,
maximizes the benefits of any CPU L1/L2 cache.  Another mechanism,
and the one that we chose in our operating system, was to use modified
per-process resident set sizes as the mechanism for page management.  The
basic modifications are to make the RSS tuneable system wide as well
as per process.  The RSS size "flexes" based on available memory and
a process's page fault frequency (PFF).  Frequent page faults force the
RSS to increase, infrequent page faults cause a process's resident size
to shrink.  When memory pressure mounts, the running process manages
itself a little more aggressively; processes which have "flexed"
their resident set size beyond their system or per-process recommended
maxima are among the first to lose pages.  And when pressure cannot
be addressed by RSS management, swapping starts.
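
A sketch of what that flexing might look like (all names, fields and
thresholds here are invented for illustration; the DYNIX/ptx implementation
is of course more involved):

	#define PFF_HIGH	32	/* faults/interval: process is starved */
	#define PFF_LOW		4	/* faults/interval: process has plenty */
	#define RSS_STEP	64	/* pages to grow or shrink the target by */

	struct task_rss {
		unsigned long rss;		/* pages currently resident */
		unsigned long rss_target;	/* the "flexed" target */
		unsigned long rss_max;		/* per-process / system maximum */
		unsigned long faults;		/* page faults this interval */
	};

	/* run at the end of every sampling interval */
	static void flex_rss(struct task_rss *t)
	{
		if (t->faults > PFF_HIGH)
			t->rss_target += RSS_STEP;	/* faulting a lot: grow */
		else if (t->faults < PFF_LOW && t->rss_target > RSS_STEP)
			t->rss_target -= RSS_STEP;	/* mostly idle: shrink */
		t->faults = 0;
	}

	/* under memory pressure, steal first from processes that are over
	   budget; only when that is not enough does swapping start */
	static int over_budget(struct task_rss *t)
	{
		unsigned long limit = t->rss_target < t->rss_max ?
					t->rss_target : t->rss_max;
		return t->rss > limit;
	}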

Another fundamental flaw I see with both the current page aging mechanism
and the proposed mechanism is that workloads which exhaust memory pay
no penalty at all until memory is full.  Then there is a sharp spike
in the amount of (slow) IO as pages are flushed, processes are swapped,
etc.  There is no apparent smoothing of spikes, such as increasing the
rate of IO as the rate of memory pressure increases.  With the exception
of laptops, most machines can sustain a small amount of background
asynchronous IO without affecting performance (laptops may want IO
batched to maximize battery life).  I would propose that as memory
pressure increases, paging/swapping IO should increase somewhat
proportionally.  This provides some smoothing for the bursty nature of
most single user or small ISP workloads.  I believe database-style
loads on larger machines would also benefit.
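
One very simple way to express that (illustrative only; the names and the
linear scaling are assumptions):

	#define MIN_FLUSH	  8	/* pages per wakeup when memory is idle */
	#define MAX_FLUSH	256	/* pages per wakeup when memory is tight */

	/* how much background writeback to do this wakeup, given how close
	   we are to running out of freeable memory (0..100 percent) */
	static int pages_to_flush(int pressure_pct)
	{
		return MIN_FLUSH + (MAX_FLUSH - MIN_FLUSH) * pressure_pct / 100;
	}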

Your current design does not address SMP locking at all.  I would
suggest that a single VM lock would provide reasonable scalability
up to about 16 processors, depending on page size, memory size, processor
speed, and the ratio of processor speed to memory bandwidth.  One
method for stretching that lock is to use zoned, per-processor (or
per-node) data for local page allocations whenever possible.  Then
local allocations can use minimal locking (need only to protect from
memory allocations in interrupt code).  Further, the layout of memory
in a bitmapped, power-of-2-sized "buddy system" can speed allocations,
reducing the amount of time during which a critical lock needs to be
held.  AVL trees will perform similarly well, with the exception that
a resource bitmap tends to be easier on TLB entries and processor
cache.  A bitmapped allocator may also be useful in more efficiently
allocating pages of variable sizes on a CPU which supports variable
sized pages in hardware.
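
A sketch of the per-CPU allocation idea (illustrative: the structures, the
reuse of a "lru" list_head in struct page and alloc_from_global_pool() are
invented; initialisation and refilling of the local lists are left out):

	static struct percpu_pages {
		struct list_head free_list;
		unsigned long nr_free;
	} percpu_pages[NR_CPUS];

	static struct page *alloc_local_page(void)
	{
		struct percpu_pages *pp = &percpu_pages[smp_processor_id()];
		struct page *page = NULL;
		unsigned long flags;

		/* only interrupt-context allocations on this CPU can race us */
		local_irq_save(flags);
		if (pp->nr_free) {
			page = list_entry(pp->free_list.next, struct page, lru);
			list_del(&page->lru);
			pp->nr_free--;
		}
		local_irq_restore(flags);

		if (!page)
			page = alloc_from_global_pool();	/* takes the real lock */
		return page;
	}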

Also, I note that your filesys->flush() mechanism utilizes a call
per page.  This is an interesting capability, although I'd question
the processor efficiency of a page granularity here.  On large memory
systems, with large processes starting (e.g. Netscape, StarOffice, or
possibly a database client), it seems like a callback to a filesystem
which said something like flush("I must have at least 10 pages from
you", "and I'd really like 100 pages") might be a better way to
use this advisory capability.  You've already pointed out that a
specific page might be requested but other pages may be freed
instead; this may be a more explicit way to code the policy
you really want.

It would also be interesting to review the data structure you intend
to use in terms of cache line layout, as well as look at the algorithms
which use those structures with an eye towards minimizing page & cache
hits for both SMP *and* single processor efficiency.

Hope this is of some help,

Gerrit Huizenga
IBM NUMA-Q (nee' Sequent)
Gerrit.Huize...@us.ibm.com


From: c...@monkey.org (Chuck Lever)
Subject: Re: RFC: design for new VM
Date: 2000/08/07
Message-ID: <Pine.BSO.4.20.0008071641300.2595-100000@naughty.monkey.org>#1/1
X-Deja-AN: 655572257
Sender: owner-linux-ker...@vger.rutgers.edu
References: <200008071740.KAA25895@eng2.sequent.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Reply-To: chuckle...@bigfoot.com
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

hi gerrit-

good to see you on the list.

On Mon, 7 Aug 2000 Gerrit.Huize...@us.ibm.com wrote:
> Another fundamental flaw I see with both the current page aging mechanism
> and the proposed mechanism is that workloads which exhaust memory pay
> no penalty at all until memory is full.  Then there is a sharp spike
> in the amount of (slow) IO as pages are flushed, processes are swapped,
> etc.  There is no apparent smoothing of spikes, such as increasing the
> rate of IO as the rate of memory pressure increases.  With the exception
> of laptops, most machines can sustain a small amount of background
> asynchronous IO without affecting performance (laptops may want IO
> batched to maximize battery life).  I would propose that as memory
> pressure increases, paging/swapping IO should increase somewhat
> proportionally.  This provides some smoothing for the bursty nature of
> most single user or small ISP workloads.  I believe databases style
> loads on larger machines would also benefit.

2 comments here.

1.  kswapd runs in the background and wakes up every so often to handle
the corner cases that smooth bursty memory request workloads.  it executes
the same code that is invoked from the kernel's memory allocator to
reclaim pages.

2.  i agree with you that when the system exhausts memory, it hits a hard
knee; it would be better to soften this.  however, the VM system is
designed to optimize the case where the system has enough memory.  in
other words, it is designed to avoid unnecessary work when there is no
need to reclaim memory.  this design was optimized for a desktop workload,
like the scheduler or ext2 "async" mode.  if i can paraphrase other
comments i've heard on these lists, it epitomizes a basic design
philosophy: "to optimize the common case gains the most performance
advantage."

can a soft-knee swapping algorithm be demonstrated that doesn't impact the
performance of applications running on a system that hasn't exhausted its
memory?

	- Chuck Lever
--
corporate:	<chu...@netscape.com>
personal:	<chuckle...@bigfoot.com>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/



From: Rik van Riel <r...@conectiva.com.br>
Subject: Re: RFC: design for new VM 
Date: 2000/08/07
Message-ID: <linux.kernel.Pine.LNX.4.21.0008071844100.25008-100000@duckman.distro.conectiva>#1/1
X-Deja-AN: 655621932
Approved: n...@nntp-server.caltech.edu
X-To: chuckle...@bigfoot.com
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: Gerrit.Huize...@us.ibm.com, linux...@kvack.org, 
linux-ker...@vger.rutgers.edu, Linus Torvalds <torva...@transmeta.com>
Newsgroups: mlist.linux.kernel

On Mon, 7 Aug 2000, Chuck Lever wrote:
> On Mon, 7 Aug 2000 Gerrit.Huize...@us.ibm.com wrote:
> > Another fundamental flaw I see with both the current page aging mechanism
> > and the proposed mechanism is that workloads which exhaust memory pay
> > no penalty at all until memory is full.  Then there is a sharp spike
> > in the amount of (slow) IO as pages are flushed, processes are swapped,
> > etc.  There is no apparent smoothing of spikes, such as increasing the
> > rate of IO as the rate of memory pressure increases.  With the exception
> > of laptops, most machines can sustain a small amount of background
> > asynchronous IO without affecting performance (laptops may want IO
> > batched to maximize battery life).  I would propose that as memory
> > pressure increases, paging/swapping IO should increase somewhat
> > proportionally.  This provides some smoothing for the bursty nature of
> > most single user or small ISP workloads.  I believe database-style
> > loads on larger machines would also benefit.
> 
> 2 comments here.
> 
> 1.  kswapd runs in the background and wakes up every so often to handle
> the corner cases that smooth bursty memory request workloads.  it executes
> the same code that is invoked from the kernel's memory allocator to
> reclaim pages.

*nod*

The idea is that the memory_pressure variable indicates how
much page stealing is going on (on average), so every time
kswapd wakes up it knows how many pages to steal. That way
it should (if we're "lucky") free enough pages to get us
along until the next time kswapd wakes up.
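
Roughly, in made-up C (just a sketch, not the actual code;
reclaim_one_inactive_page() here is only a placeholder for whatever
the real reclaim path ends up being):

/* illustrative sketch, not kernel code */
static int memory_pressure;	/* decayed count of recent page steals + allocations */

static int reclaim_one_inactive_page(void)
{
	/* stand-in for the real reclaim; returns 0 once nothing is left */
	return 0;
}

/* one kswapd wakeup */
static void kswapd_once(void)
{
	/*
	 * memory_pressure approximates how many pages were stolen or
	 * allocated recently, so freeing about that many now should
	 * carry us to the next wakeup without the allocator having to
	 * reclaim synchronously.
	 */
	int target = memory_pressure;

	while (target > 0 && reclaim_one_inactive_page())
		target--;
}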

> 2.  i agree with you that when the system exhausts memory, it
> hits a hard knee; it would be better to soften this.

The memory_pressure variable is there to ease this. If the load
is bursty, but more or less constant on a somewhat longer timescale
(say one minute), then we'll average the inactive_target out to
somewhere between one and two seconds' worth of page steals.
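
Something like this is what I mean by averaging (again only a
sketch; the decay factor and the exact relation between
memory_pressure and the inactive_target are tuneables, nothing
here is final):

/* illustrative sketch; run from a once-per-second timer */
static int memory_pressure;	/* bumped on every page steal and allocation */
static int inactive_target;

static void update_inactive_target(void)
{
	/* exponential decay: old bursts fade away over tens of seconds */
	memory_pressure -= memory_pressure / 64;

	/*
	 * Aim to keep roughly one to two seconds' worth of page steals
	 * on the inactive lists, so a burst gets absorbed by inactive
	 * pages instead of turning into a hard knee.
	 */
	inactive_target = memory_pressure;
}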

> can a soft-knee swapping algorithm be demonstrated that doesn't
> impact the performance of applications running on a system that
> hasn't exhausted its memory?

The algorithm we're using (a dynamic inactive target, with
aggressive attempts to meet that target) will eat disk
bandwidth in the case of one application filling memory
really fast but not swapping. Since the data is kept in
memory, though, it shouldn't be a very big performance
penalty in most cases.


About NUMA scalability: we'll have different memory pools
per NUMA node. So if you have a 32-node, 64GB NUMA machine,
it'll partly function like 32 independent 2GB machines.

We'll have to find a solution for the pagecache_lock (how do
we make this more scalable?), but the pagecache_lru_lock, the
memory queues/lists and kswapd will be per _node_.
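
As a very rough sketch of the direction (the struct and the field
names here are invented for illustration, and in reality this would
hang off the pgdat; it is not the final layout):

/* illustrative sketch: per-node VM bookkeeping */
struct per_node_vm {
	spinlock_t		lru_lock;	/* the pagecache_lru_lock, per node */
	struct list_head	active_list;
	struct list_head	inactive_dirty_list;
	struct list_head	inactive_clean_list;
	int			memory_pressure;
	int			inactive_target;
	struct task_struct	*kswapd;	/* one kswapd per node */
};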

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/



From: Gerrit.Huize...@us.ibm.com
Subject: Re: RFC: design for new VM
Date: 2000/08/08
Message-ID: <200008080048.RAA13326@eng2.sequent.com>
X-Deja-AN: 655647982
Sender: owner-linux-ker...@vger.rutgers.edu
References: <87256934.0078DADB.00@d53mta03h.boulder.ibm.com>
Reply-To: Gerrit.Huize...@us.ibm.com
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu


> On Mon, 7 Aug 2000, Rik van Riel wrote:
> The idea is that the memory_pressure variable indicates how
> much page stealing is going on (on average) so every time
> kswapd wakes up it knows how many pages to steal. That way
> it should (if we're "lucky") free enough pages to get us
> along until the next time kswapd wakes up.
 
 Seems like you could signal kswapd when either the page fault
 rate increases or the rate of (memory allocations / memory
 frees) hits a (tuneable?) ratio (I hate relying on luck, simply
 because so much luck is bad ;-)

> About NUMA scalability: we'll have different memory pools
> per NUMA node. So if you have a 32-node, 64GB NUMA machine,
> it'll partly function like 32 independant 2GB machines.
 
 One lesson we learned early on is that anything you can
 possibly do on a per-CPU basis helps both SMP and NUMA
 activity.  This includes memory management, scheduling,
 TCP performance counters, any kind of system counters, etc.
 Once you have the basic SMP hierarchy in place, adding a NUMA
 hierarchy (or more than one for architectures that need it)
 is much easier.

 Also, is there a kswapd per pool?  Or does one kswapd oversee
 all of the pools (in the NUMA world, that is)?

gerrit


From: Gerrit.Huize...@us.ibm.com
Subject: Re: RFC: design for new VM
Date: 2000/08/08
Message-ID: <200008080036.RAA03032@eng2.sequent.com>
X-Deja-AN: 655658668
Sender: owner-linux-ker...@vger.rutgers.edu
References: <87256934.0072FA16.00@d53mta04h.boulder.ibm.com>
Reply-To: Gerrit.Huize...@us.ibm.com
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu


Hi Chuck,

> 1.  kswapd runs in the background and wakes up every so often to handle
> the corner cases that smooth bursty memory request workloads.  it executes
> the same code that is invoked from the kernel's memory allocator to
> reclaim pages.
 
 yep...  We do the same, although primarily through RSS management and our
 pageout daemon (separate from swapout).

 One possible difference - dirty pages are scheduled for asynchronous
 flush to disk and then moved to the end of the free list after IO
 is complete.  If the process faults on that page, either before it is
 paged out or afterwards, it can be "reclaimed" either from the dirty
 list or the free list, without re-reading from disk.  The pageout daemon
 runs when the dirty list reaches a tuneable size and shrinks it back
 to a (lower) tuneable size, moving all written pages to the free list.
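
 In made-up C (purely to illustrate; this is not our actual code and
 all the names are invented), the "reclaim without re-reading from
 disk" part of the fault path looks something like:

/* illustrative sketch; names invented */
enum page_home { IN_CORE, ON_DIRTY_LIST, ON_FREE_LIST, REUSED };

struct vpage {
	enum page_home	home;
	void		*data;	/* frame contents stay valid until the frame is reused */
};

/* fault on a page the pageout daemon already pushed out */
static void *fault_in(struct vpage *p)
{
	switch (p->home) {
	case ON_DIRTY_LIST:	/* async write still queued or in flight */
	case ON_FREE_LIST:	/* written out, but frame not yet reused */
		p->home = IN_CORE;	/* cheap reclaim: just unlink from the list */
		return p->data;
	case REUSED:		/* too late; must page in from disk */
		p->data = 0;	/* placeholder for the real pagein */
		p->home = IN_CORE;
		return p->data;
	case IN_CORE:
	default:
		return p->data;
	}
}

 The pageout daemon itself is then just a high-water/low-water loop
 over the dirty list, as described above.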

 In many ways, similar to what Rik is proposing, although I don't see any
 "fast reclaim" capability.  Also, the method by which pages are aged
 is quite different (global phys memory scan vs. processes maintaining
 their own LRU set).  Having a list of prime candidates to flush makes
 the kswapd/pageout overhead lower than using a global clock hand, but
 the global clock hand *may* perform better global optimisation
 of page aging.

> 2.  i agree with you that when the system exhausts memory, it hits a hard
> knee; it would be better to soften this.  however, the VM system is
> designed to optimize the case where the system has enough memory.  in
> other words, it is designed to avoid unnecessary work when there is no
> need to reclaim memory.  this design was optimized for a desktop workload,
> like the scheduler or ext2 "async" mode.  if i can paraphrase other
> comments i've heard on these lists, it epitomizes a basic design
> philosophy: "to optimize the common case gains the most performance
> advantage."
 
 This works fine until I have a stable load on my system and then
 start {Netscape, StarOffice, VMware, etc.}, which causes IO for
 demand paging of the executables, as well as paging/swapping activity
 to make room for the piggish footprints of these bigger applications.

 This is where it might help to pre-write dirty pages when the system
 is more idle, without fully returning those pages to the free list.

> can a soft-knee swapping algorithm be demonstrated that doesn't impact the
> performance of applications running on a system that hasn't exhausted its
> memory?
> 
>      - Chuck Lever

 Our VM doesn't exhibit a strong knee, but its method of avoiding that
 is again the flexible RSS management.  Inactive processes tend to shrink
 to their working footprint, while larger processes tend to grow to expand
 their footprint but still self-manage within the limits of available
 memory.  I think it is possible to soften the knee on a per-workload
 basis, and that's probably a spot for some tuneables: e.g. when to
 flush dirty old pages and how many to flush.  I think Rik has already
 talked about having those tuneables.

 Despite the fact that our systems have been primarily deployed for
 a single workload type (databases), we still have found that (the
 right!) VM tuneables can have an enormous impact on performance. I
 think the same will be much more true of an OS like Linux which tries
 to be many things to all people.

gerrit


From: r...@conectiva.com.br (Rik van Riel)
Subject: Re: RFC: design for new VM
Date: 2000/08/08
Message-ID: <Pine.LNX.4.21.0008081216090.5200-100000@duckman.distro.conectiva>
X-Deja-AN: 655876173
Sender: owner-linux-ker...@vger.rutgers.edu
References: <200008080048.RAA13326@eng2.sequent.com>
X-Sender: r...@duckman.distro.conectiva
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

On Mon, 7 Aug 2000 Gerrit.Huize...@us.ibm.com wrote:
> > On Mon, 7 Aug 2000, Rik van Riel wrote:
> > The idea is that the memory_pressure variable indicates how
> > much page stealing is going on (on average) so every time
> > kswapd wakes up it knows how many pages to steal. That way
> > it should (if we're "lucky") free enough pages to get us
> > along until the next time kswapd wakes up.
>  
>  Seems like you could signal kswapd when either the page fault
>  rate increases or the rate of (memory allocations / memory
>  frees) hits a (tuneable?) ratio

We will. Each page steal and each allocation will increase
the memory_pressure variable, and because of that, also the
inactive_target.

Whenever either 
- one zone gets low on free memory *OR* 
- all zones get more or less low on free+inactive_clean pages *OR*
- we get low on inactive pages (inactive_shortage > inactive_target/2),
THEN kswapd gets woken up immediately.

We do this both from the page allocation code and from
__find_page_nolock (which gets hit every time we reclaim
an inactive page back for its original purpose).
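
In other words the check itself is just a cheap test (sketch only;
these names are made up, and the real tests live in the allocator
and in __find_page_nolock):

/* illustrative sketch, not the actual code */
static int should_wake_kswapd(int one_zone_low_on_free,
			      int all_zones_low_on_free_and_clean,
			      int inactive_shortage,
			      int inactive_target)
{
	return one_zone_low_on_free ||
	       all_zones_low_on_free_and_clean ||
	       inactive_shortage > inactive_target / 2;
}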

> > About NUMA scalability: we'll have different memory pools
> > per NUMA node. So if you have a 32-node, 64GB NUMA machine,
> > it'll partly function like 32 independant 2GB machines.
>  
>  One lesson we learned early on is that anything you can
>  possibly do on a per-CPU basis helps both SMP and NUMA
>  activity.  This includes memory management, scheduling,
>  TCP performance counters, any kind of system counters, etc.
>  Once you have the basic SMP hierarchy in place, adding a NUMA
>  hierarchy (or more than one for architectures that need it)
>  is much easier.
> 
>  Also, is there a kswapd per pool?  Or does one kswapd oversee
>  all of the pools (in the NUMA world, that is)?

Currently we have none of this, but once 2.5 is forked
off, I'll submit a patch which shuffles all variables
into per-node (per pgdat) structures.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/

