[RFC] 2.3/4 VM queues idea

Hi,

I've been talking with some people lately and have come up with
the following plan for short-term changes to the Linux VM subsystem.

The goals of this design:
- robustness: currently, small changes to the design usually end up
  breaking VM performance for everybody, and no particular version
  of the VM subsystem seems to work right for everyone ... a design
  that is more robust to both code changes and wildly variable
  workloads would be better
- somewhat better page aging
- making sure we always have a "buffer" around to do allocations
  from
- treat all pages in the system equally wrt. page aging and flushing
- keeping the changes to the code base simple, since we're already
  at the 2.4-pre stage!


	DESIGN IDEAS

- have four LRU queues instead of the current single queue
  - a cache/scavenge queue, which contains clean, unmapped pages
  - a laundry queue, which contains dirty, unmapped pages
  - an inactive queue, which contains both dirty and clean unmapped
    pages
  - an active queue, which contains active and/or mapped pages
- keep a decaying average of the number of allocations per second
  (see the code sketch after this list)
- try to keep about one second's worth of allocations in the
  inactive queue (we do 100 allocations/second -> keep at least
  100 inactive pages); we do this in order to:
  - get some aging in that queue (one second before a page can
    be reclaimed)
  - have enough old pages around to free
- keep zone->pages_high of free pages + cache pages around,
  with at least pages_min of really free pages for atomic
  allocations   // FIXME: buddy fragmentation and defragmentation
- pages_min can be a lot lower than what it is now, since we only
  need to use pages from the free list for atomic allocations
- non-atomic allocations take a free page if we have a lot of free
  pages; otherwise they take a page from the cache queue
- when the number of free+cache pages gets too low:
  - scan the inactive queue
    - put clean pages on the cache list
    - put dirty pages on the laundry list
    - stop when we have enough cache pages
  - the page cleaner will clean the dirty pages asynchronously
    and put them on the cache list when they are clean
    - stop when we have no more dirty pages
    - if we have dirty pages, sync them to disk,
      periodically scanning the list to see if
      pages are clean now

(hmm, the page cleaning thing doesn't sound completely right ...
what should I change here?)
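
To make the decaying-average idea concrete, here is a minimal
user-space sketch; every name in it (update_alloc_rate,
inactive_target, the 7/8 decay factor) is made up for illustration,
not actual kernel code:

	#include <stdio.h>

	static unsigned long alloc_rate;	/* decaying allocations/second */

	/* called once per second, e.g. from kswapd */
	static void update_alloc_rate(unsigned long allocs_this_second)
	{
		/* exponential decay: 7/8 old average + 1/8 new sample */
		alloc_rate = (alloc_rate * 7 + allocs_this_second) / 8;
	}

	/* target size of the inactive queue: about one second's worth
	 * of allocations, so a page gets ~1 second to be referenced
	 * again before it can be reclaimed */
	static unsigned long inactive_target(void)
	{
		return alloc_rate;
	}

	int main(void)
	{
		unsigned long samples[] = { 100, 120, 80, 300, 100 };
		for (int i = 0; i < 5; i++) {
			update_alloc_rate(samples[i]);
			printf("rate %lu -> target %lu pages\n",
			       alloc_rate, inactive_target());
		}
		return 0;
	}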


	CODE CHANGES

- try_to_swap_out() will no longer queue a swapout, but allocate
  the swap entry and mark the page dirty
- shrink_mmap() will be split into multiple functions
  - reclaim_cache() to reclaim pages from the cache list
  - kflushd (???) could get the task of laundering pages
  - reclaim_inactive() to move inactive pages to the cache
    and laundry lists
  - refill_inactive(), which scans the active list to refill
    the inactive list and calls swap_out() if needed
  - kswapd will refill the free list by freeing pages it
    gets using reclaim_cache()
- __alloc_pages() will call reclaim_cache() to fulfill non-atomic
  allocations, and do rmqueue() if:
  - we're dealing with an atomic allocation, or
  - we have "too many" free pages
  (a sketch of this decision follows the list)
- if an inactive, laundry or cache page is faulted back into a
  process, we reactivate the page: move it to the active list,
  adjust the statistics and wake up kswapd if needed
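
The allocation-path decision above might look something like the
following sketch (the struct and helpers are stand-ins, not the
real allocator interfaces):

	#include <stddef.h>

	struct page;			/* opaque in this sketch */

	struct zone_stub {
		unsigned long free_pages;
		unsigned long pages_min;	/* atomic-only reserve */
		unsigned long pages_high;	/* "plenty free" watermark */
	};

	/* stand-ins for the real allocator internals */
	static struct page *rmqueue_stub(struct zone_stub *z)
	{
		z->free_pages--;		/* take a really free page */
		return NULL;
	}

	static struct page *reclaim_cache_stub(struct zone_stub *z)
	{
		(void)z;			/* recycle a clean cache page */
		return NULL;
	}

	static struct page *alloc_page_sketch(struct zone_stub *z, int atomic)
	{
		/* atomic allocations may not sleep, so they must come
		 * out of the really-free pool guarded by pages_min;
		 * when free pages are plentiful everyone may use them */
		if (atomic || z->free_pages > z->pages_high)
			return rmqueue_stub(z);

		/* non-atomic and free pages scarce: take a clean,
		 * unmapped page from the cache queue instead */
		return reclaim_cache_stub(z);
	}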

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/




Re: [RFC] 2.3/4 VM queues idea

     All right!  I think your spec is coming together nicely!  The
     multi-queue approach is the right way to go (for the same reason
     FBsd took that approach).  The most important aspect of using
     a multi-queue design is to *not* blow off the page weighting tests
     within each queue.  Having N queues alone is not fine enough
     granularity; having N queues and locating the lowest-weighted
     pages (in FreeBSD's case, weight 0) within a queue is the magic
     that makes it work well.

     I actually tried to blow off the weighting tests in FreeBSD, even just
     a little, but when I did, FreeBSD immediately started to stall as the
     load increased.  Needless to say, I threw away that patchset :-).
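
     To illustrate the weighting idea with a hedged sketch (names are
     invented; in FreeBSD the act_count field plays the weight role):
     each page carries a small counter, the scan boosts recently
     referenced pages, decays idle ones, and only reclaims pages
     whose weight has reached 0 -- rather than blindly taking
     whatever sits at the queue's tail:

	#include <stdbool.h>
	#include <stddef.h>

	struct page { int weight; bool referenced; struct page *next; };

	static void reclaim(struct page *p) { (void)p; /* free it */ }

	static void scan_queue(struct page *head)
	{
		struct page *next;

		for (struct page *p = head; p; p = next) {
			next = p->next;
			if (p->referenced) {
				p->weight += 3;	/* recently used: boost */
				p->referenced = false;
			} else if (p->weight > 0) {
				p->weight--;	/* idle: decay */
			} else {
				reclaim(p);	/* weight 0: evict */
			}
		}
	}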


     I have three comments:

     * On the laundry list.  In FreeBSD 3.x we laundered pages as we went
       through the inactive queue.  In 4.x I changed this to a two-pass
       algorithm (vm_pageout_scan(), line 674 of vm/vm_pageout.c, around
       the rescan0: label).  It tries to locate clean inactive pages in
       pass 1, and if there is still a page shortage (line 927 of
       vm/vm_pageout.c, the launder_loop conditional) we go back up and
       try again, this time laundering pages.

       There is also a heuristic prior to the first loop, around line 650
       ('Figure out what to do with dirty pages...'), where it tries to
       figure out whether it is worth doing two passes or whether it should
       just start laundering pages immediately.
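
       The control flow, as a compilable sketch (the page struct and
       helper names are invented; only the two-pass shape follows the
       real code):

	#include <stdbool.h>
	#include <stddef.h>

	struct page { bool dirty; struct page *next; };

	static int reclaimed, needed = 32;	/* arbitrary shortage */

	static void reclaim(struct page *p)       { (void)p; reclaimed++; }
	static void start_pageout(struct page *p) { (void)p; /* async write */ }

	static void scan_inactive(struct page *head, bool launder)
	{
	rescan:
		for (struct page *p = head; p; p = p->next) {
			if (!p->dirty)
				reclaim(p);		/* clean: free now */
			else if (launder)
				start_pageout(p);	/* dirty: queue I/O */
			if (reclaimed >= needed)
				return;			/* shortage resolved */
		}
		if (!launder) {
			/* pass 1 found too few clean pages: go around
			 * again, laundering this time (launder_loop) */
			launder = true;
			goto rescan;
		}
	}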

     * On page aging.  This is going to be the most difficult item for you
       to implement under Linux.  In FreeBSD the PV entry mmu tracking
       structures make it fairly easy to scan *physical* pages and then
       check whether they've been used or not by locating all the pte's
       mapping them, via the PV structures.

       In Linux this is harder to do, but I still believe it is the right
       way to do it - that is, have the main page scan loop scan physical
       pages rather than virtual pages, for reasons I've outlined in
       previous emails (fairness in the weighting calculation).

       (I am *not* advocating a PV tracking structure for Linux.  I really
       hate the PV stuff in FBsd.)

     * On write clustering.  In a completely fair aging design, the pages
       you extract for laundering will tend to appear 'random'.
       Flushing them to disk can be expensive due to seeking.

       Two things can be done.  First, you collect a bunch of pages to be
       laundered before issuing the I/O, allowing you to sort the I/O
       (this is what you suggest in your design ideas email).  (P.S.:
       don't launder more than 64 or so pages at a time; doing so will just
       stall other processes trying to do normal I/O.)

       Second, you can locate other pages nearby the ones you've decided to
       launder and launder them as well, getting the most out of the disk
       seeking you have to do anyway.

       The first item is important.  The second item will help extend the
       life of the system in a heavy-load environment by being able to
       sustain a higher pageout rate.

       In tests with FBsd, the nearby-write-clustering doubled the pageout
       rate capability under high disk load situations.  This is one of the
       main reasons why we do 'the weird two-level page scan' stuff.
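
       A tiny sketch of the first item -- batch, cap at 64, sort by
       disk location before issuing the writes (struct and names are
       invented for illustration):

	#include <stdlib.h>

	#define LAUNDER_BATCH 64	/* don't exceed ~64 pages */

	struct pageout { unsigned long disk_block; /* ... */ };

	static int by_block(const void *a, const void *b)
	{
		const struct pageout *x = a, *y = b;
		return (x->disk_block > y->disk_block) -
		       (x->disk_block < y->disk_block);
	}

	/* collect up to LAUNDER_BATCH dirty pages, then sort the
	 * batch by disk location so the writes become mostly
	 * sequential instead of random seeks */
	static void launder_batch(struct pageout *batch, int n)
	{
		if (n > LAUNDER_BATCH)
			n = LAUNDER_BATCH;	/* avoid stalling other I/O */
		qsort(batch, n, sizeof(*batch), by_block);
		/* ... issue the n writes asynchronously ... */
	}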

       (ok to reprint this email too!)

						-Matt


Re: [RFC] 2.3/4 VM queues idea

On Wed, 24 May 2000, Matthew Dillon wrote:

>      All right!  I think your spec is coming together nicely!  The
>      multi-queue approach is the right way to go (for the same reason
>      FBsd took that approach).  The most important aspect of using
>      a multi-queue design is to *not* blow off the page weighting tests
>      within each queue.  Having N queues alone is not fine enough
>      granularity; having N queues and locating the lowest-weighted
>      pages (in FreeBSD's case, weight 0) within a queue is the magic
>      that makes it work well.
> 
>      I actually tried to blow off the weighting tests in FreeBSD, even just
>      a little, but when I did, FreeBSD immediately started to stall as the
>      load increased.  Needless to say, I threw away that patchset :-).

OK, I'll look at implementing this in Linux as well. Maybe it
won't work due to the virtual page scanning, but I'll look into
it and try a few things. This change should be relatively easy
to make.

>      I have three comments:
> 
>      * On the laundry list.  In FreeBSD 3.x we laundered pages as we went
>        through the inactive queue.  In 4.x I changed this to a two-pass
>        algorithm (vm_pageout_scan(), line 674 of vm/vm_pageout.c, around
>        the rescan0: label).  It tries to locate clean inactive pages in
>        pass 1, and if there is still a page shortage (line 927 of
>        vm/vm_pageout.c, the launder_loop conditional) we go back up and
>        try again, this time laundering pages.
> 
>        There is also a heuristic prior to the first loop, around line 650
>        ('Figure out what to do with dirty pages...'), where it tries to
>        figure out whether it is worth doing two passes or whether it should
>        just start laundering pages immediately.

Another good idea to implement. I don't know to what extent it'll
interfere with the "age one second" idea though...
(maybe we want to make the inactive list "4 seconds big" -- at 100
allocations/second that would be about 400 pages -- so we can
implement this FreeBSD idea and still keep our second of aging?)

>      * On page aging.  This is going to be the most difficult item for you
>        to implement under Linux.  In FreeBSD the PV entry mmu tracking
>        structures make it fairly easy to scan *physical* pages and then
>        check whether they've been used or not by locating all the pte's
>        mapping them, via the PV structures.
> 
>        In Linux this is harder to do, but I still believe it is the right
>        way to do it - that is, have the main page scan loop scan physical
>        pages rather than virtual pages, for reasons I've outlined in
>        previous emails (fairness in the weighting calculation).
> 
>        (I am *not* advocating a PV tracking structure for Linux.  I really
>        hate the PV stuff in FBsd.)

This is something for the 2.5 kernel. The changes involved in
doing this are just too invasive right now...

>      * On write clustering.  In a completely fair aging design, the pages
>        you extract for laundering will tend to appear 'random'.
>        Flushing them to disk can be expensive due to seeking.
> 
>        Two things can be done.  First, you collect a bunch of pages to be
>        laundered before issuing the I/O, allowing you to sort the I/O
>        (this is what you suggest in your design ideas email).  (P.S.:
>        don't launder more than 64 or so pages at a time; doing so will just
>        stall other processes trying to do normal I/O.)
> 
>        Second, you can locate other pages nearby the ones you've decided to
>        launder and launder them as well, getting the most out of the disk
>        seeking you have to do anyway.

Virtual page scanning should provide us with some of these
benefits. Also, we'll allocate the swap entry at unmapping
time and can make sure to unmap virtually close pages at
the same time so they'll end up close to each other in the
inactive queue.

This isn't going to be as good as it could be, but it's
probably as good as it can get without getting more invasive
with our changes to the source tree...
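
A sketch of what that unmapping-time clustering might look like
(PAGE_SIZE aside, every name here is made up; swap_alloc_contig()
stands in for whatever hands out consecutive swap slots):

	#include <stddef.h>

	#define PAGE_SIZE    4096UL
	#define SWAP_CLUSTER 8		/* pages unmapped per run */

	/* stand-in: hand out n consecutive swap slots */
	static unsigned long next_slot;
	static unsigned long swap_alloc_contig(int n)
	{
		unsigned long s = next_slot;
		next_slot += n;
		return s;
	}

	static void assign_swap(unsigned long vaddr, unsigned long slot)
	{
		(void)vaddr; (void)slot;
		/* point the pte at the swap entry, mark the page
		 * dirty, move it to the tail of the inactive queue */
	}

	/* unmap a run of virtually adjacent pages in one go, so they
	 * get consecutive swap slots and stay next to each other in
	 * the inactive queue */
	static void unmap_cluster(unsigned long vaddr_base)
	{
		unsigned long slot = swap_alloc_contig(SWAP_CLUSTER);

		for (int i = 0; i < SWAP_CLUSTER; i++)
			assign_swap(vaddr_base + i * PAGE_SIZE, slot + i);
	}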

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/




Re: [RFC] 2.3/4 VM queues idea

    Virtual page scanning will help with clustering, but unless you
    already have a good page candidate to base your virtual scan on,
    you will not be able to *find* a good page candidate to base the
    clustering around.  Or at least not find one easily.  Virtual
    page scanning has severe scalability problems compared to physical
    page scanning.  For example, what happens when you have an Oracle
    database running with a hundred independent (non-threaded)
    processes mapping 300MB+ of shared memory?

    On the swap allocation -- I think there are several approaches to this
    problem, all equally viable.  If you do not do object clustering for
    pageouts then allocating the swap at unmap time is viable -- due to
    the time delay between unmap and the actual I/O & page-selection
    for cleaning, your pageouts will be slower but you *will* get locality
    of reference on your pageins (pageins will be faster).

    If you do object clustering then you get the best of both worlds.
    FreeBSD delays swap allocation until it actually decides to swap
    something, which means that it can take a collection of unrelated
    pages to be cleaned and assign contiguous swap to them.  This results
    in a very fast, deterministic pageout capability but, without clustering,
    there will be no locality of reference for pageins.  So pageins would
    be slow.

    With clustering, however, the at-swap-time allocation tends to have more
    locality of reference due to there being additional nearby pages 
    selected from the objects in the mix.  It still does not approach the
    performance you can get from an object-oriented swap allocation scheme,
    but at least it would no longer be considered 'slow'.
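
    The late-binding variant, as a sketch (invented names again; the
    point is only that swap is assigned when the batch is written, so
    the pageout itself is always one contiguous run):

	struct page;			/* opaque in this sketch */

	/* stand-ins for swap slot allocation and the actual write */
	static unsigned long next_slot;
	static unsigned long swap_alloc_contig(int n)
	{
		unsigned long s = next_slot;
		next_slot += n;
		return s;
	}

	static void write_page_to_swap(struct page *p, unsigned long slot)
	{
		(void)p; (void)slot;	/* queue the asynchronous write */
	}

	/* assign contiguous swap to a batch of (possibly unrelated)
	 * dirty pages at the moment they are paged out: fast,
	 * deterministic pageouts; pagein locality depends on how
	 * the batch was clustered */
	static void pageout_batch(struct page **pages, int n)
	{
		unsigned long slot = swap_alloc_contig(n);

		for (int i = 0; i < n; i++)
			write_page_to_swap(pages[i], slot + i);
	}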

    So it can be a toss-up.  I don't think *anyone* (linux, freebsd, solaris,
    or anyone else) has yet written the definitive swap allocation algorithm!

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

Re: [RFC] 2.3/4 VM queues idea

On Wed, 24 May 2000, Matthew Dillon wrote:

> :>        Two things can be done.  First, you collect a bunch of pages to be
> :>        laundered before issuing the I/O, allowing you to sort the I/O
> :>        (this is what you suggest in your design ideas email).  (P.S.:
> :>        don't launder more than 64 or so pages at a time; doing so will just
> :>        stall other processes trying to do normal I/O.)
> :> 
> :>        Second, you can locate other pages nearby the ones you've decided to
> :>        launder and launder them as well, getting the most out of the disk
> :>        seeking you have to do anyway.
> :
> :Virtual page scanning should provide us with some of these
> :benefits. Also, we'll allocate the swap entry at unmapping
> :time and can make sure to unmap virtually close pages at
> :the same time so they'll end up close to each other in the
> :inactive queue.
> :
> :This isn't going to be as good as it could be, but it's
> :probably as good as it can get without getting more invasive
> :with our changes to the source tree...
> 
>     Virtual page scanning will help with clustering, but unless you
>     already have a good page candidate to base your virtual scan on,
>     you will not be able to *find* a good page candidate to base the
>     clustering around.  Or at least not find one easily.  Virtual
>     page scanning has severe scalability problems compared to physical
>     page scanning.  For example, what happens when you have an Oracle
>     database running with a hundred independent (non-threaded)
>     processes mapping 300MB+ of shared memory?

Ohhh, definitely. It's just that coding up the administrative changes
required to support this would be too big a change for Linux 2.4...

>     So it can be a toss-up.  I don't think *anyone* (linux, freebsd, solaris,
>     or anyone else) has yet written the definitive swap allocation algorithm!

We still have some time: there's little chance of implementing it in
Linux before kernel version 2.5, so we can use the time until then to
design the "definitive" algorithm.

For now I'll be focusing on getting something decent into kernel 2.4;
we really need it to be better than 2.2. Keeping the virtual
scanning but combining it with a multi-queue system for the unmapped
pages (with all mapped pages residing in the active queue) should
at least provide us with a predictable, robust and moderately good
VM subsystem for the next stable kernel series.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/
