* A proposal for making ext4's journal more SMR (and flash) friendly
@ 2014-01-08  5:31 Theodore Ts'o
  2014-01-08 11:43 ` Lukáš Czerner
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Theodore Ts'o @ 2014-01-08  5:31 UTC
  To: linux-ext4


This is something I've discussed on our weekly conference calls, but I
think it's time that I try to get it written down.

                     SMR-Friendly Journal for Ext4
                              Version 0.10
                            January 8, 2014

Goal
====

The goal is to make the write patterns used by the ext4 journal and its
metadata more friendly for hard drives using Shingled Magnetic Recording
(SMR) by significantly reducing random writes seen by the SMR drive.  It
is primarily targeting drives which provide either Drive-Managed
or Cooperatively Managed SMR.

By removing the need for random writes, this proposal can also improve
the performance of ext4 on flash storage devices that have a
simplistic Flash Translation Layer (FTL), such as those found on SD
and eMMC devices.

Non-Goals
---------

This proposal does not address how data blocks are allocated.

Nor does it address files which are modified after they are first
created (i.e., a random read/write workload); we assume here that for
many use cases, files which are modified with a random write pattern
after they are first created are rarer than files which are written
once and then not modified until they are replaced or deleted.

Background
==========

Shingled Magnetic Recording
--------------------------

Drives using SMR technology (sometimes called shingled drives) are
broken up into zones or bands, which will typically be 32-256 MB in
size[1].  Each band has a write pointer, and it is possible to write to
each band by appending to it, but once written, it cannot be rewritten,
except by resetting the write pointer to the beginning of the band and
erasing the contents of the entire band.

[1] Storage systems for Shingled Disks, Garth Gibson, SDC 2012
presentation.
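
To make the append-only constraint concrete, here is a minimal
user-space sketch of the zone model (all names and sizes here are
illustrative, not from any vendor specification):

    #include <stdint.h>
    #include <string.h>

    #define ZONE_SIZE  (256u * 1024 * 1024)   /* zones are 32-256MB */
    #define BLOCK_SIZE 4096u

    struct smr_zone {
            char     *media;       /* backing store for this zone */
            uint32_t  write_ptr;   /* next writable block offset */
    };

    /* Appending at the write pointer is the only legal write. */
    static int zone_append(struct smr_zone *z, const void *buf)
    {
            if (z->write_ptr >= ZONE_SIZE / BLOCK_SIZE)
                    return -1;     /* zone full: reset before reuse */
            memcpy(z->media + z->write_ptr * BLOCK_SIZE, buf, BLOCK_SIZE);
            z->write_ptr++;
            return 0;
    }

    /* Rewriting anything requires discarding the whole zone first. */
    static void zone_reset(struct smr_zone *z)
    {
            z->write_ptr = 0;
    }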

For more details about why drive vendors are moving to SMR, and details
regarding the different access models that have been proposed for SMR
drives,
please see [2].

[2] Shingled Magnetic Recording: Areal Density Increase Requires New
Data Management, by Tim Feldman and Garth Gibson, ;login:, June 2013.
Vol 38, No. 3., pg 22.

The Ext4 Journal
----------------

The ext4 file system uses a physical block journal.  This means when a
metadata block is modified, the entire metadata block is written to the
journal before the transaction is committed.  Before the transaction is
committed, the block may not be written to the final location on disk.
Once the commit block is written, then dirty metadata blocks may get
written back to disk by Linux's buffer cache, which manages the
writeback of dirty buffers.

The journal is treated as a circular buffer, with modified metadata
blocks and commit blocks appended to the end of the circular buffer.
When all of the blocks associated with the commit at the head of the
journal have been written back to disk, the commit can be retired, and
the journal superblock can be updated to move the pointer to the head
of the journal to the first commit that still has dirty buffers
associated with it which are pending writeback.  (The process of
retiring the oldest commits is called "checkpointing" in the ext4
journal implementation.)
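
A rough user-space sketch of this circular-buffer bookkeeping (the
field names are invented; the real jbd2 state is more involved):

    #include <stdbool.h>
    #include <stdint.h>

    struct journal_state {
            uint32_t head;   /* oldest commit with dirty buffers pending */
            uint32_t tail;   /* where the next commit will be appended */
            uint32_t size;   /* journal length in blocks */
    };

    /* Blocks available for new commits in the circular log. */
    static uint32_t journal_free(const struct journal_state *j)
    {
            return (j->head + j->size - j->tail - 1) % j->size;
    }

    /* Checkpointing: once every block in the oldest commit has been
     * written back to its final location, advance the head past it. */
    static void journal_retire(struct journal_state *j, uint32_t commit_len,
                               bool all_written_back)
    {
            if (all_written_back)
                    j->head = (j->head + commit_len) % j->size;
    }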

To recover from a system crash, the kernel or the file system
consistency check program starts from the beginning of the journal,
writing blocks found in the journal to their appropriate location on
disk.
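
In sketch form, the replay is just a forward scan (write_block() here
stands in for the real block-device write; replaying commits in order
is safe because a later copy of a block simply overwrites an earlier
one):

    #include <stdint.h>

    struct logged_block {
            uint64_t    final_blocknr;   /* the block's home in the fs */
            const void *data;            /* copy stored in the journal */
    };

    void write_block(uint64_t blocknr, const void *data);  /* stand-in */

    static void journal_replay(const struct logged_block *log, int nr)
    {
            for (int i = 0; i < nr; i++)
                    write_block(log[i].final_blocknr, log[i].data);
    }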

For more information about the ext4 journal, please see [3].

[3]  "Journaling the Linux ext2fs Filesystem," by Stephen Tweedie, in
the Proceedings of Linux Expo '98.

Design
======

The key insight in making ext4's metadata updates more friendly is
that the writes to the journal are ideal from the perspective of a
shingled disk --- or of a flash device with a simplistic FTL, such as
those found on many eMMC devices in mobile handsets.  It is after the
journal commit, when the updates to the allocation bitmaps, the inode
table, and directory blocks are written back, that the random writes
occur that are less optimal from the perspective of a Flash
Translation Layer or the SMR drive's management layer.  So we apply
the Smith and Dale technique[4]:

        Patient: Doctor, it hurts when I do _this_.
        Doctor Kronkheit: Don't _do_ that.

[4] Doctor Kronkheit and His Only Living Patient, Joe Smith and
Charlie Dale, 1920's American vaudeville comedy team.


The simplest implementation of this design does not require making any
on-disk format changes.  We simply suppress the writeback of dirty
metadata blocks to the file system.  Instead we keep a journal map in
memory, which maps metadata block numbers (or data block numbers if data
journalling is enabled) to a block number in the journal.
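
A toy version of such a journal map (a sorted array with binary
search; an in-kernel implementation would more likely use an rbtree or
radix tree, and all names here are invented):

    #include <stddef.h>
    #include <stdint.h>

    struct jmap_entry {
            uint64_t fs_blocknr;        /* "real" location in the fs */
            uint64_t journal_blocknr;   /* latest copy in the journal */
    };

    struct jmap {
            struct jmap_entry *entries; /* sorted by fs_blocknr */
            size_t             nr;
    };

    static const struct jmap_entry *jmap_lookup(const struct jmap *m,
                                                uint64_t fs_blocknr)
    {
            size_t lo = 0, hi = m->nr;

            while (lo < hi) {
                    size_t mid = lo + (hi - lo) / 2;

                    if (m->entries[mid].fs_blocknr == fs_blocknr)
                            return &m->entries[mid];
                    if (m->entries[mid].fs_blocknr < fs_blocknr)
                            lo = mid + 1;
                    else
                            hi = mid;
            }
            return NULL;   /* never logged: read the real location */
    }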

The journal is not truncated when the file system is unmounted, and so
there is no difference between mounting a file system which has been
cleanly unmounted or after a system crash.  In both cases, the ext4 file
system will scan the journal, and create an in-memory data structure
which maps metadata block locations to their location in the journal.
When a metadata block (or a data block, if data journalling is enabled)
needs to be read, if the block number is found in the journal map, the
block is read from the journal instead of from its "real" location on
disk.
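
Continuing the jmap sketch above, the read path becomes a one-line
indirection (read_block() stands in for the real buffer-cache/bio
read):

    void read_block(uint64_t blocknr, void *buf);   /* stand-in */

    static void read_metadata(const struct jmap *m, uint64_t fs_blocknr,
                              void *buf)
    {
            const struct jmap_entry *e = jmap_lookup(m, fs_blocknr);

            if (e)
                    read_block(e->journal_blocknr, buf); /* journal copy */
            else
                    read_block(fs_blocknr, buf);         /* on-disk copy */
    }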

Eventually, we will run out of room in the journal, and so we will need
to retire commits from the head of the journal.  For each block
referenced in the commit at the head of the journal, if it has since
been updated in a newer commit, then no action will be needed.  For a
block that has not been updated in a newer commit, there are two
choices.   The checkpoint operation could either copy the block to the
tail of the journal, or write the block back to its final / "permanent"
location on disk.   The latter is preferable if it is unlikely that the
block will be needed again, or if space is needed in the journal for other
metadata blocks.   On the other hand, writing the block to the final
location on disk will entail a random write, which will be especially
expensive on SMR disks.  Some experimentation may be needed to determine
the best heuristics to use.
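
One possible shape for that heuristic, purely as a sketch (the
thresholds are invented and would need tuning against real hardware):

    enum cp_action { CP_COPY_TO_TAIL, CP_WRITE_BACK };

    static enum cp_action checkpoint_decide(unsigned recent_commits_touched,
                                            unsigned journal_free_pct)
    {
            /* Hot block and the log has room: keep it in the journal
             * so its eventual writeback is amortized further. */
            if (recent_commits_touched > 2 && journal_free_pct > 25)
                    return CP_COPY_TO_TAIL;
            /* Cold block, or the log is nearly full: pay the random
             * write now and free the journal space. */
            return CP_WRITE_BACK;
    }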


Avoiding Updating the Journal Superblock
----------------------------------------

The basic scheme described above does not require any format
changes.   However, while it eliminates most of the random writes
associated with the file system metadata, the journal superblock must be
updated each time the journal layer performs a "checkpoint" operation to
retire the oldest commits from the head of the journal, so that the
starting point of the journal can be identified.

This can be avoided by modifying the commit block to include the head of
the journal at the time of the commit, and then by requiring that the
first block of each zone must be a jbd2 control block.  Since each control
block contains the sequence number, the mount operation simply needs to
scan the first block in each zone to find the control block with the
highest commit ID, and then parse the journal until the last valid
commit block is found.  Once the tail of the journal has been
identified, the last commit block will contain a pointer to the head of
the journal.
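
The mount-time scan might look roughly like this (read_seq() stands in
for reading a zone's leading jbd2 control block and returning its
commit sequence number; sequence-number wraparound is ignored for
brevity):

    #include <stdint.h>

    #define BLOCKS_PER_ZONE 65536u   /* e.g. 256MB zones, 4k blocks */

    uint32_t read_seq(uint64_t journal_block);   /* stand-in */

    /* Find the zone whose leading control block carries the highest
     * sequence number; the journal tail lies within that zone. */
    static uint64_t find_newest_zone(uint64_t nr_zones)
    {
            uint64_t best = 0;
            uint32_t best_seq = 0;

            for (uint64_t z = 0; z < nr_zones; z++) {
                    uint32_t seq = read_seq(z * BLOCKS_PER_ZONE);

                    if (seq >= best_seq) {
                            best_seq = seq;
                            best = z;
                    }
            }
            return best;
    }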

Applicability to other storage technologies
===========================================

This design was originally intended to improve ext4's performance on
SMR devices.  However, it may also be helpful for flash-based devices,
since it reduces the write load caused by metadata blocks, and very
often a particular metadata block will be updated in multiple commits.
Even on a hard drive, the reduction in writes and seek traffic may be
worthwhile.

Although we will need to benchmark this new scheme, this modified
journalling scheme should be at least as efficient as the current
mechanism used in the ext4/jbd2 implementation.  If this is true, it
may make sense to make this the default.






* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08  5:31 A proposal for making ext4's journal more SMR (and flash) friendly Theodore Ts'o
@ 2014-01-08 11:43 ` Lukáš Czerner
  2014-01-08 15:20   ` Theodore Ts'o
  2014-01-08 22:14 ` Jan Kara
  2014-01-09  3:55 ` Andreas Dilger
  2 siblings, 1 reply; 11+ messages in thread
From: Lukáš Czerner @ 2014-01-08 11:43 UTC
  To: Theodore Ts'o; +Cc: linux-ext4

On Wed, 8 Jan 2014, Theodore Ts'o wrote:

> Date: Wed, 08 Jan 2014 00:31:05 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: linux-ext4@vger.kernel.org
> Subject: A proposal for making ext4's journal more SMR (and flash) friendly
> 
> 
> This is something I've discussed on our weekly conference calls, but I
> think it's time that I try to get it written down.

Hi Ted,

thanks a lot for sharing this. It really looks interesting and I
have a couple of questions/comments below.

> 
>                      SMR-Friendly Journal for Ext4
>                               Version 0.10
>                             January 8, 2014
> 

--snip--

> 
> Design
> ======
> 
> The key insight in making ext4's metadata updates more friendly is
> that the writes to the journal are ideal from the perspective of a
> shingled disk --- or of a flash device with a simplistic FTL, such as
> those found on many eMMC devices in mobile handsets.  It is after the
> journal commit, when the updates to the allocation bitmaps, the inode
> table, and directory blocks are written back, that the random writes
> occur that are less optimal from the perspective of a Flash
> Translation Layer or the SMR drive's management layer.  So we apply
> the Smith and Dale technique[4]:
> 
>         Patient: Doctor, it hurts when I do _this_.
>         Doctor Kronkheit: Don't _do_ that.
> 
> [4] Doctor Kronkheit and His Only Living Patient, Joe Smith and
> Charlie Dale, 1920's American vaudeville comedy team.
> 
> 
> The simplest implementation of this design does not require making any
> on-disk format changes.  We simply suppress the writeback of dirty
> metadata blocks to the file system.  Instead we keep a journal map in
> memory, which maps metadata block numbers (or data block numbers if data
> journalling is enabled) to a block number in the journal.

So it means that we would have to have a bigger journal which is
multiple zones (or bands) in size, right?  However I assume that
the optimal journal size in this case will be very much dependent
on the workload used - for example a small file workload or other
metadata-heavy workloads would need a bigger journal. Could we
possibly make the journal size variable?

> 
> The journal is not truncated when the file system is unmounted, and so
> there is no difference between mounting a file system which has been
> cleanly unmounted or after a system crash.

I would maybe argue that clean unmount might be the right time for a
checkpoint and for resetting the journal head back to the beginning,
because I do not see it as a performance sensitive operation. This
would in turn help us on the subsequent mount and run.

> In both cases, the ext4 file
> system will scan the journal, and create an in-memory data structure
> which maps metadata block locations to their location in the journal.
> When a metadata block (or a data block, if data journalling is enabled)
> needs to be read, if the block number is found in the journal map, the
> block is read from the journal instead of from its "real" location on
> disk.

While this helps a lot to avoid random writes, it could possibly
result in much higher seek rates, especially with bigger journals.
We're trying hard to keep data and associated metadata close
together and this would very much break that. This might be
especially bad with SMR devices because those are designed to be much
bigger in size. But of course this is a trade-off, which makes it
very important to have good benchmarks.

> 
> Eventually, we will run out of room in the journal, and so we will need
> to retire commits from the head of the journal.  For each block
> referenced in the commit at the head of the journal, if it has since
> been updated in a newer commit, then no action will be needed.

I assume that the information about the newest commits for
particular metadata blocks would be kept in memory?  Otherwise it
would be quite an expensive operation. But it seems unavoidable at
mount time, so it might really be better to clear the journal at
unmount when we should have all this information already in memory?

Overall this design seems like a good idea to me and I agree that
this should help not only on SMR devices but should be generally
useful if we can determine the best heuristics to balance
trade-offs.

Thanks!
-Lukas

> For a
> block that has not been updated in a newer commit, there are two
> choices.   The checkpoint operation could either copy the block to the
> tail of the journal, or write the block back to its final / "permanent"
> location on disk.   The latter is preferable if it is unlikely that the
> block will be needed again, or if space is needed in the journal for other
> metadata blocks.   On the other hand, writing the block to the final
> location on disk will entail a random write, which will be especially
> expensive on SMR disks.  Some experimentation may be needed to determine
> the best heuristics to use.
> 
> 
> Avoiding Updating the Journal Superblock
> ----------------------------------------
> 
> The basic scheme described above does not require any format
> changes.   However, while it eliminates most of the random writes
> associated with the file system metadata, the journal superblock must be
> updated each time the journal layer performs a "checkpoint" operation to
> retire the oldest commits from the head of the journal, so that the
> starting point of the journal can be identified.
> 
> This can be avoided by modifying the commit block to include the head of
> the journal at the time of the commit, and then by requiring that the
> first block of each zone must be a jbd2 control block.  Since each control
> block contains the sequence number, the mount operation simply needs to
> scan the first block in each zone to find the control block with the
> highest commit ID, and then parse the journal until the last valid
> commit block is found.  Once the tail of the journal has been
> identified, the last commit block will contain a pointer to the head of
> the journal.
> 
> Applicability to other storage technologies
> ===========================================
> 
> This design was originally intended to improve ext4's performance on
> SMR devices.  However, it may also be helpful for flash-based devices,
> since it reduces the write load caused by metadata blocks, and very
> often a particular metadata block will be updated in multiple commits.
> Even on a hard drive, the reduction in writes and seek traffic may be
> worthwhile.
> 
> Although we will need to benchmark this new scheme, this modified
> journalling scheme should be at least as efficient as the current
> mechanism used in the ext4/jbd2 implementation.  If this is true, it
> may make sense to make this the default.
> 
> 
> 
> 


* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08 11:43 ` Lukáš Czerner
@ 2014-01-08 15:20   ` Theodore Ts'o
  2014-01-08 15:45     ` Lukáš Czerner
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2014-01-08 15:20 UTC
  To: Lukáš Czerner; +Cc: linux-ext4

On Wed, Jan 08, 2014 at 12:43:35PM +0100, Lukáš Czerner wrote:
>So it means that we would have to have a bigger journal which is
>multiple zones (or bands) in size, right?  However I assume that
>the optimal journal size in this case will be very much dependent
>on the workload used - for example a small file workload or other
>metadata-heavy workloads would need a bigger journal. Could we
>possibly make the journal size variable?

The journal size is already variable, e.g., "mke2fs -J size=512M".
But yes, the optimal journal size will be highly variable.  

> > The journal is not truncated when the file system is unmounted, and so
> > there is no difference between mounting a file system which has been
> > cleanly unmounted or after a system crash.
> 
> I would maybe argue that clean unmount might be the right time for a
> checkpoint and for resetting the journal head back to the beginning,
> because I do not see it as a performance sensitive operation. This
> would in turn help us on the subsequent mount and run.

Yes, maybe.  It depends on how the SMR drive handles random
writes.  I suspect that most of the time, if the zones are closer to
256MB or 512MB rather than 32MB, the SMR drive is not going to
rewrite an entire zone just to handle a couple of random writes.  If
we are doing an unmount, and so we don't care about performance for
these random writes, and if there is a way for us to hint to the SMR
drive that no, really, it really should do a full zone rewrite, even
if we are only updating a dozen blocks out of the 256MB zone, then
sure, this might be a good thing to do.

But if the SMR drive takes these random metadata writes and writes
them to some staging area, then it might not improve performance after
we reboot and remount the file system --- indeed, depending on the
location and nature of the staging area, it might make things worse.

I think we will need to do some experiments, and perhaps get some
input from SMR drive vendors.  They probably won't be willing to
release detailed design information without our being under NDA, but
we can probably explain the design, and watch how their faces grin or
twitch or scowl.  :-)

BTW, even if we do have NDA information from one vendor, it might not
necessarily follow that other vendors use the same tradeoffs.  So even
if some of us have NDA'ed information from one or two vendors, I'm a
bit hesitant about hard-coding the design based on what they tell us.
Besides the risk that one of the vendors might do things differently,
there is also the concern that future versions of the drive might use
different schemes for managing the logical->physical translation
layer.  So we will probably want to keep our implementation and design
flexible.

> While this helps a lot to avoid random writes, it could possibly
> result in much higher seek rates, especially with bigger journals.
> We're trying hard to keep data and associated metadata close
> together and this would very much break that. This might be
> especially bad with SMR devices because those are designed to be much
> bigger in size. But of course this is a trade-off, which makes it
> very important to have good benchmarks.

While the file system is mounted, if the metadata block is being
referenced frequently, it will be kept in memory, so the fact that
it would have to seek to some random journal location if we need to
read that metadata block might not be a big deal.  (This is similar to
the argument used by log-structured file systems which claims that if
we have enough memory, the fact that the metadata is badly fragmented
doesn't matter.  Yes, if we are under heavy memory pressure, it might
not work out.)

> I assume that the information about the newest commits for
> particular metadata blocks would be kept in memory?  Otherwise it
> would be quite an expensive operation. But it seems unavoidable at
> mount time, so it might really be better to clear the journal at
> unmount when we should have all this information already in memory?

Information about all commits and what blocks are still associated
with them is already being kept in memory.  Currently this is being
done via a jh/bh; we'd want to do this differently, since we wouldn't
necessarily enforce that all blocks which are in the journal must be
in the buffer cache.  (Although if we did keep all blocks in the
journal in the buffer cache, it would address the issue you raised
above, at the expense of using a large amount of memory --- more
memory than we would be comfortable using, although I'd bet it is still
less memory than, say, ZFS requires. :-)

       	      	    	      	      	     	- Ted

P.S.  One other benefit of this design which I forgot to mention in
this version of the draft.  Using this scheme would also allow us to
implement true read-only mounts and file system checks, without
requiring that we modify the file system by replaying the journal
before proceeding with the mount or e2fsck run.


* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08 15:20   ` Theodore Ts'o
@ 2014-01-08 15:45     ` Lukáš Czerner
  0 siblings, 0 replies; 11+ messages in thread
From: Lukáš Czerner @ 2014-01-08 15:45 UTC
  To: Theodore Ts'o; +Cc: linux-ext4


On Wed, 8 Jan 2014, Theodore Ts'o wrote:

> Date: Wed, 8 Jan 2014 10:20:37 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org
> Subject: Re: A proposal for making ext4's journal more SMR (and flash)
>     friendly
> 
> On Wed, Jan 08, 2014 at 12:43:35PM +0100, Lukáš Czerner wrote:
> >So it means that we would have to have a bigger journal which is
> >multiple zones (or bands) in size, right?  However I assume that
> >the optimal journal size in this case will be very much dependent
> >on the workload used - for example a small file workload or other
> >metadata-heavy workloads would need a bigger journal. Could we
> >possibly make the journal size variable?
> 
> The journal size is already variable, e.g., "mke2fs -J size=512M".
> But yes, the optimal journal size will be highly variable.  

Yes, but I meant variable while the file system is mounted. With some
boundaries of course. But I guess we'll have to think about it once
we actually have some code done and hardware to test on.

I am just mentioning this because there is a possibility of this
being a problem, and I would not want users to have to pick the right
journal size for every file system.

> 
> > > The journal is not truncated when the file system is unmounted, and so
> > > there is no difference between mounting a file system which has been
> > > cleanly unmounted or after a system crash.
> > 
> > I would maybe argue that clean unmount might be the right time for a
> > checkpoint and for resetting the journal head back to the beginning,
> > because I do not see it as a performance sensitive operation. This
> > would in turn help us on the subsequent mount and run.
> 
> Yes, maybe.  It depends on how the SMR drive handles random
> writes.  I suspect that most of the time, if the zones are closer to
> 256MB or 512MB rather than 32MB, the SMR drive is not going to
> rewrite an entire zone just to handle a couple of random writes.  If
> we are doing an unmount, and so we don't care about performance for
> these random writes, and if there is a way for us to hint to the SMR
> drive that no, really, it really should do a full zone rewrite, even
> if we are only updating a dozen blocks out of the 256MB zone, then
> sure, this might be a good thing to do.
> 
> But if the SMR drive takes these random metadata writes and writes
> them to some staging area, then it might not improve performance after
> we reboot and remount the file system --- indeed, depending on the
> location and nature of the staging area, it might make things worse.
> 
> I think we will need to do some experiments, and perhaps get some
> input from SMR drive vendors.  They probably won't be willing to
> release detailed design information without our being under NDA, but
> we can probably explain the design, and watch how their faces grin or
> twitch or scowl.  :-)
> 
> BTW, even if we do have NDA information from one vendor, it might not
> necessarily follow that other vendors use the same tradeoffs.  So even
> if some of us have NDA'ed information from one or two vendors, I'm a
> bit hesitant about hard-coding the design based on what they tell us.
> Besides the risk that one of the vendors might do things differently,
> there is also the concern that future versions of the drive might use
> different schemes for managing the logical->physical translation
> layer.  So we will probably want to keep our implementation and design
> flexible.

I very much agree; the firmware implementation of those drives will
probably change a lot during the first generations.

> 
> > While this helps a lot to avoid random writes, it could possibly
> > result in much higher seek rates, especially with bigger journals.
> > We're trying hard to keep data and associated metadata close
> > together and this would very much break that. This might be
> > especially bad with SMR devices because those are designed to be much
> > bigger in size. But of course this is a trade-off, which makes it
> > very important to have good benchmarks.
> 
> While the file system is mounted, if the metadata block is being
> referenced frequently, it will be kept in memory, so the fact that
> it would have to seek to some random journal location if we need to
> read that metadata block might not be a big deal.  (This is similar to
> the argument used by log-structured file systems which claims that if
> we have enough memory, the fact that the metadata is badly fragmented
> doesn't matter.  Yes, if we are under heavy memory pressure, it might
> not work out.)
> 
> > I assume that the information about the newest commits for
> > particular metadata blocks would be kept in memory?  Otherwise it
> > would be quite an expensive operation. But it seems unavoidable at
> > mount time, so it might really be better to clear the journal at
> > unmount when we should have all this information already in memory?
> 
> Information about all commits and what blocks are still associated
> with them is already being kept in memory.  Currently this is being
> done via a jh/bh; we'd want to do this differently, since we wouldn't
> necessarily enforce that all blocks which are in the journal must be
> in the buffer cache.  (Although if we did keep all blocks in the
> journal in the buffer cache, it would address the issue you raised
> above, at the expense of using a large amount of memory --- more
> memory than we would be comfortable using, although I'd bet it is still
> less memory than, say, ZFS requires. :-)

I did not even think about always keeping those blocks in memory; it
should definitely be subject to memory reclaim. With a big enough
journal and the right workload this could grow out of proportion :)

> 
>        	      	    	      	      	     	- Ted
> 
> P.S.  One other benefit of this design which I forgot to mention in
> this version of the draft.  Using this scheme would also allow us to
> implement true read-only mounts and file system checks, without
> requiring that we modify the file system by replaying the journal
> before proceeding with the mount or e2fsck run.

Right, that would be a nice side effect.

Thanks!
-Lukas


* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08  5:31 A proposal for making ext4's journal more SMR (and flash) friendly Theodore Ts'o
  2014-01-08 11:43 ` Lukáš Czerner
@ 2014-01-08 22:14 ` Jan Kara
  2014-01-08 23:37   ` Theodore Ts'o
  2014-01-09  3:55 ` Andreas Dilger
  2 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2014-01-08 22:14 UTC
  To: Theodore Ts'o; +Cc: linux-ext4

On Wed 08-01-14 00:31:05, Ted Tso wrote:
> The simplest implementation of this design does not require making any
> on-disk format changes.  We simply suppress the writeback of dirty
> metadata blocks to the file system.  Instead we keep a journal map in
> memory, which maps metadata block numbers (or data block numbers if data
> journalling is enabled) to a block number in the journal.
> 
> The journal is not truncated when the file system is unmounted, and so
> there is no difference between mounting a file system which has been
> cleanly unmounted or after a system crash.  In both cases, the ext4 file
> system will scan the journal, and create an in-memory data structure
> which maps metadata block locations to their location in the journal.
> When a metadata block (or a data block, if data journalling is enabled)
> needs to be read, if the block number is found in the journal map, the
> block is read from the journal instead of from its "real" location on
> disk.
  So when I was thinking about this (already a couple of years ago) the thing
which stopped me was the question at which layer we should do the
translation. Ideally we would need something at submit_bh() level but just
wrapping submit_bh() calls inside ext4 isn't enough for stuff like symlinks
or journalled data... Do you have any thoughts on that?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08 22:14 ` Jan Kara
@ 2014-01-08 23:37   ` Theodore Ts'o
  2014-01-10  6:04     ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2014-01-08 23:37 UTC
  To: Jan Kara; +Cc: linux-ext4

On Wed, Jan 08, 2014 at 11:14:30PM +0100, Jan Kara wrote:
>   So when I was thinking about this (already a couple of years ago) the thing
> which stopped me was the question at which layer we should do the
> translation. Ideally we would need something at submit_bh() level but just
> wrapping submit_bh() calls inside ext4 isn't enough for stuff like symlinks
> or journalled data... Do you have any thoughts on that?

I think there are two interfaces that should handle nearly all of our
journal block mapping needs.  The functions that issue bio requests
directly tend to use ext4_get_block*() functions, and functions which
use the buffer cache use submit_bh() (typically via ext4_getblk).
There will probably be a few exceptions, but I don't think this should
be an intractable problem.
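
As a sketch of what that single translation point might look like
(names invented; these are not the real ext4 interfaces), both paths
would funnel reads through something like:

    #include <stdint.h>

    /* jmap lookup as in the proposal: returns the journal copy's
     * block number, or 0 if the block was never logged. */
    uint64_t jmap_lookup_blocknr(uint64_t fs_blocknr);   /* stand-in */

    /* Called before issuing any metadata READ, whether it originated
     * from a get_block-style mapping or from the buffer cache path. */
    static uint64_t remap_for_read(uint64_t fs_blocknr)
    {
            uint64_t j = jmap_lookup_blocknr(fs_blocknr);

            return j ? j : fs_blocknr;
    }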

Regards,

					- Ted


* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08  5:31 A proposal for making ext4's journal more SMR (and flash) friendly Theodore Ts'o
  2014-01-08 11:43 ` Lukáš Czerner
  2014-01-08 22:14 ` Jan Kara
@ 2014-01-09  3:55 ` Andreas Dilger
  2014-01-09 13:41   ` Theodore Ts'o
  2 siblings, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2014-01-09  3:55 UTC
  To: Theodore Ts'o; +Cc: Ext4 Developers List


On Jan 7, 2014, at 10:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> This is something I've discussed on our weekly conference calls, but I
> think it's time that I try to get it written down.
> 
>                     SMR-Friendly Journal for Ext4
>                              Version 0.10
>                            January 8, 2014
> 
> Design
> ======
> 
> The simplest implementation of this design does not require making any
> on-disk format changes.  We simply suppress the writeback of dirty
> metadata blocks to the file system.  Instead we keep a journal map in
> memory, which maps metadata block numbers (or data block numbers if data
> journalling is enabled) to a block number in the journal.
> 
> The journal is not truncated when the file system is unmounted, and so
> there is no difference between mounting a file system which has been
> cleanly unmounted or after a system crash.  In both cases, the ext4 file
> system will scan the journal, and create an in-memory data structure
> which maps metadata block locations to their location in the journal.
> When a metadata block (or a data block, if data journalling is enabled)
> needs to be read, if the block number is found in the journal map, the
> block is read from the journal instead of from its "real" location on
> disk.
> 
> Eventually, we will run out of room in the journal, and so we will need
> to retire commits from the head of the journal.  For each block
> referenced in the commit at the head of the journal, if it has since
> been updated in a newer commit, then no action will be needed.  For a
> block that has not been updated in a newer commit, there are two
> choices.   The checkpoint operation could either copy the block to the
> tail of the journal, or write the block back to its final / "permanent"
> location on disk.   The latter is preferable if it is unlikely that the
> block will be needed again, or if space is needed in the journal for other
> metadata blocks.   On the other hand, writing the block to the final
> location on disk will entail a random write, which will be especially
> expensive on SMR disks.  Some experimentation may be needed to determine
> the best heuristics to use.

I've been thinking about something like this for a long time already,
in the context of using a flash/NVRAM device for an external journal,
instead of in the context of SMR, but I think the results are the same.

Since even small flash drives are in the 10s of GB in size, it would be
very useful to use them for log-structured writes to avoid seeks on the
spinning disks.  One would certainly hope that in the age of multi-TB
SMR devices, manufacturers would be smart enough to include a few GB
of flash/NVRAM on board to take the majority of the pain away from
using SMR directly for anything other than a replacement for tape drives.

One important change needed for ext4/jbd2 is that buffers in the journal
can be unpinned from RAM before they are checkpointed.  Otherwise, jbd2
requires potentially as much RAM as the journal size.  With a flash or
NVRAM journal device it is not a problem to do random reads to fetch
the data blocks back if they are pushed out of cache.  With an SMR disk
this could potentially be a big slowdown to do random reads from the
journal just at the same time that it is doing random checkpoint writes.

Similarly, with NVRAM journal there is no need to order writes inside
the journal, but with SMR there may be a need to "allocate" blocks in
the journal in some sensible order to avoid pathological random seeks
for every single block.  I don't think it will be practical in many
cases to pin the buffers in memory for more than the few seconds that
JBD already does today.

Cheers, Andreas








* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-09  3:55 ` Andreas Dilger
@ 2014-01-09 13:41   ` Theodore Ts'o
  2014-01-10  8:14     ` Andreas Dilger
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2014-01-09 13:41 UTC
  To: Andreas Dilger; +Cc: Ext4 Developers List

On Wed, Jan 08, 2014 at 08:55:30PM -0700, Andreas Dilger wrote:
> Since even small flash drives are in the 10s of GB in size, it would be
> very useful to use them for log-structured writes to avoid seeks on the
> spinning disks.  One would certainly hope that in the age of multi-TB
> SMR devices, manufacturers would be smart enough to include a few GB
> of flash/NVRAM on board to take the majority of the pain away from
> using SMR directly for anything other than a replacement for tape drives.

Certainly, if we could have, say, 1GB per TB of flash that was
accessible to the OS (i.e., not used for SMR's internal physical to
logical mapping), this would be a huge help.

> One important change needed for ext4/jbd2 is that buffers in the journal
> can be unpinned from RAM before they are checkpointed.  Otherwise, jbd2
> requires potentially as much RAM as the journal size.  With a flash or
> NVRAM journal device it is not a problem to do random reads to fetch
> the data blocks back if they are pushed out of cache.  With an SMR disk
> this could potentially be a big slowdown to do random reads from the
> journal just at the same time that it is doing random checkpoint writes.

There's another question that this brings up.  Depending on:

	* Whether the journal is in flash or not
	* How busy the SMR drive is
	* The likelihood that the block will need to be modified in the future

etc., we may be better off forcing that block to its final location on
disk, instead of letting it get pushed out of memory, only to have to
reread it back in when it comes time to checkpoint the file.

For example, if we are unpacking a large tar file, or the distribution
is installing a large number of files, once an inode table block is
filled, we probably won't need to modify it in the future (modulo
atime updates[1]) so we probably should just write it to the inode table
at that point.  (Or we could possibly wait until we have multiple
consecutive inode table blocks, and then write them all to the disk at
the same time.)

But in order to do this, we need to have something different from LRU
--- we actually need to track LRM: "least recently modified", since it
doesn't matter if a directory block is getting referenced a lot; if it
hasn't been modified in a while, and we have a series of adjacent
blocks that are all ready to be written out, maybe we should more
aggressively get them out to the disk, especially if the disk is
relatively idle at the moment.
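
Something like the following, strictly as a sketch (the thresholds are
invented knobs):

    #include <stdint.h>
    #include <time.h>

    struct lrm_entry {
            uint64_t blocknr;
            time_t   last_modified;   /* advances only on dirtying */
    };

    /* Unlike LRU, reads never refresh the timestamp; a frequently
     * read but long-unmodified block is a writeback candidate. */
    static int lrm_should_flush(const struct lrm_entry *e, time_t now,
                                int disk_idle, int adjacent_dirty)
    {
            return (now - e->last_modified > 30) &&   /* cold for 30s */
                   disk_idle && adjacent_dirty >= 4;  /* batch neighbors */
    }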

> Similarly, with NVRAM journal there is no need to order writes inside
> the journal, but with SMR there may be a need to "allocate" blocks in
> the journal in some sensible order to avoid pathological random seeks
> for every single block.  I don't think it will be practical in many
> cases to pin the buffers in memory for more than the few seconds that
> JBD already does today.

Determining the best order to write the blocks into the journal at
commit time is going to be tricky since we want to keep the layering
guarantees between the jbd2 and ext4 layers.  I'm also not entirely
sure how much this will actually buy us.  If we are worrying about
seeks when we need to read related metadata blocks, if the blocks are
used frequently, the LRU algorithms will keep them in memory, so this
is really only a cold cache startup issue.  Also, if we think about
the most common cases where we need to read multiple metadata blocks,
it's the case of a directory block followed by an inode table block,
or an inode table block followed by an extent tree block.  In both of
these cases, the blocks will be "close" to one another, but there is
absolutely no guarantee that they will be adjacent.  So reordering the
blocks within the tens or hundreds of blocks that need to be written
as part of the journal commit may not be something that's worth a lot
of complexity.  Just by virtue of the fact that they are located
within the same commit means that the metadata blocks will be "close"
together.

So I think this is something we can look at as a later optimization /
refinement.

The first issue which you raised --- how we handle a buffer that
hasn't yet been checkpointed when it comes under memory pressure ---
is also an optimization question, but I think
that's a higher priority item for us to consider.

Cheers,

							- Ted


[1] Another design tangent: with SMR drives, it's clear that atime
updates are going to be a big deal.  So the question is how much will
our users really care about atime.  Can we just simply say, "use
noatime", or should we think about some way of handling atime updates
specially?  (For example, we could track atime updates separately, and
periodically include in the journal a list of inode numbers and their
real atimes.)  This is not something we should do early on --- it's
another later optional enhancement --- but it is something to think
about.
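
For instance, the batched atime log could be a compact journal record
of (inode, atime) pairs, something like this (an entirely hypothetical
layout):

    #include <stdint.h>

    struct atime_update {
            uint32_t ino;
            uint32_t atime;   /* seconds since the epoch */
    };

    struct atime_record {
            uint32_t magic;   /* hypothetical new jbd2 record type */
            uint32_t nr;      /* number of pairs that follow */
            struct atime_update upd[];
    };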



* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-08 23:37   ` Theodore Ts'o
@ 2014-01-10  6:04     ` Jan Kara
  2014-01-10 16:32       ` Theodore Ts'o
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2014-01-10  6:04 UTC
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

On Wed 08-01-14 18:37:52, Ted Tso wrote:
> On Wed, Jan 08, 2014 at 11:14:30PM +0100, Jan Kara wrote:
> >   So when I was thinking about this (already a couple of years ago) the thing
> > which stopped me was the question at which layer we should do the
> > translation. Ideally we would need something at submit_bh() level but just
> > wrapping submit_bh() calls inside ext4 isn't enough for stuff like symlinks
> > or journalled data... Do you have any thoughts on that?
> 
> I think there are two interfaces that should handle nearly all of our
> journal block mapping needs.  The functions that issue bio requests
> directly tend to use ext4_get_block*() functions, and functions which
> use the buffer cache use submit_bh() (typically via ext4_getblk).
> There will probably be a few exceptions, but I don't think this should
> be an intractable problem.
  Surely not intractable :) It was just ugly. But you are right that
hooking in ext4_map_blocks() and then special-casing the few cases where we
get the block number by different means (xattrs, inode table, group
descriptor, superblock, traversal of extent tree & indirect block tree)
should be reasonably elegant.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-09 13:41   ` Theodore Ts'o
@ 2014-01-10  8:14     ` Andreas Dilger
  0 siblings, 0 replies; 11+ messages in thread
From: Andreas Dilger @ 2014-01-10  8:14 UTC
  To: Theodore Ts'o; +Cc: Ext4 Developers List

On Jan 9, 2014, at 6:41, Theodore Ts'o <tytso@mit.edu> wrote:
> [Another design tangent: with SMR drives, it's clear that atime
> updates are going to be a big deal.  So the question is how much will
> our users really care about atime.  Can we just simply say, "use
> noatime", or should we think about some way of handling atime updates
> specially?  (For example, we could track atime updates separately, and
> periodically include in the journal a list of inode numbers and their
> real atimes.)  This is not something we should do early on --- it's
> another later optional enhancement --- but it is something to think
> about.

We already have noatime and relatime, since atime hurts on spinning
disks as well.  It wouldn't be impossible to do the atime updates in
the journal initially, and only write the inodes to their final
resting place after enough have been accumulated.

I think when there are many atime updates together (e.g. "grep -r") it
will be reasonable to checkpoint them to the filesystem, and when
there are only a few, those inodes could be rewritten to the journal.

I'm thinking the same could be done with data journaling for small files. 
It makes sense to write a small file's data to the journal, but write large
file data directly to its final resting place. 
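
In sketch form the policy is just a size cutoff (the 64KB threshold is
invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    static bool journal_file_data(uint64_t file_size)
    {
            return file_size <= 64 * 1024;   /* small: log data too */
    }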

Cheers, Andreas



* Re: A proposal for making ext4's journal more SMR (and flash) friendly
  2014-01-10  6:04     ` Jan Kara
@ 2014-01-10 16:32       ` Theodore Ts'o
  0 siblings, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2014-01-10 16:32 UTC
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jan 10, 2014 at 07:04:29AM +0100, Jan Kara wrote:
> > I think there are two interfaces that should handle nearly all of our
> > journal block mapping needs.  The functions that issue bio requests
> > directly tend to use ext4_get_block*() functions, and functions which
> > use the buffer cache use submit_bh() (typically via ext4_getblk).
> > There will probably be a few exceptions, but I don't think this should
> > be an intractable problem.
>
>   Surely not intractable :) It was just ugly. But you are right that
> hooking in ext4_map_blocks() and then special-casing the few cases where we
> get the block number by different means (xattrs, inode table, group
> descriptor, superblock, traversal of extent tree & indirect block tree)
> should be reasonably elegant.

Yeah, what makes this tricky is that you want to use the "real" block
number for writing (but then write the block into the journal and not
the final location on disk), but the "journal" block for reading.
Whether we put the phys->journal block mapping function in
ext4_map_blocks() triggered via Yet Another Ext4_Map_BlocksFlag, or
via a separate function is a reasonable question (although you can
probably guess I favor the latter).  But that's at the low level.  

In terms of what's above the ext4_map_blocks() layer, that's why I
suggested the ext4_get_block*() functions --- which are used almost
exclusively for reads and direct I/O (for DIO writes we will want to
force the blocks to their final location on disk, I suspect, since
otherwise we will break any journal checksum feature we might have
enabled.  OTOH, DIO writes are for files that are being modified via a
random write pattern, and these are going to be disastrous for SMR
disks anyway) --- and submit_bh() for most of the ext4 metadata
read/write calls.

						- Ted

