Date:   Sun, 27 Dec 2020 20:06:36 -0800
From:   lokesh jaliminche <lokesh.jaliminche@...il.com>
To:     Andreas Dilger <adilger@...ger.ca>
Cc:     Jan Kara <jack@...e.cz>, Martin Steigerwald <martin@...htvoll.de>,
        Ext4 <linux-ext4@...r.kernel.org>,
        Andrew Morton <akpm@...uxfoundation.org>,
        Mauricio Faria de Oliveira <mfo@...onical.com>
Subject: Re: improved performance in case of data journaling

On Tue, Dec 22, 2020 at 2:24 PM Andreas Dilger <adilger@...ger.ca> wrote:
>
> On Dec 22, 2020, at 10:47 AM, Jan Kara <jack@...e.cz> wrote:
> >
> > Hi!
> >
> > On Thu 03-12-20 01:07:51, lokesh jaliminche wrote:
> >> Hi Martin,
> >>
> >> thanks for the quick response,
> >>
> >> Apologies from my side, I should have posted my fio job description
> >> with the fio logs.
> >> Anyway, here is my fio workload.
> >>
> >> [global]
> >> filename=/mnt/ext4/test
> >> direct=1
> >> runtime=30s
> >> time_based
> >> size=100G
> >> group_reporting
> >>
> >> [writer]
> >> new_group
> >> rate_iops=250000
> >> bs=4k
> >> iodepth=1
> >> ioengine=sync
> >> rw=randwrite
> >> numjobs=1
> >>
> >> I am using an Intel Optane SSD, so it's certainly very fast.
> >>
> >> I agree that delayed logging could help to hide the performance
> >> degradation due to the actual writes to the SSD. However, as per the iostat
> >> output, data is definitely crossing the block layer, and since
> >> data journaling logs both data and metadata, I am wondering why
> >> or how IO requests see reduced latencies compared to metadata
> >> journaling or even no journaling.
> >>
> >> Also, I am using direct IO mode, so ideally it should not be using any type
> >> of caching. I am not sure whether that applies to journal writes, but the whole
> >> point of journaling is to prevent data loss in case of abrupt failures, so
> >> caching journal writes may result in data loss unless we are using NVRAM.
> >
> > Well, first bear in mind that in data=journal mode, ext4 does not support
> > direct IO so all the IO is in fact buffered. So your random-write workload
> > will be transformed to semilinear writeback of the page cache pages. Now
> > I think given your SSD storage this performs much better because the
> > journalling thread committing data will drive large IOs (IO to the journal
> > will be sequential) and even when the journal is filled and we have to
> > checkpoint, we will run many IOs in parallel which is beneficial for SSDs.
> > Whereas without data journalling your fio job will just run one IO at a
> > time which is far from utilizing full SSD bandwidth.
> >
> > So to summarize you see better results with data journalling because you in
> > fact do buffered IO under the hood :).

That makes sense, thank you!!
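
To convince myself of that, I will try re-running the same workload
with async IO and a deeper queue, so the device sees parallel IO even
without data journaling. Something like the rough sketch below
(iodepth=32 is just a value I picked, and I dropped the rate_iops cap):

[global]
filename=/mnt/ext4/test
direct=1
runtime=30s
time_based
size=100G
group_reporting

[writer]
new_group
bs=4k
iodepth=32
ioengine=libaio
rw=randwrite
numjobs=1

If, with this, the no-journal / data=ordered results get much closer to
the data=journal ones (at least in device utilization and throughput),
that would fit your explanation that the batched, buffered journal
writeback is what made data=journal look faster in my test.
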
>
> IMHO that is one of the benefits of data=journal in the first place, regardless
> of whether the journal is NVMe or HDD - that it linearizes what would otherwise
> be a random small-block IO workload into something much friendlier to the storage.  As
> long as it maintains the "written to stable storage" semantic for O_DIRECT, I
> don't think it matters whether or not the data is copied.  Even without the
> use of data=journal, there are still some code paths that copy O_DIRECT writes.
>
> Ideally, being able to dynamically/automatically change between data=journal
> and data=ordered depending on the IO workload (e.g. large writes go straight
> to their allocated blocks, small writes go into the journal) would be the best
> of both worlds.  High "IOPS" for workloads that need it (even on HDD), without
> overwhelming the journal device bandwidth with large streaming writes.
>
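
If I read the chattr(1) man page correctly, a manual (static) version
of that split already exists: the per-file 'j' attribute journals a
file's data even when the filesystem is mounted data=ordered or
data=writeback. A rough sketch, with made-up device and file names:

  # default ordered-mode journaling for the filesystem as a whole
  mount -o data=ordered /dev/nvme0n1p1 /mnt/ext4

  # additionally journal the data of just this small-random-write file
  chattr +j /mnt/ext4/smallwrites.db
  lsattr /mnt/ext4/smallwrites.db

Of course that is per-file and set by hand, not the automatic
per-IO-size switching you describe, but it might be a way to estimate
how much such a split would help.
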
> This would tie in well with the proposed SMR patches, which allow a very large
> journal device to (essentially) transform ext4 into a log-structured filesystem
> by allowing journal shadow buffers to be dropped from memory rather than being
> pinned in RAM:
>
> https://github.com/tytso/ext4-patch-queue/blob/master/series
> https://github.com/tytso/ext4-patch-queue/blob/master/jbd2-dont-double-bump-transaction-number
> https://github.com/tytso/ext4-patch-queue/blob/master/journal-superblock-changes
> https://github.com/tytso/ext4-patch-queue/blob/master/add-journal-no-cleanup-option
> https://github.com/tytso/ext4-patch-queue/blob/master/add-support-for-log-metadata-block-tracking-in-log
> https://github.com/tytso/ext4-patch-queue/blob/master/add-indirection-to-metadata-block-read-paths
> https://github.com/tytso/ext4-patch-queue/blob/master/cleaner
> https://github.com/tytso/ext4-patch-queue/blob/master/load-jmap-from-journal
> https://github.com/tytso/ext4-patch-queue/blob/master/disable-writeback
> https://github.com/tytso/ext4-patch-queue/blob/master/add-ext4-journal-lazy-mount-option
>
>
> Having a 64GB-256GB NVMe device for the journal and handling most of the small
> IO directly to the journal, and only periodically flushing to the filesystem to
> HDD would really make those SMR disks more usable, since they are starting to
> creep into consumer/NAS devices, even when users aren't really aware of it:
>
> https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
>
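
For what it's worth, the static version of that layout is something I
could already test with an external journal device, if I read the
mke2fs/tune2fs man pages right (device names below are made up):

  # turn the NVMe partition into a dedicated external journal device
  mke2fs -O journal_dev /dev/nvme0n1p2

  # create the filesystem on the SMR/HDD with its journal on the NVMe
  mkfs.ext4 -J device=/dev/nvme0n1p2 /dev/sdb1

  # or move an existing (unmounted) filesystem's journal there
  tune2fs -O ^has_journal /dev/sdb1
  tune2fs -J device=/dev/nvme0n1p2 /dev/sdb1

As I understand it, the patches above would then let that journal be
cleaned lazily instead of being checkpointed and dropped aggressively.
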
> >> So the questions that come to my mind are:
> >> 1. Why do writes without journaling have longer latencies than
> >>    write requests with metadata or data journaling?
> >> 2. Since metadata journaling issues relatively fewer journal writes than data
> >>    journaling, why are writes with data journaling faster than in no-journaling or
> >>    metadata-journaling mode?
> >> 3. If there is an optimization that allows data journaling to be so fast
> >>    without any risk of data loss, why is the same optimization not used for
> >>    metadata journaling?
> >>
> >> On Thu, Dec 3, 2020 at 12:20 AM Martin Steigerwald <martin@...htvoll.de> wrote:
> >>>
> >>> lokesh jaliminche - 03.12.20, 08:28:49 CET:
> >>>> I have been doing experiments to analyze the impact of data journaling
> >>>> on IO latencies. Theoretically, data journaling should show longer
> >>>> latencies than metadata journaling. However, I observed
> >>>> that when I enable data journaling I see improved performance. Is
> >>>> there any specific optimization for data journaling in the write
> >>>> path?
> >>>
> >>> This has been discussed before as Andrew Morton found that data
> >>> journalling would be surprisingly fast with interactive write workloads.
> >>> I would need to look it up in my performance training slides or do an
> >>> internet search to find the reference to that discussion again.
> >>>
> >>> AFAIR even Andrew had no explanation for that. So I thought, why would I
> >>> have one? However, an idea came to my mind: the journal is a sequential
> >>> area on the disk. This could help with hard disks, I thought, at least if
> >>> the I/O goes mostly to the same, not too big, location/file – as you did not
> >>> post it, I don't know exactly what your fio job file is doing. However, the
> >>> latencies you posted as well as the device name certainly point to fast
> >>> flash storage :).
> >>>
> >>> Another idea that just came to my mind: AFAIK ext4 uses quite a lot of
> >>> delayed logging and relogging. That means if a block in the journal is
> >>> changed again within a certain time frame, Ext4 changes it in
> >>> memory before the journal block is written out to disk. Thus if the same
> >>> block is overwritten again and again within a short time, at least some of the
> >>> updates would only happen in RAM. That might help latencies even with
> >>> NVMe flash, as RAM usually is still faster.
> >>>
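
That batching should also be visible directly in jbd2's transaction
statistics, so I will watch those while the fio job runs, roughly like
this (the device directory name is just an example):

  # average transaction commit time, blocks logged per transaction, etc.
  cat /proc/fs/jbd2/nvme0n1p1-8/info
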
> >>> Of course I bet that Ext4 maintainers have a more accurate or detailed
> >>> explanation than I do. But that was at least my idea about this.
> >>>
> >>> Best,
> >>> --
> >>> Martin
> >>>
> >>>
> > --
> > Jan Kara <jack@...e.com>
> > SUSE Labs, CR
>
>
> Cheers, Andreas
