Journaling FFS with WAPBL
Jörg Sonnenberger
Overview
- A short introduction to FFS
- WAPBL: Overview
- WAPBL: In-depth
- Performance
- Open issues
- Questions
A short introduction to FFS
- Superblock
- Inodes
- Directories
- Cylinder groups
- Consistency requirements
The FFS superblock
- Description of the filesystem
- Block size, fragment size, number of blocks, etc.
- Time of last mount and if unmounted cleanly
- Summary of filesystem content
- Stored redundantly to protect against bad blocks and similar failures
- Several versions exist: some fields were added, others removed
- dumpfs(8) tells the version (FFSv2 for WAPBL!)
Inodes
- The file content, not the file name
- 128 bytes for FFSv1, 256 bytes for FFSv2
- Link count, time stamps, size, flags, ownership, ...
- References to the first 12 blocks and indirect blocks for the rest
- Last block can be partially allocated: fragments
- Not all blocks have to be allocated: holes
- Inodes never end with holes
- Extended Attribute block for FFSv2
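The split between the 12 direct blocks and the indirect blocks can be sketched as follows. This is a minimal illustration, not FFS code: the function name is hypothetical, and the 16 KB block size, the 8-byte block pointers (as in FFSv2) and the three levels of indirection are assumptions that do not come from the slides.

```python
NDADDR = 12  # direct block pointers per inode (from the slide)
NIADDR = 3   # assumption: single, double and triple indirection


def indirection_level(lbn, block_size=16384, ptr_size=8):
    """Return 0 for a direct block, otherwise the indirection level (1..NIADDR)."""
    if lbn < 0:
        raise ValueError("logical block number must be non-negative")
    if lbn < NDADDR:
        return 0
    ptrs_per_block = block_size // ptr_size
    lbn -= NDADDR
    span = ptrs_per_block  # number of blocks reachable at the current level
    for level in range(1, NIADDR + 1):
        if lbn < span:
            return level
        lbn -= span
        span *= ptrs_per_block
    raise ValueError("logical block number beyond the last indirection level")
```

With these assumed sizes, one indirect block holds 2048 pointers, so logical blocks 12 to 2059 sit behind the single indirect block.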
Directories
- Records of inode number, record len, file type, name
- Padded to block boundaries
- "." and ".." as special entries
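A directory block can be walked record by record using the record length. The sketch below assumes a little-endian layout with a 32-bit inode number, a 16-bit record length, an 8-bit file type and an 8-bit name length, names padded to 4 bytes, and a last record stretched to fill a 512-byte block. The helper names and these sizes are illustrative assumptions, not taken from the slides.

```python
import struct

# Assumed header layout: inode number, record length, file type, name length
HEADER = struct.Struct("<IHBB")
DT_DIR = 4  # assumed file type value for directories


def pack_entry(ino, ftype, name, reclen=None):
    """Build one directory record, padded with NUL bytes up to reclen."""
    raw = name.encode()
    minimal = (HEADER.size + len(raw) + 1 + 3) & ~3  # header + name + NUL, 4-byte aligned
    reclen = minimal if reclen is None else reclen
    if reclen < minimal:
        raise ValueError("record too short for name")
    return HEADER.pack(ino, reclen, ftype, len(raw)) + raw.ljust(reclen - HEADER.size, b"\0")


def parse_block(block):
    """Return (inode, type, name) for every used record in one directory block."""
    entries, off = [], 0
    while off < len(block):
        ino, reclen, ftype, namlen = HEADER.unpack_from(block, off)
        if reclen < HEADER.size or off + reclen > len(block):
            raise ValueError("corrupt directory block")
        if ino != 0:  # inode 0 marks an unused record
            start = off + HEADER.size
            entries.append((ino, ftype, block[start:start + namlen].decode()))
        off += reclen
    return entries


# Template for a new directory: "." and "..", the latter stretched to fill the block
block = pack_entry(2, DT_DIR, ".") + pack_entry(2, DT_DIR, "..", reclen=512 - 12)
entries = parse_block(block)
```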
Cylinder groups
- Distribute files over disk, reducing fragmentation
- Contain fixed size inode lists
- Contain free space bitmaps
- Contain superblock copy
Consistency requirements
- Superblocks have to stay in sync
- Cylinder groups need consistent summaries and bitmaps
- Inodes must be freed once link count reaches 0
- Indirect blocks must be written before the inode pointer that references them
- Inodes must be initialized before creating directory entries
- Inode reference count must be modified on link(2) and unlink(2)
Practical example: mkdir(2)
- Allocate free inode
- Allocate block by marking it as used in the bitmap
- Write directory template with "." and ".." entries
- Increment reference count of parent directory
- Write inode to disk with allocated block referenced and ref count 2
- Write directory entry to parent directory
- Update statistics
WAPBL: Goals
- Crash recovery without fsck
- Improve performance by reducing synchronisation
- Potentially reduce number of disk seeks by allowing aggregation
- Simpler and less error-prone than Soft Updates
- Trivial to use: mount -o log ...
WAPBL: Components
- The generic WAPBL backend
- Integration into FFS
Overview: The WAPBL backend
- Journal writing and replaying
- Journal records:
  - Block entry
  - Revocation of earlier journaled blocks
  - List of unreferenced allocated inodes
- bwrite / bdwrite register the buffer and defer writing
In-depth: Journal layout
- Circular buffer of records
- Header block at the start and the end of the log area
- Headers are written alternately, each with a generation counter
- Newer header determines newest valid and oldest active record
- Explicit disk synchronisation after all writes
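Choosing between the two header blocks can be sketched like this. The field names (`generation`, `head`, `tail`, `valid`) are hypothetical stand-ins, not the on-disk WAPBL header; `valid` represents whatever sanity checks the reader applies to a header block.

```python
from dataclasses import dataclass


@dataclass
class Header:
    generation: int     # incremented with every header write
    head: int           # offset just past the newest valid record
    tail: int           # offset of the oldest active record
    valid: bool = True  # stand-in for the reader's sanity checks


def pick_header(a, b):
    """Return the usable header with the higher generation, or None."""
    candidates = [h for h in (a, b) if h is not None and h.valid]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.generation)


def active_bytes(hdr, log_size):
    """Bytes between tail and head in the circular log.

    Simplification: head == tail is treated as an empty log.
    """
    return (hdr.head - hdr.tail) % log_size
```

Because the headers are written in alternation, a crash in the middle of a header write still leaves the other, older header intact.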
In-depth: Journal layout (II)
- Block entries: to be written to given location after crash
- Block revocation: when changing from meta data to data block
- Unreferenced allocated inode:
  - During initialisation: mode = 0
  - Unlinked, but still open: mode != 0
In-depth: Journal replay
- Process all journal entries in order:
  - Block entries: add to hash table
  - Revocation entries: remove entries from hash table again
  - Unreferenced inodes: keep last entry
- If not mounting read-only, write all blocks back to disk
- Call filesystem backend for unreferenced inodes
- Shared code between kernel and fsck
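The replay steps above can be sketched with a dictionary standing in for the hash table. The tuple-based record format and the function names are illustrative assumptions, not the WAPBL on-disk format or API.

```python
def replay(records):
    """Process journal records in order.

    Returns the blocks to write back and the last unreferenced inode list.
    """
    blocks = {}        # disk address -> latest journaled contents
    unreferenced = []  # only the last list seen is kept
    for rec in records:
        kind = rec[0]
        if kind == "block":
            _, addr, data = rec
            blocks[addr] = data        # a later entry supersedes an earlier one
        elif kind == "revoke":
            _, addr = rec
            blocks.pop(addr, None)     # earlier journaled copy must not be replayed
        elif kind == "inodes":
            _, inodes = rec
            unreferenced = list(inodes)
        else:
            raise ValueError(f"unknown record type: {kind!r}")
    return blocks, unreferenced


def write_back(blocks, disk, read_only=False):
    """Copy replayed blocks to the (dict-modelled) disk unless mounted read-only."""
    if not read_only:
        disk.update(blocks)
```

In this model the caller would then hand the returned inode list to the filesystem backend, which decides what to do with each inode based on its mode.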
Overview: FFS integration
- Journal location in superblock
- Registration of inode allocation and freeing
- Registration after freeing meta data blocks
- Annotate transaction borders
- Allocation of journal
- Journal replay on mount
Journal location
- End of partition:
  - Size limited only by disk space
  - Disk address, size and block size stored in superblock
- In-filesystem:
  - Limited to size of cylinder group
  - Address, size, block size and inode number in superblock
- On mount, journal is created on-demand:
  - At the end, if enough free space (1MB journal per 1GB size)
  - Inside the filesystem (up to 64MB, at least 1MB)
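The sizing rule can be written down as a small calculation. This sketch assumes the 1MB-per-1GB ratio is also the starting point for the in-filesystem case before the 1MB and 64MB limits apply, and that the size is rounded down to whole gigabytes. Both are assumptions, and the function name is hypothetical.

```python
MB = 1 << 20
GB = 1 << 30


def journal_size(fs_bytes, in_filesystem):
    """Journal size in bytes for a filesystem of fs_bytes bytes."""
    size = (fs_bytes // GB) * MB  # 1MB of journal per 1GB of filesystem
    if in_filesystem:
        # In-filesystem journals: at least 1MB, at most 64MB
        size = max(1 * MB, min(size, 64 * MB))
    return size
```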
In-depth: mkdir(2)
- -> sys_mkdir
- -> ufs_mkdir
- Allocate and register new inode:
  ffs_valloc: UFS_WAPBL_BEGIN + ffs_nodealloccg + UFS_WAPBL_END
- UFS_WAPBL_BEGIN
- UFS_UPDATE -> unregister inode again
- (write template)
- UFS_WAPBL_END
In-depth: mkdir(2) journal record
- First transaction:
  - Cylinder group updates (Block entry)
  - Inode update (Block entry)
  - Unreferenced inode list
- Second transaction:
  - Inode update (Block entry)
  - Inode update for parent (Block entry)
  - Directory content (Block entry)
  - Unreferenced inode list
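The two transactions can be modelled as plain lists to see what recovery finds after a crash. The record tuples and function names are illustrative. The sketch assumes that the second transaction's unreferenced inode list no longer contains the new inode, since the inode is unregistered again once it is written.

```python
def mkdir_transactions(new_ino, parent_ino):
    """Return the journal records of the two mkdir transactions."""
    first = [
        ("block", "cylinder group"),
        ("block", f"inode {new_ino}"),
        ("inodes", [(new_ino, 0)]),   # allocated, still initialising: mode = 0
    ]
    second = [
        ("block", f"inode {new_ino}"),
        ("block", f"inode {parent_ino}"),
        ("block", "directory content"),
        ("inodes", []),               # assumption: inode unregistered again
    ]
    return first, second


def unreferenced_after(transactions):
    """Replay keeps only the last unreferenced inode list it sees."""
    last = []
    for transaction in transactions:
        for kind, payload in transaction:
            if kind == "inodes":
                last = payload
    return last
```

If only the first transaction reached the disk, replay reports the new inode as allocated but unreferenced with mode 0, so the filesystem backend can release it instead of leaking it.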
In-depth: ffs_write
- Can be called from inside the filesystem code or from sys_write/vn_write
- UFS_WAPBL_BEGIN if not already inside a transaction
- -> VOP_PUTPAGES
- UFS_WAPBL_END if started earlier
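The conditional begin/end pattern can be sketched with a toy journal object. The `Journal` class and `write` function are hypothetical. `end()` closes the transaction immediately here, which is a simplification for the sake of the example.

```python
class Journal:
    """Toy journal that tracks whether a transaction is open."""

    def __init__(self):
        self.active = False
        self.pending = []
        self.committed = []

    def begin(self):
        if self.active:
            raise RuntimeError("transaction already open")
        self.active = True

    def log(self, what):
        if not self.active:
            raise RuntimeError("no open transaction")
        self.pending.append(what)

    def end(self):
        if not self.active:
            raise RuntimeError("no open transaction")
        self.committed.append(self.pending)
        self.pending = []
        self.active = False


def write(journal, data):
    """Open a transaction only if the caller has not already done so."""
    started_here = not journal.active
    if started_here:
        journal.begin()
    journal.log(("putpages", data))  # stands in for the VOP_PUTPAGES step
    if started_here:
        journal.end()                # close only what this call opened
```

Called from the system call path, `write` gets its own transaction. Called from filesystem code that already holds one, its records join the caller's transaction.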
Performance: test system
- HP ProLiant ML110
- Xeon 3040 @1.86GHz
- 2GB memory
- Test on dedicated SATA disk, write caching enabled
- OpenSuSE 11.1 and NetBSD 5.0
Performance (I): 10x pkgsrc.tar.bz2
Performance (II): build.sh release
Open issues
- No checksum of journal entries
- Too much data flushing
- Too much serialisation of writes
- Holding the journal locked over UBC operations
- No data ordering
- Support for external journal