FreeMiNT's low level buffer cache
=================================

last update: 2000-10-28
Author: Frank Naumann <fnaumann@freemint.de>
notes:


I. Introduction
---------------

FreeMiNT 1.15 has a new global block cache. It is currently
used by NEWFATFS and MinixFS 0.70.

The cache is global and does most things automatically.
It is very easy to support and also reduces programming
overhead. For example, I added block cache support to
MinixFS. For this I completely removed the existing cache
management in MinixFS and replaced most of it with calls
that read/write buffered blocks. This reduced the binary
size from 39 kb to 26 kb. The cache management is also
very efficient and speeds up some operations on MinixFS
(I made some tests with MinixFS 0.60 and MinixFS 0.70).

The cache can be increased at boot time with
the configuration keyword "CACHE=<size in kb>" in the MiNT
configuration file.
For example: "CACHE=500" sets the cache to a size of
500 kb (if enough memory is available).
The default cache size is 100 kb. It's recommended to increase
the cache if you use many MinixFS 0.70 and NEWFATFS
partitions. Currently, the cache is first allocated from
TT-RAM and then from ST-RAM.

The cache is static. If the cache becomes dynamic in the
future, all xfs that support the new cache management will
remain compatible and will automatically benefit from any
improvements.

Note for removable media: the cache automatically locks the
drive if there are unwritten sectors in the cache.


II. Definition
--------------

call conventions:
- all arguments are on the stack
- return value is stored in d0
(cdecl call)

return value conventions:
- negative return values are ATARI error codes
- E_OK for success

type conventions:

char			8 bit signed
unsigned char	8 bit unsigned
short			16 bit signed integer
unsigned short	16 bit unsigned integer
long			32 bit signed integer
unsigned long	32 bit unsigned integer
llong			64 bit signed integer
ullong		64 bit unsigned integer

with:

typedef struct { long hi; unsigned long low; } llong;
typedef struct { unsigned long hi; unsigned long low; } ullong;
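
Because llong and ullong are plain two-word structures, an xfs has to
combine both halves by hand before doing arithmetic. The following is
only a sketch for a host compiler with <stdint.h>. The fixed-width
stand-in types and the helper names are hypothetical and are not part of
the FreeMiNT headers:

```c
#include <stdint.h>

/* Stand-ins for llong/ullong with explicit 32 bit halves
 * (hypothetical names; on the target, long is 32 bit). */
typedef struct { int32_t hi; uint32_t low; } llong32;
typedef struct { uint32_t hi; uint32_t low; } ullong32;

/* Combine the signed high word and the unsigned low word. */
static int64_t llong_to_i64(llong32 v)
{
	return (int64_t)v.hi * INT64_C(4294967296) + (int64_t)v.low;
}

/* Split a native 64 bit value into the two-word form. */
static ullong32 u64_to_ullong(uint64_t x)
{
	ullong32 r;

	r.hi  = (uint32_t)(x >> 32);
	r.low = (uint32_t)(x & 0xffffffffu);
	return r;
}
```

The multiplication is used instead of a shift so that negative high
words are handled without relying on shifting a negative value.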


III. interface description
--------------------------

1. introduction
===============

For the interface you need include/block_IO.h and some of the updated
FreeMiNT header files.

The kernel structure that is passed to a loadable XFS is extended
with a pointer to the block_IO functions.

See MinixFS 0.70 for an example (kernel.h, main.c). The
pointer is valid since FreeMiNT 1.15.0. This must be
checked before an XFS dereferences the pointer.

The block_IO interface is a structure that contains various
data fields and function pointers:

struct bio
{
	ushort	version;		/* buffer cache version */
	ushort	revision;		/* buffer cache revision */
	
# define BLOCK_IO_VERS	3		/* our existing version - incompatible interface change */
# define BLOCK_IO_REV	2		/* actual revision - compatible interface change */
	
	long	_cdecl (*config)	(const ushort drv, const long config, const long mode);
	
/* config: */
# define BIO_WP		1		/* configuring writeprotect feature */
# define BIO_WB		2		/* configuring writeback mode */
# define BIO_MAX_BLOCK	10		/* return maximum cacheable blocksize */
# define BIO_DEBUGLOG	100		/* only for debugging, kernel internal */
# define BIO_DEBUG_T	101		/* only for debugging, kernel internal */
	
	/* DI management */
	DI *	_cdecl (*get_di)	(ushort drv);
	DI *	_cdecl (*res_di)	(ushort drv);
	void	_cdecl (*free_di)	(DI *di);
	
	/* physical/logical mapping setting */
	void	_cdecl (*set_pshift)	(DI *di, ulong physical);
	void	_cdecl (*set_lshift)	(DI *di, ulong logical);
	
	/* cached block I/O */
	UNIT *	_cdecl (*lookup)	(DI *di, ulong sector, ulong blocksize);
	UNIT *	_cdecl (*getunit)	(DI *di, ulong sector, ulong blocksize);
	UNIT *	_cdecl (*read)		(DI *di, ulong sector, ulong blocksize);
	long	_cdecl (*write)		(UNIT *u);
	long	_cdecl (*l_read)	(DI *di, ulong sector, ulong blocks, ulong blocksize, void *buf);
	long	_cdecl (*l_write)	(DI *di, ulong sector, ulong blocks, ulong blocksize, const void *buf);
	
	/* optional feature */
	void	_cdecl (*pre_read)	(DI *di, ulong *sector, ulong blocks, ulong blocksize);
	
	/* synchronization */
	void	_cdecl (*lock)		(UNIT *u);
	void	_cdecl (*unlock)	(UNIT *u);
	
	/* update functions */
	void	_cdecl (*mark_modified)	(UNIT *u);
	void	_cdecl (*sync_drv)	(DI *di);
	
	/* cache management */
	long	_cdecl (*validate)	(DI *di, ulong maxblocksize);
	void	_cdecl (*invalidate)	(DI *di);
	
	/* revision 1 extension: resident block I/O
	 */
	UNIT *	_cdecl (*get_resident)	(DI *di, ulong sector, ulong blocksize);
	void	_cdecl (*rel_resident)	(UNIT *u);
	
	/* revision 2 extension: remove explicitly a cache unit without writing
	 * optional, never fail
	 */
	void	_cdecl (*remove)	(UNIT *u);
	
	
	long	res[3];			/* reserved for future */
};

The first thing to do is to check the block_IO version number. A different
version number means incompatible changes! If you detect a different
version number at module startup you must terminate your xfs.

The revision number signals several interface compatible enhancements.
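
The startup check can be sketched as follows. This is only an
illustration: struct bio_header, enum bio_check and check_block_io() are
hypothetical names, and gating the optional functions on the revision
number is my reading of the "revision 1/2 extension" comments in
struct bio:

```c
#include <stddef.h>

#define BLOCK_IO_VERS	3	/* version the xfs was built against */

/* Hypothetical cut-down view: only the two leading fields of struct bio. */
struct bio_header
{
	unsigned short version;
	unsigned short revision;
};

enum bio_check
{
	BIO_CHECK_FAIL = 0,	/* no pointer or incompatible version */
	BIO_CHECK_BASE,		/* only the revision 0 functions */
	BIO_CHECK_RESIDENT,	/* get_resident()/rel_resident() usable */
	BIO_CHECK_REMOVE	/* remove() usable as well */
};

static enum bio_check check_block_io(const struct bio_header *bio)
{
	if (bio == NULL)
		return BIO_CHECK_FAIL;	/* kernel older than 1.15.0 */

	if (bio->version != BLOCK_IO_VERS)
		return BIO_CHECK_FAIL;	/* incompatible: terminate the xfs */

	if (bio->revision >= 2)
		return BIO_CHECK_REMOVE;
	if (bio->revision >= 1)
		return BIO_CHECK_RESIDENT;

	return BIO_CHECK_BASE;
}
```

An xfs would run such a check once at module startup and refuse to
initialize on BIO_CHECK_FAIL.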


This description refers to version 3, revision 2 of the block_IO 
interface.
--------------------------------------------------------------

The interface is designed to make your life easier. For example, it
automatically maps all calls to XHDI or BIOS. It's also possible to
cache non-BIOS devices. The block_IO module maps logical sizes to
physical sizes automatically. Simply call set_lshift() to
specify the logical format.


Conditions of use:
------------------

- the xfs only calls the block_IO functions for data I/O
- the xfs is fully reentrant
- the xfs doesn't modify data structures of the block_IO
  module
- logical/physical translation only works for logical >= physical


All communication with the block_IO module goes through a so-called
device identifier or DI:

typedef struct di DI;

/* device identificator */
struct di
{
	DI	*next;		/* internal: next in linked list */
	UNIT	**table;	/* internal: unit hash table */
	UNIT	*wb_queue;	/* internal: writeback queue */
	
	const ushort drv;	/* internal: BIOS device number (unique) */
	ushort	major;		/* XHDI */
	ushort	minor;		/* XHDI */
	ushort	mode;		/* internal: some flags */
	
# define BIO_WP_MODE	0x01	/* writeprotect bit (soft/hard) */
# define BIO_WB_MODE	0x02	/* writeback bit (soft) */
# define BIO_REMOVABLE	0x04	/* removable media */
# define BIO_LRECNO	0x10	/* lrecno supported */
	
	ulong	start;		/* physical start sector */
	ulong	size;		/* physical sectors */
	ulong	pssize;		/* internal: physical sector size */
	
	ushort	pshift;		/* internal: size to count calculation */
	ushort	lshift;		/* internal: logical to physical recno calculation */
	
	long	(*rwabs)(DI *di, ushort rw, void *buf, ulong size, ulong lrecno);
	long	(*dskchng)(DI *di);
	
	ushort	valid;		/* internal: DI valid */
	ushort	lock;		/* internal: DI in use */
	
	char	id[4];		/* partition id (GEM, BGM, RAW, \0D6, ...) */
	ushort	key;		/* XHDI key */
	
	char	res[18];	/* reserved for future */
};



2. DI handling
==============

The first thing to do is to get a DI. This is best placed in the root function of the xfs.
There are three functions for DI handling:

get_di():
---------
- gets the DI for the given BIOS drive and locks it

return: - a valid DI
        - NULL if this DI is locked or not accessible through XHDI/BIOS

res_di():
---------
- reserves the DI; same as get_di(), but does nothing
  except lock the DI

- used for non-BIOS devices
- the xfs *must* fill in some data fields:
  start, size, pssize, rwabs, dskchng
- set_pshift() and set_lshift() must also be called for a
  successful initialization

return: - valid DI
        - NULL if this DI is already locked (in use).

free_di():
----------
- unlocks this DI; after this call the DI becomes invalid
  and can't be used anymore

return: nothing


NOTE:
-----

After get/res_di() the DI for this device becomes locked
and is not returned again by get/res_di() until it is
unlocked with free_di().

After get_di() logical to physical mapping is set to 1:1.
If you work with logical sizes you must call set_lshift to adjust the mapping.

After res_di() pssize is set to 512 and logical = physical.
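
The locking rule can be illustrated with a small user-space mock. It is
not the kernel code: the cut-down struct di, MAX_DRV and the mock_*
functions are hypothetical and only model "one owner per drive until
free_di()":

```c
#include <stddef.h>

#define MAX_DRV	32	/* assumed number of drives, mock only */

typedef struct di DI;
struct di		/* cut-down mock, not the kernel's struct di */
{
	unsigned short drv;
	unsigned short lock;	/* DI in use */
	unsigned long  pssize;	/* physical sector size */
};

static DI di_table[MAX_DRV];

/* Hand out the DI of a drive once; further requests fail until freed. */
static DI *mock_get_di(unsigned short drv)
{
	DI *di;

	if (drv >= MAX_DRV)
		return NULL;		/* not accessible */

	di = &di_table[drv];
	if (di->lock)
		return NULL;		/* already in use */

	di->drv = drv;
	di->lock = 1;
	di->pssize = 512;
	return di;
}

/* Unlock the DI; the caller must not use the pointer afterwards. */
static void mock_free_di(DI *di)
{
	di->lock = 0;
}
```

A second mock_get_di() for the same drive returns NULL until
mock_free_di() has been called, which is the behaviour an xfs has to
expect from get_di() and res_di().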



3. logical/physical translation
===============================

set_pshift():
-------------
- sets physical sector size and adjusts shift values
  (shift values are used for fast calculations)

It's not recommended to use this function in combination with get_di()
because the physical sector size is automatically determined through XHDI.
It will also create problems with the XHDI/BIOS rwabs() wrapper.

set_lshift():
-------------
- sets logical sector size and adjusts shift values

If you always work with groups of sectors you can specify
this size.
This is useful, for example, for TOS FAT filesystems that work
with logical sector sizes and clusters. It is also used by
MinixFS, which always works with blocks of 1024 bytes.

After this function is called, all block_IO calls automatically
map the given parameters to physical parameters.
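
The idea behind the shift values can be sketched like this. The helper
names are hypothetical, and the formula (logical size = physical size
shifted left by lshift) is an assumption based on the field comments in
struct di, not a copy of the kernel code:

```c
/* Return the shift that maps a logical size onto a physical size,
 * or -1 if the sizes cannot be mapped (logical < physical, or the
 * ratio is not a power of two). */
static int compute_lshift(unsigned long physical, unsigned long logical)
{
	int shift;

	if (physical == 0 || logical < physical)
		return -1;

	for (shift = 0; shift < 16; shift++)
		if ((physical << shift) == logical)
			return shift;

	return -1;
}

/* Translate a logical record number into a physical one. */
static unsigned long logical_to_physical(unsigned long recno, int lshift)
{
	return recno << lshift;
}
```

For MinixFS on a device with 512 byte sectors this gives a shift of 1,
so logical block 10 starts at physical sector 20. Note that this sketch
returns -1 for bad sizes, while the real set_lshift() halts the system.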


NOTE:
-----

pshift/lshift in the DI structure are very sensitive and important values.
A mistake here will directly cause problems on the
corresponding device, for example badly written sectors.

Also start/size/pssize/pshift/lshift in the DI structure are used for
validation, cache consistency and so on. If you control
those variables yourself (non-BIOS device -> res_di())
those values must be right.

Never set pshift/lshift directly, always use the 
corresponding functions set_pshift() and set_lshift().

If you call set_pshift/set_lshift with incorrect parameters the
system is halted. There is no error recovery possible.


4. reading and writing
======================

lookup():
---------
- checks if a block is in the cache

return: - a ptr to the UNIT
        - NULL if the UNIT is not in cache

getunit():
----------
- allocates a new cache UNIT for the given startsector
- useful for write only data
- checks with lookup() if the UNIT is already in the cache

return: - a ptr to the new UNIT, the data area is not cleared
        - NULL if no free cache UNIT is found or any other error

read():
-------
- same as getunit() but reads the corresponding block
  into the UNIT
- checks with lookup() if the UNIT is already in the cache

return: - a ptr to the new UNIT
        - NULL for any error (read error, no free UNIT in cache)

write():
--------
- mark this UNIT as dirty in writeback mode
- write this UNIT back in writethrough mode

return: - E_OK or the Rwabs error number

l_read():
---------
- large read; reads a block directly to the buffer
- only useful for large blocks (to reduce I/O overhead)
- block_IO automatically syncs large transfers with existing
  cached units (cache consistency)

return: - E_OK or Rwabs error number

l_write():
----------
- large write; writes a block directly from the buffer
- mostly useful for large blocks (to reduce I/O overhead)
- cache consistency is also guaranteed
- small blocks will automatically be buffered

return: - E_OK or Rwabs error number

pre_read():
-----------
- not implemented at the moment


NOTE:
-----

read/write/l_read/l_write/pre_read can block the active
application until the transfer is done (background DMA).
That's why your xfs must be reentrant.

A UNIT is valid until the *next* block_IO call. It's possible to lock
UNITs. An interrupt handler is not allowed to call the block_IO
module. A task switch never occurs while we are in kernel mode.
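
The difference between writethrough and writeback in write() can be
shown with a small user-space mock. Everything here is hypothetical
(the mock DI carries its own "disk", the UNIT layout is invented and
the mock_* functions take fewer arguments than the real ones); it only
models the behaviour described above:

```c
#include <string.h>

#define BIO_WB_MODE	0x02	/* writeback bit, as in struct di */
#define BLOCK		16	/* tiny block size, mock only */
#define NBLOCKS		4	/* size of the mock disk */

typedef struct di DI;
typedef struct unit UNIT;

struct di			/* mock: carries its own "disk" */
{
	unsigned short mode;
	unsigned char  disk[NBLOCKS][BLOCK];
	int            io_writes;	/* counts simulated device writes */
};

struct unit			/* mock: invented layout */
{
	DI            *di;
	unsigned long  sector;
	unsigned short dirty;
	unsigned char  data[BLOCK];
};

static UNIT cache[NBLOCKS];	/* one slot per sector keeps it simple */

static void flush(UNIT *u)
{
	memcpy(u->di->disk[u->sector], u->data, BLOCK);
	u->di->io_writes++;
	u->dirty = 0;
}

static UNIT *mock_read(DI *di, unsigned long sector)
{
	UNIT *u;

	if (sector >= NBLOCKS)
		return NULL;

	u = &cache[sector];
	if (u->di != di)	/* cache miss: fill the slot from disk */
	{
		memcpy(u->data, di->disk[sector], BLOCK);
		u->di = di;
		u->sector = sector;
		u->dirty = 0;
	}
	return u;
}

static long mock_write(UNIT *u)
{
	if (u->di->mode & BIO_WB_MODE)
		u->dirty = 1;	/* writeback: only mark as dirty */
	else
		flush(u);	/* writethrough: write immediately */
	return 0;		/* E_OK */
}

static void mock_sync_drv(DI *di)
{
	int i;

	for (i = 0; i < NBLOCKS; i++)
		if (cache[i].di == di && cache[i].dirty)
			flush(&cache[i]);
}
```

In writethrough mode the data reaches the mock disk inside
mock_write(); in writeback mode it stays in the cache until
mock_sync_drv() runs.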



5. synchronization
==================

lock():
-------
- increments the lock counter for the UNIT

unlock():
---------
- decrements the lock counter


NOTE:
-----

A locked UNIT is never invalidated. This is useful for open directories
and similar cases where pointer references remain. But be careful, this
slows down the search algorithm. Also the cache runs out of
free UNITS if there are a lot of locked UNITS. A locked
UNIT must be unlocked, otherwise the memory is lost.
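
Why too many locked UNITS starve the cache can be seen in the following
mock. The three-slot cache, the UNIT layout and the replacement policy
(first free slot, else first unlocked slot) are assumptions made for the
illustration and not the kernel's algorithm:

```c
#include <stddef.h>

#define NUNITS	3	/* deliberately tiny cache */

typedef struct unit
{
	unsigned long  sector;
	unsigned short lock;	/* lock counter */
	unsigned short used;
} UNIT;

static UNIT units[NUNITS];

static void mock_lock(UNIT *u)
{
	u->lock++;
}

static void mock_unlock(UNIT *u)
{
	if (u->lock)
		u->lock--;
}

/* Find a slot for a new sector: a free slot if there is one,
 * otherwise the first unlocked slot; NULL if all are locked. */
static UNIT *mock_getunit(unsigned long sector)
{
	UNIT *victim = NULL;
	size_t i;

	for (i = 0; i < NUNITS; i++)
	{
		if (!units[i].used)
		{
			victim = &units[i];
			break;
		}
		if (!units[i].lock && victim == NULL)
			victim = &units[i];
	}

	if (victim != NULL)
	{
		victim->used = 1;
		victim->lock = 0;
		victim->sector = sector;
	}
	return victim;
}
```

With all three slots locked, mock_getunit() has nothing to reuse and
returns NULL. A single mock_unlock() makes that slot available again.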



6. update
=========

mark_modified():
----------------
- marks a UNIT as modified; this action inserts the UNIT in
  the writeback queue but doesn't write back anything
- if the UNIT is already marked no action is performed

return: nothing, always successful

sync_drv():
-----------
- writes back all dirty UNITS of the specified DI

return: nothing, always successful


NOTE:
-----

It's strongly recommended to first mark all modified UNITS 
as dirty and then write back all with sync_drv(). There is 
a write back optimization that will reduce a lot of I/O 
overhead in this case.

It's also strongly recommended to use the inline function bio_MARK_MODIFIED()
instead of bio_mark_modified(). The inline function
first checks if the UNIT is already marked and calls
mark_modified() only if the UNIT is clean. This
avoids unnecessary function calls, which is useful in writeback mode.

Supporting user-configurable writeback mode is very easy.
The only thing to do is to use the inline function
bio_SYNC_DRV() instead of sync_drv().
bio_SYNC_DRV() checks if this drive is in writethrough
mode; if yes, it calls sync_drv(), otherwise nothing happens
(= writeback). Dcntl(V_CNTR_WB) must also be supported.
Dcntl(V_CNTR_WB) only calls config() to change the
writeback bit. Take a look at the MinixFS source for an
example.

sync_drv() can also block the active application.
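
What the two inline helpers do can be sketched as follows. This is not
the code from block_IO.h: the sketch_* names, the cut-down structures
and in particular the "dirty" field of the UNIT are assumptions made to
match the description above:

```c
#define BIO_WB_MODE	0x02	/* writeback bit, as in struct di */

typedef struct di DI;
typedef struct unit UNIT;

struct di   { unsigned short mode; };			/* cut-down mock */
struct unit { DI *di; unsigned short dirty; };		/* assumed layout */

struct bio_subset	/* hypothetical subset of struct bio */
{
	void (*mark_modified)(UNIT *u);
	void (*sync_drv)(DI *di);
};

static int mark_calls, sync_calls;	/* counters for the demonstration */

static void mock_mark_modified(UNIT *u) { u->dirty = 1; mark_calls++; }
static void mock_sync_drv(DI *di)       { (void)di; sync_calls++; }

static struct bio_subset bio = { mock_mark_modified, mock_sync_drv };

/* Call into the cache only if the UNIT is still clean. */
static void sketch_MARK_MODIFIED(UNIT *u)
{
	if (!u->dirty)
		bio.mark_modified(u);
}

/* Sync only in writethrough mode; in writeback mode do nothing. */
static void sketch_SYNC_DRV(DI *di)
{
	if (!(di->mode & BIO_WB_MODE))
		bio.sync_drv(di);
}
```

Marking the same UNIT twice costs only one call into the cache, and a
drive switched to writeback mode is no longer synced after every
operation.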



7. cache management
===================

validate():
-----------
- checks the given block size against the internal maximum
  block size limit

return: - E_OK if this block size is supported
        - ENSMEM if the block size is larger than the internal limit
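
A possible order of calls when an xfs takes over a drive is sketched
below. The order (get_di, validate, set_lshift), the struct bio_subset,
the mock functions and the numeric error values are assumptions for
this illustration. The point is only that a failed validate() should be
followed by free_di():

```c
#include <stddef.h>

#define E_OK		0
#define MOCK_ENSMEM	(-39)	/* assumed numeric value of ENSMEM */
#define MOCK_EDRIVE	(-46)	/* assumed error for "no DI available" */
#define MOCK_LIMIT	8192UL	/* arbitrary block size limit of the mock */

typedef struct di { unsigned short lock; unsigned long logical; } DI;

struct bio_subset	/* hypothetical subset of struct bio */
{
	DI  *(*get_di)(unsigned short drv);
	void (*free_di)(DI *di);
	void (*set_lshift)(DI *di, unsigned long logical);
	long (*validate)(DI *di, unsigned long maxblocksize);
};

static DI the_di;	/* the mock knows a single drive */

static DI *mock_get_di(unsigned short drv)
{
	(void)drv;
	if (the_di.lock)
		return NULL;
	the_di.lock = 1;
	return &the_di;
}

static void mock_free_di(DI *di) { di->lock = 0; }

static void mock_set_lshift(DI *di, unsigned long logical)
{
	di->logical = logical;
}

static long mock_validate(DI *di, unsigned long maxblocksize)
{
	(void)di;
	return (maxblocksize <= MOCK_LIMIT) ? E_OK : MOCK_ENSMEM;
}

static const struct bio_subset mock_bio =
	{ mock_get_di, mock_free_di, mock_set_lshift, mock_validate };

/* Take over a drive; on any failure the DI is released again. */
static long mount_drive(const struct bio_subset *bio, unsigned short drv,
			unsigned long blocksize, DI **out)
{
	long r;
	DI *di = bio->get_di(drv);

	if (di == NULL)
		return MOCK_EDRIVE;

	r = bio->validate(di, blocksize);
	if (r != E_OK)
	{
		bio->free_di(di);	/* don't keep the drive locked */
		return r;
	}

	bio->set_lshift(di, blocksize);
	*out = di;
	return E_OK;
}
```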

invalidate():
-------------
- invalidates all cache UNITS for the given DI


NOTE:
-----

invalidate() does not free the DI, it only removes all 
cache UNITS of this DI.

invalidate() also removes all modified UNITS. Those UNITS 
are never written back by invalidate().
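
The consequence is that modified data is lost unless the drive is
synced before it is invalidated. A tiny mock makes this visible. The
one-byte "blocks", the cache layout and the mock_* functions are
hypothetical:

```c
#define NUNITS	4

typedef struct unit
{
	int           valid;
	int           dirty;
	unsigned char value;	/* one byte stands in for the block data */
} UNIT;

static UNIT          cache[NUNITS];	/* slot i caches sector i */
static unsigned char disk[NUNITS];	/* the mock device */

/* Change a cached block; nothing is written to the device yet. */
static void mock_modify(int sector, unsigned char v)
{
	cache[sector].valid = 1;
	cache[sector].dirty = 1;
	cache[sector].value = v;
}

/* Write back all dirty units. */
static void mock_sync_drv(void)
{
	int i;

	for (i = 0; i < NUNITS; i++)
		if (cache[i].valid && cache[i].dirty)
		{
			disk[i] = cache[i].value;
			cache[i].dirty = 0;
		}
}

/* Drop all units, dirty or not, without writing anything. */
static void mock_invalidate(void)
{
	int i;

	for (i = 0; i < NUNITS; i++)
		cache[i].valid = cache[i].dirty = 0;
}
```

Calling mock_invalidate() right after mock_modify() leaves the device
untouched; calling mock_sync_drv() first preserves the change.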


8. revision 1 extension: resident block I/O
===========================================

get_resident():
---------------
Instantiates a resident cached UNIT. The memory is *not* taken from the
buffer cache. The UNIT is dynamically allocated through kmalloc().
The purpose is to cache blocks that are needed for a long time, such as
the superblock. It is implemented through kmalloc() so that it does not
pollute the buffer cache, which has a static design.

return: - UNIT * on success, buffer cache is automatically synchronized

rel_resident():
---------------
Releases a UNIT previously allocated through get_resident(). Only use
this function for UNITs that were allocated with get_resident().

return: - nothing
          halt the system on invalid UNIT parameter
          (assumes data corruption inside the xfs driver)
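
The life cycle of a resident UNIT can be sketched in user space, with
malloc()/free() standing in for the kernel's kmalloc(). The UNIT
layout, the mock disk and the mock_* functions are hypothetical, and
the synchronization with the buffer cache is not modelled:

```c
#include <stdlib.h>
#include <string.h>

#define BLOCK	1024	/* block size of the mock device */
#define NBLOCKS	4

typedef struct unit
{
	unsigned char *data;	/* points behind the UNIT header */
	unsigned long  sector;
	unsigned long  size;
} UNIT;

static unsigned char disk[NBLOCKS][BLOCK];	/* the mock device */

/* Allocate a UNIT outside the cache and fill it from the device. */
static UNIT *mock_get_resident(unsigned long sector, unsigned long blocksize)
{
	UNIT *u;

	if (sector >= NBLOCKS || blocksize != BLOCK)
		return NULL;

	u = malloc(sizeof *u + blocksize);
	if (u == NULL)
		return NULL;

	u->data = (unsigned char *)(u + 1);
	u->sector = sector;
	u->size = blocksize;
	memcpy(u->data, disk[sector], blocksize);
	return u;
}

/* Give the memory back; only valid for mock_get_resident() units. */
static void mock_rel_resident(UNIT *u)
{
	free(u);
}
```

An xfs would typically hold such a UNIT for the superblock from mount
until unmount, so the block never competes with ordinary cache UNITs.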


9. revision 2 extension
=======================

remove():
---------
Remove the cached UNIT without writing it back. Main purpose is for
truncate() optimization of modified but released UNITs.
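
The saving for truncate() can be illustrated with a mock that counts
device writes. The UNIT layout and all function names are hypothetical;
the sketch only shows that a removed UNIT is skipped by a later sync:

```c
typedef struct unit
{
	int           valid;
	int           dirty;
	unsigned long sector;
} UNIT;

static int io_writes;	/* counts simulated device writes */

/* Forget the UNIT without writing it back. */
static void mock_remove(UNIT *u)
{
	u->valid = 0;
	u->dirty = 0;
}

/* Write back every UNIT that is still valid and dirty. */
static void mock_sync(UNIT *units, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (units[i].valid && units[i].dirty)
		{
			io_writes++;
			units[i].dirty = 0;
		}
}

/* Truncate: drop every cached block at or beyond new_blocks. */
static void truncate_cached(UNIT *units, int n, unsigned long new_blocks)
{
	int i;

	for (i = 0; i < n; i++)
		if (units[i].valid && units[i].sector >= new_blocks)
			mock_remove(&units[i]);
}
```

With three dirty blocks and a truncation to one block, the following
sync writes a single block instead of three.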


10. helper
==========

config():
---------
- internal configuration and information:

  - returns the maximum block size if config = BIO_MAX_BLOCK (10)

  - changes the writeback mode of the given drv to mode
    (ENABLED/DISABLED) if config = BIO_WB (2)
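
The two documented config() requests can be modelled like this. The
mock is not the kernel function: the values of ENABLED/DISABLED, the
error code, the per-drive mode table and the returned maximum block
size are all assumptions:

```c
#define BIO_WB		2	/* configuring writeback mode */
#define BIO_MAX_BLOCK	10	/* return maximum cacheable blocksize */
#define BIO_WB_MODE	0x02	/* writeback bit in the DI mode field */

#define ENABLED		1	/* assumed value */
#define DISABLED	0	/* assumed value */
#define E_OK		0
#define MOCK_EINVAL	(-1)	/* assumed error code, mock only */
#define NDRV		4	/* number of drives in the mock */
#define MOCK_MAX_BLOCK	8192L	/* arbitrary limit of the mock */

static unsigned short drv_mode[NDRV];	/* stands in for di->mode */

static long mock_config(unsigned short drv, long config, long mode)
{
	if (drv >= NDRV)
		return MOCK_EINVAL;

	switch (config)
	{
		case BIO_MAX_BLOCK:
			return MOCK_MAX_BLOCK;

		case BIO_WB:
			if (mode == ENABLED)
				drv_mode[drv] |= BIO_WB_MODE;
			else if (mode == DISABLED)
				drv_mode[drv] &= (unsigned short)~BIO_WB_MODE;
			else
				return MOCK_EINVAL;
			return E_OK;

		default:
			return MOCK_EINVAL;
	}
}
```

A Dcntl(V_CNTR_WB) handler in an xfs would do little more than forward
the requested mode to config() with BIO_WB.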

