Filesystem
20 Nov 2016
This post discusses filesystem operations in a Linux environment.
- A filename is a string that is encoded and stored in the filesystem.
- The Linux kernel first decodes the filename stored on disk (more specifically, in the filesystem on disk), then encodes it again, and finally passes it to user-space applications (such as ls, Thunar, etc.).
- When it comes to reading filenames on disk:
- The Windows NT kernel reads two bytes at a time. Everything in the Windows NT kernel is UCS-2.
- The Linux kernel reads byte by byte, which means it recognizes only byte-oriented encoding streams: variable-length encodings, single-byte encodings (e.g. cp437) included.
The possible encodings are listed in Native Language Support (NLS) under File Systems in make menuconfig. But be careful: this NLS (in kernel space) is different from the one in user space (the nls USE flag and the system locale).
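The difference between the two reading models can be seen by encoding the same name both ways. This is a minimal Python sketch, not kernel code: the sample name `é.txt` is only an illustration, and Python's `utf-16-le` codec stands in for UCS-2 (the two agree for characters in the Basic Multilingual Plane).

```python
# Illustration only: compare a fixed two-byte encoding with byte-oriented ones.
name = "é.txt"  # hypothetical sample filename

ucs2 = name.encode("utf-16-le")   # two bytes per character (UCS-2 style)
cp437 = name.encode("cp437")      # single-byte encoding: one byte per character
utf8 = name.encode("utf-8")       # variable length: "é" takes two bytes here

for label, data in (("UCS-2", ucs2), ("cp437", cp437), ("UTF-8", utf8)):
    print(f"{label:6} {len(data):2} bytes  {data.hex(' ')}")
```

A reader that consumes two bytes at a time only makes sense for the first form; the other two have to be walked byte by byte.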
- FAT is not a separate filesystem, but the common part shared by MSDOS, UMSDOS and VFAT.
- The mount options of FAT apply to MSDOS, UMSDOS and VFAT as well.
- Note that these are Linux filesystem drivers, not the original Windows FAT filesystems.
- You might encounter two confusing options, namely codepage and iocharset. Without the proper settings, you will see garbled filenames.
- Put simply, codepage encodes (on creation) and decodes (on display) the shortname, while iocharset should match the system locale.
- Both codepage and iocharset are used by the kernel, not by applications.
- The codepage used for display must be the same as the one used at creation; otherwise the shortname is decoded as garbage.
- The details are a longer story, as follows:
shortname and longname
- Shortname is a concept that belongs to MSDOS only.
- It looks like 8.3: the base name is at most 8 bytes long and the extension occupies 3 bytes.
- Case insensitive. In fact, only upper-case characters are allowed.
With VFAT, long filenames are supported, which lifts those two limitations.
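The two shortname limitations (8.3 length, upper case only) can be sketched as a small check. The helper name `fits_8_3` is hypothetical, and the check is deliberately simplified: it ignores the set of characters that MSDOS forbids and only looks at byte lengths in a given codepage and at letter case.

```python
def fits_8_3(name: str, codepage: str = "cp437") -> bool:
    """Rough check: could `name` be stored as an MSDOS shortname as-is?"""
    base, dot, ext = name.rpartition(".")
    if not dot:                      # no extension at all
        base, ext = name, ""
    if "." in base:                  # only one dot is allowed in this sketch
        return False
    if name != name.upper():         # lower-case letters are not allowed
        return False
    try:
        base_bytes = base.encode(codepage)
        ext_bytes = ext.encode(codepage)
    except UnicodeEncodeError:       # character not representable in the codepage
        return False
    # Lengths are counted in bytes, not characters.
    return 1 <= len(base_bytes) <= 8 and len(ext_bytes) <= 3


print(fits_8_3("README.TXT"))             # short and upper case
print(fits_8_3("readme.txt"))             # lower case
print(fits_8_3("LONGFILENAME.TXT"))       # base name too long
print(fits_8_3("中文文件名.TXT", "cp936"))  # 5 characters, but 10 bytes in cp936
```

Counting bytes rather than characters matters for multi-byte codepages: a Chinese character takes two bytes in cp936, so only four of them fit in the base name.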
To stay compatible with MSDOS, each filename also has a shortname version. We call the original one the longname. That is to say, VFAT stores two filenames in the filesystem, like:
longname; shortname, codepage
Shortname is encoded in the codepage (e.g. 936 for Simplified Chinese), while longname is encoded as Unicode (NOT UTF-8).
Actually, most modern filesystems encode filenames as Unicode.
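To make the two on-disk forms concrete, here is a small Python sketch. It does not reproduce the real VFAT directory entry layout; it only shows the bytes that the same name produces in a codepage versus in a two-byte Unicode form (UTF-16LE is assumed here for the longname), and what happens when the shortname bytes are decoded with a different codepage, such as 437.

```python
name = "中文"  # hypothetical sample name

short_bytes = name.encode("cp936")      # shortname bytes in codepage 936
long_bytes = name.encode("utf-16-le")   # longname bytes, assuming UTF-16LE code units

same_codepage = short_bytes.decode("cp936")   # decoded with the codepage used at creation
wrong_codepage = short_bytes.decode("cp437")  # decoded with a different codepage

print("shortname bytes:", short_bytes.hex(" "))
print("longname bytes: ", long_bytes.hex(" "))
print("codepage 936 ->", repr(same_codepage))
print("codepage 437 ->", repr(wrong_codepage))
```

The bytes on disk never change; only the codepage used to interpret them decides whether the result is the original name or garbage.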
When it comes to mounting, we mean mounting FAT (MSDOS, VFAT) under Linux.
- codepage option. The kernel uses it to decode the shortname and then translates it into Unicode. The longname is already Unicode.
- iocharset option. The kernel uses it to encode the Unicode shortname (MSDOS) or longname (VFAT). The encoded stream is then passed to user space.
There is a special iocharset value, utf8. You should set it in a separate way (discussed next).
- On receiving the stream, the application decodes it with the system locale (nls in user space).
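The three steps above (decode with codepage, re-encode with iocharset, decode again with the locale) can be modelled in a few lines of Python. This is only a model of the data flow, not of the kernel's implementation; the function names `kernel_read_shortname` and `app_display` are made up for the illustration.

```python
def kernel_read_shortname(disk_bytes: bytes, codepage: str, iocharset: str) -> bytes:
    """Model of the kernel side: codepage -> Unicode -> iocharset."""
    unicode_name = disk_bytes.decode(codepage, errors="replace")
    return unicode_name.encode(iocharset, errors="replace")


def app_display(stream: bytes, locale_encoding: str) -> str:
    """Model of the application side: decode with the system locale."""
    return stream.decode(locale_encoding, errors="replace")


on_disk = "中文".encode("cp936")  # shortname as written on a GBK system

# iocharset matches the locale: the name survives the round trip.
good = app_display(kernel_read_shortname(on_disk, "cp936", "utf-8"), "utf-8")

# iocharset does not match the locale: the application sees garbage.
bad = app_display(kernel_read_shortname(on_disk, "cp936", "cp936"), "utf-8")

print(repr(good), repr(bad))
```

The model shows why iocharset has to follow the system locale: it is the only point where the kernel's output encoding and the application's input encoding meet.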
Set the correct value
Suppose we have a VFAT filesystem initialized on a Windows GBK system. Now it is to be mounted and used under Linux.
The shortname's codepage is cp936, while the longname is Unicode.
- Set the codepage=936 mount option (without the cp prefix).
- If you want to see the shortname, use -t msdos instead of -t vfat. In that case, iocharset is ignored.
- In practice, MSDOS is rarely used nowadays. If you don't care about shortnames, just leave codepage at its default (set in the kernel).
- iocharset= depends on the system locale xx_YY.ZZ:
- If ZZ is a Chinese encoding such as GB2312, GBK or GB18030, set iocharset=cp936 (with the cp prefix).
- If ZZ is UTF-8, DO NOT set iocharset=utf8! Instead, use the utf8 option alone and keep the default iocharset value.
With iocharset=utf8, the kernel vfat module allows lower-case shortnames, which conflicts with Windows conventions.
The kernel prints a warning in that case.
- Remember that the relevant NLS modules must be compiled as Y or M.
- The default codepage and iocharset (possibly utf8) values can be set in the kernel configuration.
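The rules in this section can be condensed into a small helper that builds the option string for `mount -o`. The function name `vfat_mount_options` is hypothetical, and the sketch covers only the cases discussed here (UTF-8 and the Chinese GB encodings); any other locale is rejected rather than guessed.

```python
def vfat_mount_options(locale_name: str, codepage: int = 936) -> str:
    """Build a VFAT option string from a locale such as 'zh_CN.UTF-8'."""
    # Take the ZZ part of xx_YY.ZZ, dropping any '@modifier' suffix.
    charset = locale_name.partition(".")[2].split("@")[0].upper()

    options = [f"codepage={codepage}"]        # no 'cp' prefix here
    if charset in {"UTF-8", "UTF8"}:
        options.append("utf8")                # NOT iocharset=utf8
    elif charset in {"GB2312", "GBK", "GB18030"}:
        options.append("iocharset=cp936")     # with the 'cp' prefix
    else:
        raise ValueError(f"locale charset not covered by this sketch: {charset!r}")
    return ",".join(options)


print(vfat_mount_options("zh_CN.UTF-8"))
print(vfat_mount_options("zh_CN.GBK"))
```

The resulting string would be passed as the argument of `-o` when mounting with `-t vfat`; the device and mount point are up to you.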
- Shortname has nothing to do with NTFS, so NTFS has no codepage mount option.
- For NTFS, the iocharset option has been renamed to nls.
Similarly, instead of nls=utf8, you can use utf8 alone.
With nls=utf8, there is no lower/upper-case filename issue.