11.8 缓存 I/O

  通过使用输入输出缓存, 我们可以提高代码的效率。 我们可以建立一个输入缓存, 并一次读入一系列的字节。 然后, 我们再一个接一个地从缓存中提取它们。

  同样, 我们可以建立一个输出缓存。 把我们的输出存在里面,直到添满。 同时, 我们将让内核将缓存的内容写到标准输出 stdout 上。

  程序将在没有输入的时候结束。 但是我们仍然需要让内核再向标准输出 stdout 进行最后一次写操作, 否则一些内容将留在缓存中, 永不输出。 别忘记这个操作, 否则你将会困惑为什么你的程序丢失了一些应有的输出。

%include   'system.inc'

%define BUFSIZE 2048

section .data
hex db  '0123456789ABCDEF'

section .bss
ibuffer resb    BUFSIZE
obuffer resb    BUFSIZE

section .text
global  _start
_start:
    sub eax, eax
    sub ebx, ebx
    sub ecx, ecx
    mov edi, obuffer

.loop:
    ; read a byte from stdin
    call    getchar

    ; convert it to hex
    mov dl, al
    shr al, 4
    mov al, [hex+eax]
    call    putchar

    mov al, dl
    and al, 0Fh
    mov al, [hex+eax]
    call    putchar

    mov al, ' '
    cmp dl, 0Ah
    jne .put
    mov al, dl

.put:
    call    putchar
    jmp short .loop

align 4
getchar:
    or  ebx, ebx
    jne .fetch

    call    read

.fetch:
    lodsb
    dec ebx
    ret

read:
    push    dword BUFSIZE
    mov esi, ibuffer
    push    esi
    push    dword stdin
    sys.read
    add esp, byte 12
    mov ebx, eax
    or  eax, eax
    je  .done
    sub eax, eax
    ret

align 4
.done:
    call    write       ; flush output buffer
    push    dword 0
    sys.exit

align 4
putchar:
    stosb
    inc ecx
    cmp ecx, BUFSIZE
    je  write
    ret

align 4
write:
    sub edi, ecx    ; start of buffer
    push    ecx
    push    edi
    push    dword stdout
    sys.write
    add esp, byte 12
    sub eax, eax
    sub ecx, ecx    ; buffer is empty now
    ret

  现在我们的程序有了第三个部分,名字叫 .bss。 这个部分不会包含在我们可执行文件里, 因此不会被初始化。 我们需要用 resb 代替 db。 它仅仅为我们保留了指定大小的未初始化内存。

  We take advantage of the fact that the system does not modify the registers: We use registers for what, otherwise, would have to be global variables stored in the .data section. This is also why the UNIX® convention of passing parameters to system calls on the stack is superior to the Microsoft convention of passing them in the registers: We can keep the registers for our own use.

  We use EDI and ESI as pointers to the next byte to be read from or written to. We use EBX and ECX to keep count of the number of bytes in the two buffers, so we know when to dump the output to, or read more input from, the system.

  Let us see how it works now:

% nasm -f elf hex.asm
% ld -s -o hex hex.o
% ./hex
Hello, World!
Here I come!
48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 0A
48 65 72 65 20 49 20 63 6F 6D 65 21 0A
^D %

  Not what you expected? The program did not print the output until we pressed ^D. That is easy to fix by inserting three lines of code to write the output every time we have converted a new line to 0A. I have marked the three lines with > (do not copy the > in your hex.asm).

%include   'system.inc'

%define BUFSIZE 2048

section .data
hex db  '0123456789ABCDEF'

section .bss
ibuffer resb    BUFSIZE
obuffer resb    BUFSIZE

section .text
global  _start
_start:
    sub eax, eax
    sub ebx, ebx
    sub ecx, ecx
    mov edi, obuffer

.loop:
    ; read a byte from stdin
    call    getchar

    ; convert it to hex
    mov dl, al
    shr al, 4
    mov al, [hex+eax]
    call    putchar

    mov al, dl
    and al, 0Fh
    mov al, [hex+eax]
    call    putchar

    mov al, ' '
    cmp dl, 0Ah
    jne .put
    mov al, dl

.put:
    call    putchar
>   cmp al, 0Ah
>   jne .loop
>   call    write
    jmp short .loop

align 4
getchar:
    or  ebx, ebx
    jne .fetch

    call    read

.fetch:
    lodsb
    dec ebx
    ret

read:
    push    dword BUFSIZE
    mov esi, ibuffer
    push    esi
    push    dword stdin
    sys.read
    add esp, byte 12
    mov ebx, eax
    or  eax, eax
    je  .done
    sub eax, eax
    ret

align 4
.done:
    call    write       ; flush output buffer
    push    dword 0
    sys.exit

align 4
putchar:
    stosb
    inc ecx
    cmp ecx, BUFSIZE
    je  write
    ret

align 4
write:
    sub edi, ecx    ; start of buffer
    push    ecx
    push    edi
    push    dword stdout
    sys.write
    add esp, byte 12
    sub eax, eax
    sub ecx, ecx    ; buffer is empty now
    ret

  Now, let us see how it works:

% nasm -f elf hex.asm
% ld -s -o hex hex.o
% ./hex
Hello, World!
48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 0A
Here I come!
48 65 72 65 20 49 20 63 6F 6D 65 21 0A
^D %

  Not bad for a 644-byte executable, is it!

注意: This approach to buffered input/output still contains a hidden danger. I will discuss──and fix──it later, when I talk about the dark side of buffering.

11.8.1 How to Unread a Character

警告: This may be a somewhat advanced topic, mostly of interest to programmers familiar with the theory of compilers. If you wish, you may skip to the next section, and perhaps read this later.

  While our sample program does not require it, more sophisticated filters often need to look ahead. In other words, they may need to see what the next character is (or even several characters). If the next character is of a certain value, it is part of the token currently being processed. Otherwise, it is not.

  For example, you may be parsing the input stream for a textual string (e.g., when implementing a language compiler): If a character is followed by another character, or perhaps a digit, it is part of the token you are processing. If it is followed by white space, or some other value, then it is not part of the current token.

  This presents an interesting problem: How to return the next character back to the input stream, so it can be read again later?

  One possible solution is to store it in a character variable, then set a flag. We can modify getchar to check the flag, and if it is set, fetch the byte from that variable instead of the input buffer, and reset the flag. But, of course, that slows us down.

  The C language has an ungetc() function, just for that purpose. Is there a quick way to implement it in our code? I would like you to scroll back up and take a look at the getchar procedure and see if you can find a nice and fast solution before reading the next paragraph. Then come back here and see my own solution.

  The key to returning a character back to the stream is in how we are getting the characters to start with:

  First we check if the buffer is empty by testing the value of EBX. If it is zero, we call the read procedure.

  If we do have a character available, we use lodsb, then decrease the value of EBX. The lodsb instruction is effectively identical to:

   mov al, [esi]
    inc esi

  The byte we have fetched remains in the buffer until the next time read is called. We do not know when that happens, but we do know it will not happen until the next call to getchar. Hence, to "return" the last-read byte back to the stream, all we have to do is decrease the value of ESI and increase the value of EBX:

ungetc:
    dec esi
    inc ebx
    ret

  But, be careful! We are perfectly safe doing this if our look-ahead is at most one character at a time. If we are examining more than one upcoming character and call ungetc several times in a row, it will work most of the time, but not all the time (and will be tough to debug). Why?

  Because as long as getchar does not have to call read, all of the pre-read bytes are still in the buffer, and our ungetc works without a glitch. But the moment getchar calls read, the contents of the buffer change.

  We can always rely on ungetc working properly on the last character we have read with getchar, but not on anything we have read before that.

  If your program reads more than one byte ahead, you have at least two choices:

  If possible, modify the program so it only reads one byte ahead. This is the simplest solution.

  If that option is not available, first of all determine the maximum number of characters your program needs to return to the input stream at one time. Increase that number slightly, just to be sure, preferably to a multiple of 16──so it aligns nicely. Then modify the .bss section of your code, and create a small "spare" buffer right before your input buffer, something like this:

section    .bss
    resb    16  ; or whatever the value you came up with
ibuffer resb    BUFSIZE
obuffer resb    BUFSIZE

  You also need to modify your ungetc to pass the value of the byte to unget in AL:

ungetc:
    dec esi
    inc ebx
    mov [esi], al
    ret

  With this modification, you can call ungetc up to 17 times in a row safely (the first call will still be within the buffer, the remaining 16 may be either within the buffer or within the "spare").

本文档和其它文档可从这里下载:ftp://ftp.FreeBSD.org/pub/FreeBSD/doc/.

如果对于FreeBSD有问题,请先阅读文档,如不能解决再联系<questions@FreeBSD.org>.
关于本文档的问题请发信联系 <doc@FreeBSD.org>.