Assembler guide (RukovodstvoRus.ru)

These days there is rarely a need to write pure assembly, but I definitely recommend it to anyone interested in programming. You will see things from a different angle, and the skills will come in handy when debugging code in other languages.

In this article we will write a reverse Polish notation (RPN) calculator from scratch in pure x86 assembly. When we are done, we will be able to use it like this:

$ ./calc "32+6*" # "(3+2)*6" in infix notation
30

All the code for this article is here. It is heavily commented and can serve as study material for those who already know assembly.

It is assumed that the reader has some C programming experience and basic knowledge of computer architecture (for example, what a CPU register is). Since we will be using Linux, you should also be comfortable with the Linux command line.

Setting up the environment

As already mentioned, we use Linux (64-bit or 32-bit). The code in this article does not work on Windows or Mac OS X.

The only tools you need are the GNU linker ld from binutils, which comes preinstalled on most distributions, and the NASM assembler. On Ubuntu and Debian you can install both with a single command:

$ sudo apt-get install binutils nasm

I would also recommend keeping an ASCII table at hand.

Hello, world!

To test the environment, save the following code in the file calc.asm:

; The linker looks for the _start symbol and begins executing the program
; from there.
global _start

; The .rodata section holds constants (read-only data).
; The order of sections does not matter, but I like to put it first.
section .rodata
    ; Declare a sequence of bytes labeled hello_world. NASM's db
    ; pseudo-instruction accepts a single-byte value, a string constant, or a
    ; combination of them, as here. 0xA = newline, 0x0 = string-terminating NUL
    hello_world: db "Hello world!", 0xA, 0x0

; Start of the .text section, which holds the program code
section .text
_start:
    mov eax, 0x04           ; put the number 4 into eax (0x04 = write())
    mov ebx, 0x1            ; file descriptor (1 = standard output, 2 = standard error)
    mov ecx, hello_world    ; pointer to the string to print
    mov edx, 14             ; number of bytes to write (14 includes the trailing NUL)
    int 0x80                ; raise interrupt 0x80, which the OS
                            ;   interprets as a system call

    mov eax, 0x01           ; 0x01 = exit()
    mov ebx, 0              ; 0 = no error
    int 0x80

The comments explain the overall structure. For a list of registers and common instructions, see the University of Virginia's "x86 Assembly Guide". It will be even more useful once we get to system calls.

The following commands assemble the source file into an object file and then link it into an executable:

$ nasm -f elf32 calc.asm -o calc.o
$ ld -m elf_i386 calc.o -o calc

After running it you should see:

$ ./calc
Hello world!

Makefile

This part is optional, but to simplify assembling and linking later on you can create a Makefile. Save it in the same directory as calc.asm:

CFLAGS= -f elf32
LFLAGS= -m elf_i386

all: calc

calc: calc.o
	ld $(LFLAGS) calc.o -o calc

calc.o: calc.asm
	nasm $(CFLAGS) calc.asm -o calc.o

clean:
	rm -f calc.o calc
        
.INTERMEDIATE: calc.o

Then, instead of the commands above, simply run make.

System calls

Linux system calls ask the OS to do something on our behalf. In this article we use only two system calls: write() to write a string to a file or stream (in our case standard output and standard error) and exit() to terminate the program:

syscall 0x01: exit(int error_code)
  error_code - use 0 to exit without errors and any other value (such as 1) for errors
syscall 0x04: write(int fd, char *string, int length)
  fd - use 1 for standard output, 2 for standard error
  string - pointer to the first character of the string
  length - length of the string in bytes

A system call is set up by storing the system call number in eax and then its arguments in ebx, ecx, and edx, in that order. You may notice that exit() takes only one argument; in that case the contents of ecx and edx do not matter.

eax	ebx	ecx	edx
System call number	arg1	arg2	arg3
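
As a rough point of comparison, here is how the same two system calls look from C, where the C library's write() and _exit() wrappers take care of the registers for us. This is only a sketch: print_fd and quit are hypothetical helper names that do not appear in the calculator, and the wrappers stand in for the raw int 0x80 sequence described above.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the NUL-terminated string s to descriptor fd.
 * Roughly what the assembly does with eax = 4 (write), ebx = fd,
 * ecx = s, edx = length, followed by int 0x80. */
ssize_t print_fd(int fd, const char *s) {
    return write(fd, s, strlen(s));
}

/* Hypothetical helper: roughly eax = 1 (exit), ebx = error_code, int 0x80. */
void quit(int error_code) {
    _exit(error_code);
}
```

Calling print_fd(1, "Hello world!\n") writes the greeting to standard output and normally returns 13, the number of bytes written.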

The call stack

The call stack is a data structure that stores information about each function call. Each call has its own section of the stack, a "frame". It holds information about the current call: the function's local variables and the return address (where the program should jump once the function finishes).

One non-obvious thing I should point out right away: the stack grows downward in memory. When you add something to the top of the stack, it is placed at a lower memory address than the previous element. In other words, as the stack grows, the address of the top of the stack decreases. To avoid confusion, I will keep reminding you of this.

The push instruction puts something on top of the stack, and pop takes it off. For example, push eax allocates space at the top of the stack and stores the value of eax there, while pop eax moves the data at the top of the stack into eax and frees that memory.

The purpose of the esp register is to point to the top of the stack. Any data above the top of the stack (that is, at addresses lower than esp) is considered not to be on the stack; it is garbage. Executing push (or pop) moves esp. You can also manipulate esp directly, as long as you know what you are doing.
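
To make the direction of growth concrete, here is a small C model of push and pop. It is only an illustration under my own assumptions: the machine struct, MEM_SIZE, push32 and pop32 are names I made up, and a real CPU of course does all this in hardware.

```c
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 64            /* hypothetical size of the simulated memory */

typedef struct {
    uint8_t  mem[MEM_SIZE];    /* simulated memory, indexed by address */
    uint32_t esp;              /* address of the current top of the stack */
} machine;

void machine_init(machine *m) {
    memset(m->mem, 0, sizeof m->mem);
    m->esp = MEM_SIZE;         /* empty stack: esp starts at the highest address */
}

/* push: move esp DOWN by 4 bytes, then store the value there */
void push32(machine *m, uint32_t value) {
    m->esp -= 4;
    memcpy(&m->mem[m->esp], &value, 4);
}

/* pop: load the value at esp, then move esp UP by 4 bytes */
uint32_t pop32(machine *m) {
    uint32_t value;
    memcpy(&value, &m->mem[m->esp], 4);
    m->esp += 4;
    return value;
}
```

After machine_init, two pushes take esp from 64 to 60 and then to 56; two pops return the values in reverse order and bring esp back to 64. The bytes below esp still contain the old values, which is exactly the "garbage" mentioned above.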

The ebp register is similar to esp, except that it always points roughly to the middle of the current stack frame, immediately before the current function's local variables (more on this later). However, calling another function does not move ebp automatically; you have to do that by hand every time.

The x86 calling convention

x86 has no built-in notion of a function as in high-level languages. The call instruction is essentially just a jmp (goto) to another memory address. To use subroutines like functions in other languages (which can take arguments and return data), we need to follow a calling convention (there are many, but we will use CDECL, the most popular convention for x86 among C compilers and assembly programmers). It also guarantees that a subroutine's registers are not clobbered when it calls another function.

Caller rules

Before calling a function, the caller must:

1. Save the caller-saved registers on the stack. The called function may modify certain registers; to avoid losing their contents, the caller must push them onto the stack first. These registers are eax, ecx, and edx. If you are not using some of them, you do not need to save them.
2. Push the function arguments onto the stack in reverse order (last argument first, first argument last). This ordering ensures that the called function finds its arguments on the stack in the correct order.
3. Call the subroutine.

Where possible, the function stores its result in eax. Immediately after the call, the caller must:

1. Remove the function arguments from the stack. This is usually done by simply adding the number of bytes to esp. Remember that the stack grows downward, so to remove something from the stack you add to esp.
2. Restore the saved registers by popping them off the stack in reverse order. The called function will not modify any other registers.

The following example shows how these rules are applied. Suppose the function _subtract takes two integer (4-byte) arguments and returns the first argument minus the second. In the subroutine _mysubroutine we call _subtract with the arguments 10 and 2:

_mysubroutine:
    ; ...
    ; some code here
    ; ...
    push ecx       ; save caller-saved registers (I chose not to save eax)
    push edx
    push 2         ; rule two: push the arguments in reverse order
    push 10
    call _subtract ; eax is now 10-2=8
    add esp, 8     ; remove 8 bytes from the stack (two 4-byte arguments)
    pop edx        ; restore the saved registers
    pop ecx
    ; ...
    ; more code, in which I use the remarkably useful value in eax
    ; ...

Callee rules

Before doing anything else, the called subroutine must:

1. Save the base pointer ebp of the previous frame by pushing it onto the stack.
2. Adjust ebp from the previous frame to the current one (the current value of esp).
3. Allocate space on the stack for local variables, if needed, by moving esp. Since the stack grows downward, you subtract the required number of bytes from esp.
4. Save the callee-saved registers on the stack. These are ebx, edi, and esi. Registers you do not plan to modify need not be saved.

[Diagrams: the call stack after step 1, after step 2, and after step 4]

The diagrams show a return address in every stack frame. It is pushed onto the stack automatically by the call instruction. The ret instruction pops the address off the top of the stack and jumps to it. We do not need to do anything with it ourselves; I show it only to explain why a function's local variables are 4 bytes above ebp while its arguments are 8 bytes below ebp.

The last diagram also shows that a function's local variables always start 4 bytes above ebp, at address ebp-4 (subtraction, because we are moving up the stack), while the function's arguments always start 8 bytes below ebp, at address ebp+8 (addition, because we are moving down the stack). If you follow the rules of this convention, this holds for the variables and arguments of any function.
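
The ebp+8 and ebp+12 offsets can be checked with a small C model of a cdecl call. Treat it as a sketch under my own assumptions: the simulated memory, FAKE_RETURN_ADDRESS, and the helper names are all made up for the illustration, and the "registers" are plain variables.

```c
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 128                 /* hypothetical size of the simulated memory */
#define FAKE_RETURN_ADDRESS 0x1234u  /* placeholder for what call would push */

static uint8_t  mem[MEM_SIZE];       /* simulated memory, indexed by address */
static uint32_t esp = MEM_SIZE;      /* empty stack: esp at the highest address */
static uint32_t ebp = MEM_SIZE;

static void push32(uint32_t v) { esp -= 4; memcpy(&mem[esp], &v, 4); }
static uint32_t pop32(void) { uint32_t v; memcpy(&v, &mem[esp], 4); esp += 4; return v; }
static uint32_t load32(uint32_t addr) { uint32_t v; memcpy(&v, &mem[addr], 4); return v; }

/* Models the callee, entered right after call has pushed the return address. */
static uint32_t subtract_callee(void) {
    push32(ebp);                     /* push ebp */
    ebp = esp;                       /* mov ebp, esp */
    /* Layout now: [ebp]    = saved ebp,      [ebp+4]  = return address,
     *             [ebp+8]  = first argument, [ebp+12] = second argument */
    uint32_t eax = load32(ebp + 8) - load32(ebp + 12);
    ebp = pop32();                   /* pop ebp */
    (void)pop32();                   /* ret: pop the return address */
    return eax;
}

/* Models the caller: push arguments, call, then clean up the stack. */
uint32_t call_subtract(uint32_t a, uint32_t b) {
    push32(b);                       /* arguments go on in reverse order */
    push32(a);
    push32(FAKE_RETURN_ADDRESS);     /* call: push the return address */
    uint32_t eax = subtract_callee();
    esp += 8;                        /* add esp, 8: drop the two arguments */
    return eax;
}
```

call_subtract(10, 2) returns 8 and leaves esp where it started. The saved ebp and the return address together occupy the 8 bytes between ebp and the first argument.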

When the function is done and you want to return, first set eax to the function's return value, if it has one. Then:

1. Restore the saved registers by popping them off the stack in reverse order.
2. Free the stack space allocated for local variables in step 3, if any: simply set esp to ebp (mov esp, ebp).
3. Restore the base pointer ebp of the previous frame by popping it off the stack.
4. Return with ret.

Now let's implement the _subtract function from our example:

_subtract:
    push ebp            ; save the base pointer of the previous frame
    mov ebp, esp        ; set up ebp
    ; Here I would allocate stack space for local variables, but I don't need any
    ; Here I would save callee-saved registers, but I'm not going to
    ; modify any
    ; The function body starts here
    mov eax, [ebp+8]    ; copy the function's first argument into eax. The brackets
                        ; mean a memory access at address ebp+8
    sub eax, [ebp+12]   ; subtract the second argument, at address ebp+12, from the
                        ; first argument
    ; The function body ends here; eax holds the return value
    ; Here I would restore registers, but none were saved
    ; Here I would free the local variables, but none were allocated
    pop ebp             ; restore the base pointer of the previous frame
    ret

Enter and leave

In the example above you may notice that a function always starts the same way: push ebp, mov ebp, esp, and allocating memory for local variables. The x86 instruction set has a convenient instruction that does all of this: enter a, b, where a is the number of bytes to allocate for local variables and b is the "nesting level", which we will always set to 0. Likewise, a function always ends with mov esp, ebp followed by pop ebp (the mov is only necessary when memory was allocated for local variables, but it does no harm otherwise). This too can be replaced by a single instruction: leave. With these changes:

_subtract:
    enter 0, 0          ; save the previous frame's base pointer and set up ebp
    ; Here I would save callee-saved registers, but I'm not going to
    ; modify any
    ; The function body starts here
    mov eax, [ebp+8]    ; copy the function's first argument into eax. The brackets
                        ; mean a memory access at address ebp+8
    sub eax, [ebp+12]   ; subtract the second argument, at address ebp+12, from
                        ; the first argument
    ; The function body ends here; eax holds the return value
    ; Here I would restore registers, but none were saved
    leave               ; restore the base pointer of the previous frame
    ret

Writing some basic functions

With the calling convention under our belt, we can start writing some subroutines. Why not generalize the code that prints "Hello world!" so that it prints any string: a function called _print_msg.

For this we will need another function, _strlen, to compute the length of a string. In C it might look like this:

size_t strlen(char *s) {
    size_t length = 0;
    while (*s != 0)
    {           // loop start
        length++;
        s++;
    }           // loop end
    return length;
}

In other words, starting from the beginning of the string, we add 1 to the return value for every character that is not zero. As soon as we see the NUL character, we return the value accumulated in the loop. In assembly this is fairly simple too; we can use the _subtract function we wrote earlier as a template:

_strlen:
    enter 0, 0          ; save the previous frame's base pointer and set up ebp
    ; Here I would save callee-saved registers, but I'm not going to
    ; modify any
    ; The function body starts here
    mov eax, 0          ; length = 0
    mov ecx, [ebp+8]    ; copy the function's first argument (a pointer to the
                        ; first character of the string) into ecx (it is
                        ; caller-saved, so we don't need to preserve it)
_strlen_loop_start:     ; this is a label we can jump to
    cmp byte [ecx], 0   ; dereference the pointer and compare the result with zero.
                        ; Neither operand implies a size here, so we have to
                        ; state it explicitly: byte means read just one byte
                        ; (one character)
    je _strlen_loop_end ; leave the loop when we hit the NUL
    inc eax             ; we are inside the loop now: add 1 to the return value
    add ecx, 1          ; advance to the next character of the string
    jmp _strlen_loop_start  ; jump back to the start of the loop
_strlen_loop_end:
    ; The function body ends here; eax holds the return value
    ; Here I would restore registers, but none were saved
    leave               ; restore the base pointer of the previous frame
    ret

Not so bad, right? Writing the code in C first can help, because most of it translates directly into assembly. Now we can use this function in _print_msg, where we apply everything we have learned:

_print_msg:
    enter 0, 0
    ; The function body starts here
    mov eax, 0x04       ; 0x04 = the write() system call
    mov ebx, 0x1        ; 0x1 = standard output
    mov ecx, [ebp+8]    ; we want to print this function's first argument
    ; first we need to set edx to the length of the string. Time to call _strlen
    push eax            ; save caller-saved registers (I chose not to save edx)
    push ecx
    push dword [ebp+8]  ; push _strlen's argument, which is _print_msg's own
                        ; argument. NASM complains if the size is omitted,
                        ; because a memory operand does not imply one.
                        ; A pointer is a dword (4 bytes, 32 bits)
    call _strlen        ; eax now holds the length of the string
    mov edx, eax        ; move the length into edx, where we need it
    add esp, 4          ; remove 4 bytes from the stack (one 4-byte char* argument)
    pop ecx             ; restore the caller-saved registers
    pop eax
    ; we are done with _strlen, so we can make the system call
    int 0x80
    leave
    ret

And now let's see the fruits of our hard work by using this function in a complete "Hello, world!" program.

_start:
    enter 0, 0
    ; save caller-saved registers (I chose not to save any)
    push hello_world    ; push the argument for _print_msg
    call _print_msg
    mov eax, 0x01           ; 0x01 = exit()
    mov ebx, 0              ; 0 = no error
    int 0x80

Believe it or not, we have covered all the main topics needed to write basic programs in x86 assembly! Now that we have all the introductory material and theory, we can focus entirely on the code and apply what we have learned to write our RPN calculator. The functions will be much longer and will even use some local variables. If you want to see the finished program right away, here it is.

For those unfamiliar with reverse Polish notation (sometimes called postfix notation), it evaluates expressions using a stack. So we need to create a stack, along with _pop and _push functions to manipulate it. We will also need a _print_answer function that prints the string representation of the numeric result at the end of the calculation.

Creating the stack

First we define space in memory for our stack, as well as a global variable stack_size. Since we will be modifying these variables, they must go in the .data section rather than .rodata.

section .data
    stack_size: dd 0        ; create a dword (4-byte) variable with the value 0
    stack: times 256 dd 0   ; fill the stack with zeros

Now we can implement the _push and _pop functions:

_push:
    enter 0, 0
    ; Save the registers we are going to use
    push eax
    push edx
    mov eax, [stack_size]
    mov edx, [ebp+8]
    mov [stack + 4*eax], edx    ; Store the argument on the stack. Scale the index by
                                ; four bytes to match the size of a dword
    inc dword [stack_size]      ; Add 1 to stack_size
    ; Restore the registers we used
    pop edx
    pop eax
    leave
    ret

_pop:
    enter 0, 0
    ; No callee-saved registers are modified, so there is nothing to save
    dec dword [stack_size]      ; First subtract 1 from stack_size
    mov eax, [stack_size]
    mov eax, [stack + 4*eax]    ; Load the number at the top of the stack into eax
    ; Here I would restore registers, but none were saved
    leave
    ret
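
For comparison, a C sketch of the same pair of functions might look like the following. The bounds checks are my own addition and an assumption: the assembly versions above perform no such checks, and STACK_CAPACITY is a name I introduced for the 256 reserved dwords.

```c
#include <assert.h>

#define STACK_CAPACITY 256      /* matches "times 256 dd 0" in the .data section */

static int stack[STACK_CAPACITY];
static int stack_size = 0;

void push(int value) {
    assert(stack_size < STACK_CAPACITY);  /* assumption: the assembly has no check */
    stack[stack_size] = value;            /* mov [stack + 4*eax], edx */
    stack_size++;                         /* inc dword [stack_size] */
}

int pop(void) {
    assert(stack_size > 0);               /* assumption: the assembly has no check */
    stack_size--;                         /* dec dword [stack_size] */
    return stack[stack_size];             /* mov eax, [stack + 4*eax] */
}
```

Note that this stack, unlike the call stack, grows upward: stack_size is an index that increases on push, and the array indexing in C hides the multiplication by 4 that the assembly spells out.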

Printing numbers

_print_answer is much more complicated: we have to convert numbers to strings and use several other functions. We will need a _putc function that prints a single character, a _mod function that computes the remainder of dividing its first argument by its second, and _pow_10 to raise 10 to a given power. You will see later why we need them. They are fairly simple; here is the code:

_pow_10:
    enter 0, 0
    mov ecx, [ebp+8]    ; set ecx (caller-saved) to the function's
                        ; argument
    mov eax, 1          ; the zeroth power of 10 (10**0 = 1)
_pow_10_loop_start:     ; multiply eax by 10 while ecx is not 0
    cmp ecx, 0
    je _pow_10_loop_end
    imul eax, 10
    sub ecx, 1
    jmp _pow_10_loop_start
_pow_10_loop_end:
    leave
    ret

_mod:
    enter 0, 0
    push ebx
    mov edx, 0          ; explained below
    mov eax, [ebp+8]
    mov ebx, [ebp+12]
    idiv ebx            ; divides the 64-bit integer [edx:eax] by ebx. We only want
                        ; to divide the 32-bit integer eax, so we set edx to
                        ; zero (this assumes eax is not negative).
                        ; The quotient is stored in eax, the remainder in edx. As
                        ; usual, details on a particular instruction can be found in
                        ; the references listed at the end of the article.
    mov eax, edx        ; return the remainder (modulus)
    pop ebx
    leave
    ret

_putc:
    enter 0, 0
    mov eax, 0x04       ; write()
    mov ebx, 1          ; standard output
    lea ecx, [ebp+8]    ; address of the input character
    mov edx, 1          ; print only 1 character
    int 0x80
    leave
    ret
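
The remark in _mod about edx deserves a closer look. Below is a C model of what idiv computes from the edx:eax pair, showing why zeroing edx is only safe for non-negative dividends. It is a sketch: idiv_sim, cdq_sim and idiv_result are names I made up, the conversion to int64_t assumes the usual two's-complement behavior, and the model ignores the fault a real CPU raises when the quotient does not fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical result type: what idiv leaves in eax and edx. */
typedef struct {
    int32_t quotient;   /* ends up in eax */
    int32_t remainder;  /* ends up in edx */
} idiv_result;

/* Models "idiv ebx": divide the signed 64-bit value edx:eax by ebx. */
idiv_result idiv_sim(uint32_t edx, uint32_t eax, int32_t ebx) {
    /* edx supplies the high 32 bits of the dividend, eax the low 32 bits */
    int64_t dividend = (int64_t)(((uint64_t)edx << 32) | eax);
    idiv_result r;
    r.quotient  = (int32_t)(dividend / ebx);
    r.remainder = (int32_t)(dividend % ebx);
    return r;
}

/* Models the cdq instruction: the value edx receives when eax is sign-extended. */
uint32_t cdq_sim(uint32_t eax) {
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}
```

With edx = 0, dividing 7 by 2 gives quotient 3 and remainder 1, as expected. With eax holding -7 and edx = 0, the dividend is read as 4294967289 and the quotient comes out as 2147483644; only after sign-extending with cdq_sim does the division yield -3 with remainder -1.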

So, how do we print the individual digits of a number? First, note that the last digit of a number is the remainder of dividing it by 10 (for example, 123 % 10 = 3), and the next digit is the remainder of dividing by 100, divided by 10 (for example, (123 % 100)/10 = 2). In general, we can find a particular digit of a number (counting from right to left) as (number % 10**n) / 10**(n-1), where the units digit corresponds to n = 1, the tens digit to n = 2, and so on.

Using this, we can find all the digits of a number from n = 1 to n = 10 (the maximum number of digits in a signed 4-byte integer). But it is much easier to go from left to right: that way we can print each character as soon as we find it and skip the zeros on the left. So we iterate from n = 10 down to n = 1.
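
Here is the digit formula on its own as a small C sketch. The names pow10_i64 and digit_at are hypothetical, and using 64-bit intermediates is my own choice rather than something the calculator does: 10**10 does not fit in 32 bits, so a 32-bit version of the n = 10 step is only reliable for smaller values.

```c
#include <stdint.h>

/* 10**n computed in 64 bits, so that 10**10 does not overflow. */
int64_t pow10_i64(int n) {
    int64_t result = 1;
    while (n-- > 0) {
        result *= 10;
    }
    return result;
}

/* The n-th digit from the right (n = 1 is the units digit) of a
 * non-negative number: (number % 10**n) / 10**(n-1). */
int digit_at(int32_t number, int n) {
    return (int)((number % pow10_i64(n)) / pow10_i64(n - 1));
}
```

For 123 this gives digit_at(123, 1) = 3, digit_at(123, 2) = 2, digit_at(123, 3) = 1, and 0 for every larger n, which are the leading zeros we want to skip.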

In C the program would look something like this:

#define MAX_DIGITS 10
void print_answer(int a) {
    if (a < 0) { // if the number is negative
        putc('-'); // print a minus sign
        a = -a; // convert to a positive number
    }
    int started = 0;
    for (int i = MAX_DIGITS; i > 0; i--) {
        int digit = (a % pow_10(i)) / pow_10(i-1);
        if (digit == 0 && started == 0) continue; // don't print leading zeros
        started = 1;
        putc(digit + '0');
    }
}

Now you see why we need these three functions. Let's implement this in assembly:

%define MAX_DIGITS 10

_print_answer:
    enter 4, 0              ; reserve 4 bytes; the byte at ebp-4 holds the
                            ; "started" variable from the C code
    push ebx
    push edi
    push esi
    mov eax, [ebp+8]        ; our argument "a"
    cmp eax, 0              ; if the number is not negative, skip this
                            ; conditional block
    jge _print_answer_negate_end
    ; call putc for '-'
    push eax
    push 0x2d               ; the '-' character
    call _putc
    add esp, 4
    pop eax
    neg eax                 ; convert to a positive number
_print_answer_negate_end:
    mov byte [ebp-4], 0     ; started = 0
    mov ecx, MAX_DIGITS     ; the variable i
_print_answer_loop_start:
    cmp ecx, 0
    je _print_answer_loop_end
    ; call pow_10 for ecx. We will work towards making ebx the "digit" variable
    ; from the C code. For now, we set edx = pow_10(i-1) and ebx = pow_10(i)
    push eax
    push ecx
    dec ecx             ; i-1
    push ecx            ; first argument for _pow_10
    call _pow_10
    mov edx, eax        ; edx = pow_10(i-1)
    add esp, 4
    pop ecx             ; restore the value of i in ecx
    pop eax
    ; end pow_10 call
    mov ebx, edx        ; digit = ebx = pow_10(i-1)
    imul ebx, 10        ; digit = ebx = pow_10(i)
    ; call _mod for (a % pow_10(i)), that is (eax mod ebx)
    push eax
    push ecx
    push edx
    push ebx            ; arg2, ebx = digit = pow_10(i)
    push eax            ; arg1, eax = a
    call _mod
    mov ebx, eax        ; digit = ebx = a % pow_10(i), almost there
    add esp, 8
    pop edx
    pop ecx
    pop eax
    ; end mod call
    ; divide ebx (the "digit" variable) by pow_10(i-1) (edx). We have to save a
    ; couple of registers, because idiv uses both edx and eax for the division.
    ; Since edx holds our divisor, we move it to some
    ; other register
    push esi
    mov esi, edx
    push eax
    mov eax, ebx
    mov edx, 0
    idiv esi            ; eax holds the result (the digit)
    mov ebx, eax        ; ebx = (a % pow_10(i)) / pow_10(i-1), the "digit" variable in the C code
    pop eax
    pop esi
    ; end division
    cmp ebx, 0                        ; if digit == 0
    jne _print_answer_trailing_zeroes_check_end
    cmp byte [ebp-4], 0               ; and started == 0
    jne _print_answer_trailing_zeroes_check_end
    jmp _print_answer_loop_continue   ; continue
_print_answer_trailing_zeroes_check_end:
    mov byte [ebp-4], 1     ; started = 1
    add ebx, 0x30           ; digit + '0'
    ; call putc
    push eax
    push ecx
    push edx
    push ebx
    call _putc
    add esp, 4
    pop edx
    pop ecx
    pop eax
    ; end putc call
_print_answer_loop_continue:
    sub ecx, 1
    jmp _print_answer_loop_start
_print_answer_loop_end:
    pop esi
    pop edi
    pop ebx
    leave
    ret

That was an ordeal! I hope the comments help you make sense of it. If you are now thinking "Why can't I just write printf("%d")?", you will like the end of the article, where we replace the function with exactly that!

Now we have all the functions we need; all that is left is to implement the main logic in _start, and that's it!

Evaluating reverse Polish notation

As we said, reverse Polish notation is evaluated using a stack. When a number is read, it is pushed onto the stack; when an operator is read, it is applied to the two items on top of the stack.

For example, if we want to evaluate 84/3+6* (this expression can also be written as 6384/+*), the process looks like this:

Step	Symbol	Stack before	Stack after
1	`8`	`[]`	`[8]`
2	`4`	`[8]`	`[8, 4]`
3	`/`	`[8, 4]`	`[2]`
4	`3`	`[2]`	`[2, 3]`
5	`+`	`[2, 3]`	`[5]`
6	`6`	`[5]`	`[5, 6]`
7	`*`	`[5, 6]`	`[30]`

If the input is a valid postfix expression, then at the end of the calculation only one element remains on the stack: that is the answer, the result of the calculation. In our case it is 30.

In assembly we need to implement something like this C code:

int stack[256];         // 256 is probably more than our stack needs
int stack_size = 0;

int main(int argc, char *argv[]) {
    char *input = argv[1];
    size_t input_length = strlen(input);
    
    for (int i = 0; i < input_length; i++) {
        char c = input[i];
        if (c >= '0' && c <= '9') { // if the character is a digit
            push(c - '0'); // convert the character to an integer and push it onto the stack
        } else {
            int b = pop();
            int a = pop();
            if (c == '+') {
                push(a+b);
            } else if (c == '-') {
                push(a-b);
            } else if (c == '*') {
                push(a*b);
            } else if (c == '/') {
                push(a/b);
            } else {
                error("Invalid input\n");
                exit(1);
            }
        }
    }
    
    if (stack_size != 1) {
        error("Invalid input\n");
        exit(1);
    }
    
    print_answer(stack[0]);
    exit(0);
}

We now have all the functions needed to implement this, so let's get started.

_start:
    ; _start does not receive its arguments the way other functions do.
    ; Instead, esp points directly at argc (the argument count), and
    ; esp+4 is where argv begins. So [esp+4] is the program name (argv[0]),
    ; [esp+8] is the first argument (argv[1]), and so on
    mov esi, [esp+8]         ; esi = "input" = argv[1]
    ; call _strlen to find the length of the input
    push esi
    call _strlen
    mov ebx, eax             ; ebx = input_length
    add esp, 4
    ; end _strlen call
    mov ecx, 0               ; ecx = "i"
_main_loop_start:
    cmp ecx, ebx             ; if (i >= input_length), leave the loop
    jge _main_loop_end
    mov edx, 0
    mov dl, [esi + ecx]      ; load one byte from memory into the low byte of
                             ; edx. The rest of edx was zeroed above.
                             ; edx = the variable c = input[i]
    cmp edx, '0'
    jl _check_operator
    cmp edx, '9'
    jg _print_error
    sub edx, '0'
    mov eax, edx             ; eax = the variable c - '0' (a digit, not a character)
    jmp _push_eax_and_continue
_check_operator:
    ; call _pop twice to pop the variable b into edi and the variable a into eax
    push ecx
    push ebx
    call _pop
    mov edi, eax             ; edi = b
    call _pop                ; eax = a
    pop ebx
    pop ecx
    ; end call _pop
    cmp edx, '+'
    jne _subtract
    add eax, edi                 ; eax = a+b
    jmp _push_eax_and_continue
_subtract:
    cmp edx, '-'
    jne _multiply
    sub eax, edi                 ; eax = a-b
    jmp _push_eax_and_continue
_multiply:
    cmp edx, '*'
    jne _divide
    imul eax, edi                ; eax = a*b
    jmp _push_eax_and_continue
_divide:
    cmp edx, '/'
    jne _print_error
    push edx                     ; save edx, because idiv overwrites it
    cdq                          ; sign-extend eax into edx:eax so that a negative a divides correctly
    idiv edi                     ; eax = a/b
    pop edx
    ; now push eax onto the stack and continue
_push_eax_and_continue:
    ; call _push
    push eax
    push ecx
    push edx
    push eax          ; first argument
    call _push
    add esp, 4
    pop edx
    pop ecx
    pop eax
    ; end call _push
    inc ecx
    jmp _main_loop_start
_main_loop_end:
    cmp dword [stack_size], 1     ; if (stack_size != 1), print an error
    jne _print_error
    mov eax, [stack]
    push eax
    call _print_answer
    ; print a final newline
    push 0xA
    call _putc
    ; exit successfully
    mov eax, 0x01           ; 0x01 = exit()
    mov ebx, 0              ; 0 = no error
    int 0x80                ; execution ends here
_print_error:
    push error_msg
    call _print_msg
    mov eax, 0x01
    mov ebx, 1
    int 0x80
We also need to add the error_msg string to the .rodata section:

section .rodata
    ; Declare a sequence of bytes labeled error_msg. NASM's db pseudo-instruction
    ; accepts a single-byte value, a string constant, or a
    ; combination of them. 0xA = newline, 0x0 = string-terminating NUL
    error_msg: db "Invalid input", 0xA, 0x0

And we are done! Impress all your friends, if you have any. I hope you now feel a little warmer toward high-level languages, especially considering that many older programs were written entirely or almost entirely in assembly, for example the original RollerCoaster Tycoon!

All the code is here. Thanks for reading! I can continue if you are interested.

Next steps

You can practice by implementing a few additional features:

Print an error message instead of segfaulting when the program is given no argument.
Support extra whitespace between operands and operators in the input.
Support multi-digit operands.
Allow negative numbers as input.
Replace _strlen with the function from the C standard library, and replace _print_answer with a call to printf.

Further reading

"x86 Assembly Guide" from the University of Virginia: a more detailed treatment of many of the topics covered here, including more information on all the common x86 instructions.
"The Art of Picking Intel Registers". Although most x86 registers are general-purpose, many of them have a historical purpose. Following these conventions can improve code readability and, as an interesting side effect, even slightly reduce the size of your binaries.
NASM: Intel x86 Instruction Reference: a complete guide to all the obscure x86 instructions.

Programming in assembly language tutorial

This tutorial covers AMD64/Intel 64 bit programming. Instruction sets for other processors, such as ARM or RISC-V are radically different, though the concepts are the same. They all have instructions, registers, stacks, and so on. Once you know one processor’s assembly language, adapting to a different processor is rather easy.

I found that I was writing code for a new processor within hours, and writing quality code within a week or two. This is going from Z80 to 6502 to 6809 to 8086 to 68000 and so on. It is interesting to be able to look at a processor’s technical manuals and evaluate the power and flexibility of its instruction set.

This tutorial is aimed at novices and beginners who want to learn the first thing about assembly language programming. If you are an expert, you may or may not get a lot out of this.

Programming in assembly language tutorial
- Introduction
- Bits, Bytes, Words, and Number Bases
- Math
- Boolean Algebra
- Bit Shifting
- Memory
- ELF Files and the Loader
- Permissions
- MMU
  - Paging and Swapping
- Other exceptions
  - Segfault
  - Divide By Zero
  - Invalid Opcode
  - General Protection
- ALU
- x64/AMD64 Registers
  - General Purpose Registers
  - Special Purpose Registers
  - CPU Control Registers
    - Stack
    - Instruction Pointer
    - Flags
AMD64 Instruction Set
- Assembly source
- Addressing Modes
  - Register Operands
  - Direct Memory Operands
    - Indirect Operands
    - Indirect with Displacement
    - Indirect with displacement and scaled index
Commonly Used Instructions
- Arithmetic
- Boolean Algebra
- Branching and Subroutines
- Bit Manipulation
- Register Manipulation, Casting/Conversions
- Flags Manipulation
- Stack Manipulation
Assembler Source, Directives, and Macros
- Assembler Directives
  - section type
  - bits 16, bits 32, and bits 64, use16, use32, use64
  - Comments
  - Constants
  - Program Variables and Strings
  - Assembler Variables and Labels
  - Repetition
  - Macros
  - Conditional Assembly
  - Alignment
  - Structures
  - Includes
Hello, World
- MacOS Version
- Linux version
- How it works
  - Linux Syscalls
  - MacOS Syscalls

Introduction

How CPUs work has become something of a lost art. Only a small percentage of software engineers need to understand the inner workings of CPUs: typically those who work on embedded software, operating systems, compilers, or JIT compilers.

Assembly language was one of the first languages I ever learned. Back in the early/mid 1970s, my high school classes progressed from BASIC to FORTRAN IV, to BAL (Basic Assembly Language) for the IBM 360 to which we had access. One of the earliest lessons we were taught used a cardboard teaching aid, CARDIAC. CARDIAC stands for "CARDboard Illustrative Aid to Computation"; it was developed at Bell Labs, which was a big deal back then (Unix was invented there, as well as the C programming language).

See https://www.cs.drexel.edu/~bls96/museum/cardiac.html.

With CARDIAC, you simulated the memory, operation, and CPU cycles of a mythical CPU. The numbers and instructions for this CPU were in base 10, so the student doesn't have to understand how to convert to the base 2, base 8, or base 16 commonly used in computing. CARDIAC provided a cardboard device that had representations for memory, program steps, and the ALU (math and logic operations).

You wrote your program and variables on the cardboard and then step by step, followed the program and performed the operations for each step. The steps are identified by a single digit, 0-9:

0 INP read a card into memory
1 CLA clear accumulator and add from memory
2 ADD add from memory to accumulator
3 TAC test accumulator and jump if negative
4 SFT shift accumulator
5 OUT write memory location to output card
6 STO store accumulator to memory
7 SUB subtract memory from accumulator
8 JMP jump and save PC
9 HRS halt and reset

These values are «opcodes» and the encoded instructions/steps include the opcode plus address, number of bits to shift, etc.

The CPU features only two registers: accumulator and program counter. More complex and modern CPUs have many more registers than these two.

These instructions and registers are enough to learn from. You learn about memory layout, instruction opcodes, instruction encoding, memory access, and so on.

In this tutorial, I will cover the basics of programming the x64/AMD64 CPU in assembly language. As I progress, you will see how the CPU is really a glorified version of CARDIAC!

Bits, Bytes, Words, and Number Bases

The smallest piece of information that a CPU processes is a «bit.» A bit is a small integer or boolean type value, either 0 (off/false) or 1 (on/true).

Bits are then organized as "bytes", or 8 bits grouped together. You can visualize a byte like this:

7 6 5 4 3 2 1 0

The digits represent what we call a bit number, and each bit (bits 0-7) may be a 0 or a 1. A byte can represent an unsigned value of 0-255, or a signed value of -128 to 127. Bit 7 of the byte is considered the "sign bit": if it is 1, then the byte as a signed value is negative; if it is 0, the value is zero or positive. Note that you decide whether the byte is processed as signed or unsigned; more on this later, but for now it is important to understand how bits make up bytes and how signed and unsigned values are represented.

A "word" is two bytes grouped together, which means we have 16 bits together. You can visualize a word like this:

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

The high-order bit, which is the sign bit, is bit 15.

The x86 also has DWORD values, which are two words combined. It also has QWORD values which are two DWORDs combined. The pattern is the same for any of these size values — the high bit is the sign bit, etc.

From this point forward, I’ll use «word» to mean one of these sized values, unless otherwise stated.

When we talk about the value of the word, we typically use base 2, base 8, base 10, and base 16. Of these, base 8 isn't used much, but I'll explain a common use case for base 8.

In base 2 (also called "binary"), we simply talk about the value as the bits. That is, an unsigned byte might be 11111111, or 11101110, and so on. We might add a leading 0 and a terminating b for clarity (and this is the syntax used in assembly programming): 011111111b.

Base 10 is the number base we use every day. You count from 0 to 9 in each digit position. When you add 1 to a 9, you clear that digit (set it to 0) and bump the digit to its left. That is, 9+1 becomes 10. Going right to left in base 10, a digit n is worth n x 1 (10 to the power of 0), then n x 10 (10 to the power of 1), then n x 100 (10 to the power of 2), and so on.

In base 2, we count from 0 to 1 in each digit position. When you add 1 to a 1 in some position, you clear it and carry into the next higher bit (and continue until you reach a bit that is 0, which becomes 1). Going right to left in base 2, a digit n is worth n x 1 (2 to the power of 0), then n x 2 (2 to the power of 1), then n x 4 (2 to the power of 2), and so on.

In base 8 (also called «octal»), we count from 0-7 for each digit position. Going right to left, n x 8 to the power of 0, n x 8 to the power of 1, n x 8 to the power of 2, etc.

In base 16 (also called "hex"), we count from 0-15 in each digit position. The digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, and the next value after F is 10. So going from right to left in a hex number, the digits are n x 16 to the power of 0, n x 16 to the power of 1, n x 16 to the power of 2, and so on.

A "nybble" is useful for working with hex. A nybble is 4 bits. It turns out that the value you can store in 4 bits is 0-15, perfect for one hex digit. So each hex digit stands for exactly one nybble, and each nybble position is worth 16 times the one to its right.
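The positional rule is the same in every base, which a short C sketch can make concrete. The function name parse_in_base is made up for this illustration; it multiplies the running value by the base and adds each new digit, which is equivalent to summing digit x base to the power of its position.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Convert a string of digits in the given base (2 to 16) to its value.
 * Hypothetical helper for illustration only. */
unsigned long parse_in_base(const char *digits, unsigned base)
{
    assert(digits != NULL);
    assert(base >= 2 && base <= 16);

    unsigned long value = 0;
    for (const char *p = digits; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        unsigned digit;
        if (isdigit(c))
            digit = (unsigned)(c - '0');
        else
            digit = (unsigned)(toupper(c) - 'A') + 10u;  /* A-F are worth 10-15 */
        assert(digit < base);           /* reject digits this base does not have */
        value = value * base + digit;   /* move everything one position left, add digit */
    }
    return value;
}
```

With this sketch, parse_in_base("A5", 16) and parse_in_base("10100101", 2) both give 165.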

Let’s look at the unsigned value ranges for the common word sizes:

1 bit: 0-1
2 bits: 0-3
3 bits: 0-7
4 bits: 0-15
5 bits: 0-31
...

The pattern here is that the max value is 2 raised to the number of bits, minus 1. That is, for 5 bits, the max value 31 is 2 to the 5th power (32) minus 1.
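As a quick check of that rule, here is a minimal C sketch. The name max_unsigned is invented for the example, and the 64-bit case is handled separately because shifting a 64-bit value by 64 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Largest unsigned value that fits in `bits` bits: 2^bits - 1.
 * Hypothetical helper for illustration only. */
uint64_t max_unsigned(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64)
        return UINT64_MAX;              /* 1 << 64 would be undefined in C */
    return (UINT64_C(1) << bits) - 1;
}
```

max_unsigned(5) is 31, and max_unsigned(8) is 255, the byte range given earlier.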

When we convert a binary byte to hex, we visualize it something like this:

7 6 5 4   3 2 1 0

We've grouped the bits as two nybbles. We can then convert the two nybbles (4 bits each) to two hex digits.

This table makes the conversion simple, and if you practice using hex, you will soon know it by heart:

0000 = 0    0100 = 4    1000 = 8    1100 = C
0001 = 1    0101 = 5    1001 = 9    1101 = D
0010 = 2    0110 = 6    1010 = A    1110 = E
0011 = 3    0111 = 7    1011 = B    1111 = F

For example, we visualize the binary value 010100101b as 1010 0101. Using the table above, we see 1010 is A, and 0101 is 5. So the byte value is A5. We represent hex numbers in assembly as 0xa5, or 0a5h, or sometimes $a5.
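The nybble-by-nybble conversion can be sketched in C as follows. The function name byte_to_hex is an assumption of this example: it splits the byte with a shift and a mask, then uses each nybble as an index into a digit table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the two hex digits of `value` into out, plus a terminating NUL.
 * Hypothetical helper for illustration only. */
void byte_to_hex(uint8_t value, char out[3])
{
    static const char digits[] = "0123456789ABCDEF";

    assert(out != NULL);
    out[0] = digits[value >> 4];      /* high nybble: bits 7-4 */
    out[1] = digits[value & 0x0F];    /* low nybble: bits 3-0 */
    out[2] = '\0';
}
```

Passing 0xA5 fills the buffer with the string "A5".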

We can use the same scheme to convert 16 bit or 32 bit or 64 bit values to hex!

I promised to discuss a use for octal, something we might use every day. In the Linux/Mac/*nix filesystem, permissions are actually octal values.

-rw-r--r--  1 mschwartz  staff   5.9K Feb 16 14:13 README.md

See the -rw-r--r--? After the leading file-type character, what we have here is 9 bits, written as three octal digits. rw- is 110 (6), r-- is 100 (4), and r-- is 100 (4). So we can convert this to the internal filesystem representation of 644. If you want to make a file rw-r--r--, you use the chmod command:

chmod 644 README.md

The three bits, technically, are "able to read", "able to write", and "able to execute." The first octal value is for the owner, the second is for anyone in the same user group as the owner, and the third is for everyone else. So to allow the owner and group to read and write, but nobody else to read or write the file, we want rw-rw---- or 660. To set a file to be executable, I typically use chmod 755.
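To see the 9 bits directly, here is a small C sketch that turns a permission string into its numeric mode. The function name mode_from_string is hypothetical, and the sketch assumes the string holds exactly the nine permission characters with no leading file-type character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert a 9-character string such as "rw-r--r--" to its mode value.
 * Any character other than '-' counts as a 1 bit.
 * Hypothetical helper for illustration only. */
unsigned mode_from_string(const char *perm)
{
    assert(perm != NULL && strlen(perm) == 9);

    unsigned mode = 0;
    for (int i = 0; i < 9; i++) {
        mode <<= 1;                 /* make room for the next bit */
        if (perm[i] != '-')
            mode |= 1;              /* permission granted: set the bit */
    }
    return mode;
}
```

In C, a literal with a leading 0 is octal, so mode_from_string("rw-r--r--") equals 0644.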

Math

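Adding two values of the same word size is simple. The byte 100 plus the byte 50 is the byte 150.

This works for signed and unsigned values. The CPU adds the bits the same way in both cases; whether you read the result as signed or unsigned is up to you. If the high order bit (bit 7 of a byte, bit 15 of a 16-bit word...) is 1, the value read as signed is negative. For example, the result 150 (010010110b) has bit 7 set, so read as a signed byte it is -106.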

What happens when we add a byte value to a 16-bit word value? The byte value is really a 16-bit value, but the upper 8 bits are zeros. That is, 0xaa can be visualized as 0x00aa. We just add the full 16-bit values together.

What happens when we add 1 to a byte size value of 255? We only have 8 bits for the result, but we have 9 bits of actual value. That is, 255 + 1 is 256. Represented in binary, you have 255 = 011111111b + 1 = 0100000000b (9 bits!). The 9th bit is basically ignored as far as the result byte goes (more on this later). So if you look at the lower 8 bits of our 9 bit result, we get 0!

All this extends to 32 bit and 64 bit words.

Multiplication of two values requires a double-sized result, or you lose a lot more than just the 9th bit! Consider 255 x 255 = 65025 (0xfe01), which fits in 16 bits but not in 8. If we have a byte result, we get 0x01 due to the overflow, losing over 65000 in result value.
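Both overflow cases can be reproduced in C with fixed-width types. The function names below are made up for this sketch; the casts to uint8_t keep only the low 8 bits, which is what a byte-sized result register would hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two bytes and keep only the low 8 bits of the sum. */
uint8_t add_bytes(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Multiply two bytes; report the full 16-bit product and the
 * byte that survives if the result is stored in only 8 bits. */
void multiply_bytes(uint8_t a, uint8_t b, uint16_t *full, uint8_t *truncated)
{
    assert(full != NULL && truncated != NULL);
    *full = (uint16_t)(a * b);
    *truncated = (uint8_t)*full;
}
```

add_bytes(255, 1) returns 0, and multiplying 255 by 255 gives a full product of 0xFE01 but a truncated byte of 0x01.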

Boolean Algebra

Boolean Algebra is a form of math that we use to deal with true/false values. We use Boolean Algebra all the time in various programming languages, with operators like & (AND), | (OR), ^ (exclusive OR, or XOR), ! (NOT), ~ (also NOT), and so on. These operators are equivalent to "math-like" operators.

The simplest way to visualize Boolean Algebra is using single bit values and truth tables. 0 = false, 1 = true. For single bit value operands, there are only (always) 4 combinations possible.

AND (if both operands are true, the result is true)
0 & 0 = 0
0 & 1 = 0
1 & 0 = 0
1 & 1 = 1

OR (if either operand is true, the result is true)
0 | 0 = 0
0 | 1 = 1
1 | 0 = 1
1 | 1 = 1

XOR (if only one operand is true, the result is true)
0 ^ 0 = 0
0 ^ 1 = 1
1 ^ 0 = 1
1 ^ 1 = 0

The ! (NOT) operator has only one operand. If the operand is true, the result is false. If the operand is false, the result is true.

The ~ (1's complement) operator inverts the state of every bit in the word. For a single bit, ! and ~ give the same answer; for wider values they differ, because ! treats the whole value as one true/false condition.

If we look at the operands as byte values, we have something like:

00000000 & 00000000 = 0
00000000 & 00000001 = 0
...

BUT, we have 8 bits, so the operation is performed on all 8 bits in the two operands.

   10000000 
OR 00000001 
   --------
   ^      ^
=  10000001
   ^      ^
   
NOT 10000001
=   01111110

This is a most important concept to grasp!
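To make the "all 8 bits at once" idea concrete, the following C sketch applies the single-bit OR truth table to each bit position in turn and checks that the answer matches the one-step | operator. The function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* OR two bytes one bit position at a time. */
uint8_t or_bit_by_bit(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    for (int bit = 0; bit < 8; bit++) {
        unsigned x = (a >> bit) & 1u;          /* bit `bit` of a */
        unsigned y = (b >> bit) & 1u;          /* bit `bit` of b */
        result |= (uint8_t)((x | y) << bit);   /* single-bit OR, put back in place */
    }
    assert(result == (uint8_t)(a | b));        /* same answer as the one-step operator */
    return result;
}

/* NOT of a byte: every one of the 8 bits is inverted. */
uint8_t not_byte(uint8_t value)
{
    return (uint8_t)~value;
}
```

or_bit_by_bit(0x80, 0x01) gives 0x81, and not_byte(0x81) gives 0x7E, matching the diagram above.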

We use the Boolean Algebra operators on words to achieve useful results.

A typical use of the AND operator is to clear bits in a value. If we AND with the complement of a power of 2, we are simply clearing one bit. Counting bits from 0, 4 is bit 2, so n AND (NOT 4) clears bit 2 in n. Note that n AND 4 does the opposite: it keeps bit 2 and clears every other bit, which is how you test a single bit.

A typical use of the OR operator is to set bits in a value. If we OR with a value that is a power of 2, we are simply setting a bit. n OR 4 sets bit 2 in n.

A great use of the AND operator is to do a modulo of an unsigned number by a power of 2. For example, AND with 3 gets you n modulo 4, a result between 0 and 3. AND with 7 gets you n modulo 8, a result between 0 and 7.
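The modulo trick works because a power of 2 minus 1 is a run of low 1 bits. A minimal C sketch, with the hypothetical name mod_pow2, looks like this:

```c
#include <assert.h>

/* Compute n modulo m for unsigned n, where m must be a power of 2.
 * Hypothetical helper for illustration only. */
unsigned mod_pow2(unsigned n, unsigned m)
{
    assert(m != 0 && (m & (m - 1)) == 0);   /* exactly one bit set in m */
    return n & (m - 1);                     /* keep only the bits below that bit */
}
```

mod_pow2(13, 4) is 1 and mod_pow2(13, 8) is 5, the same answers as 13 % 4 and 13 % 8.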

Bit Shifting

You can shift a byte to the left (<< operator in C) 1-7 bits. For example:

001111101b << 1 = 011111010b

 001111101b  shifted left becomes
  ////////
 011111010b  (bit 0 becomes 0)

Note that we have the overflow problem here, as we did with addition. The old upper bit (bit 7) ends up in the "bit bucket" (thrown away).

A left shift of 1 bit is effectively a multiply by 2. Consider 001b<<1 is 010b, or 2. A left shift of 2 bits is a multiply by 4, and so on.

Shifting to the right works similarly, but we now end up with the high bit being cleared and the low bit in the bit bucket.

A right shift of 1 bit is effectively a divide by 2. But this right shift will take a negative number and make it positive because the sign bit is cleared. So we need a second kind of right shift (arithmetic shift right) for signed values that sets the high bit in the result to the high bit in the initial value.

A rotate left/right is the same as a shift, except instead of the lost bit ending up in the bit bucket, it becomes the new high/low bit.
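The different kinds of shift can be modeled on bytes in C. All four function names are assumptions of this sketch. The arithmetic shift is written out by hand because right-shifting a negative signed value in C is implementation-defined, so the sketch refills the vacated high bits explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Shift left: high bits fall into the bit bucket, zeros enter at bit 0. */
uint8_t shift_left(uint8_t v, unsigned n)
{
    assert(n < 8);
    return (uint8_t)(v << n);
}

/* Logical shift right: zeros enter at bit 7. */
uint8_t shift_right_logical(uint8_t v, unsigned n)
{
    assert(n < 8);
    return (uint8_t)(v >> n);
}

/* Arithmetic shift right: copies of the old sign bit enter at bit 7. */
uint8_t shift_right_arithmetic(uint8_t v, unsigned n)
{
    assert(n < 8);
    uint8_t shifted = (uint8_t)(v >> n);
    if (v & 0x80)                                   /* sign bit was set */
        shifted |= (uint8_t)(0xFF << (8 - n));      /* refill the top n bits with 1s */
    return shifted;
}

/* Rotate left: bits leaving bit 7 re-enter at bit 0. */
uint8_t rotate_left(uint8_t v, unsigned n)
{
    n &= 7;
    return (uint8_t)((v << n) | (v >> ((8 - n) & 7)));
}
```

For the byte 0xF0 (-16 as a signed value), the logical shift right by 1 gives 0x78 (positive 120), while the arithmetic shift gives 0xF8 (-8).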

Other than for the multiply and divide effects, we use bit shifting frequently with Boolean Algebra. To set bit 3:

n | (1<<3)

To clear bit 3:

n & ~(1<<3)

Note that 1<<3 = 01000b, 
and ~(1<<3) is  ~01000b 
              or 00111b.   (all the bits are inverted)
When you AND with 00111b, you are clearing bit 3.
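Wrapped up as C functions, the same expressions look like this. The names set_bit, clear_bit, and test_bit are made up for the sketch, and bits are numbered from 0 as in the rest of the text.

```c
#include <assert.h>
#include <stdint.h>

/* Set one bit: OR with a mask that has only that bit on. */
uint8_t set_bit(uint8_t v, unsigned bit)
{
    assert(bit < 8);
    return (uint8_t)(v | (1u << bit));
}

/* Clear one bit: AND with a mask that has every bit on except that one. */
uint8_t clear_bit(uint8_t v, unsigned bit)
{
    assert(bit < 8);
    return (uint8_t)(v & ~(1u << bit));
}

/* Test one bit: returns 1 if it is set, 0 otherwise. */
int test_bit(uint8_t v, unsigned bit)
{
    assert(bit < 8);
    return (int)((v >> bit) & 1u);
}
```

set_bit(0, 3) gives 8 (01000b), and clear_bit(0xFF, 3) gives 0xF7.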

Memory

Memory (RAM) can be viewed as an array of bytes. If you have 1MB of RAM, your array is indexed from 0 to 1MB-1. The index is better known as an address.

Memory is used to store your program, for your program stack, for your program’s heap (memory allocation) and to store your variables. In a simple CPU and RAM setup, you might have your program start at index 0, your variables start at the end of the program, your heap starts at the end of your variables, and your stack starts at the top of memory and works its way downward as you push onto it.

HIGH memory address
+---------------+
|               |
| stack         |
| grows down    |
| address 1M    |
|               |
+---------------+
|               |
| heap          |
| grows up      |
|               |
+---------------+
|               |
| uninitialized |
| global        |
| variables     |
|               |
+---------------+
|               |
| initialized   |
| global        |
| variables     |
|               |
+---------------+
|               |
| code          |
| address 0     |
|               |
+---------------+
LOW memory address

ELF Files and the Loader

The compiler/assembler/linker generate ELF formatted files. An ELF file is divided into various sections. The more common sections are .text (code), .data (initialized data), .rodata (read-only data, i.e. constants), .bss (uninitialized data), and assorted debugging info sections.

The operating system program loader reads in the ELF file and allocates memory for the .text section and loads that data from the file into that memory.

Then the loader allocates memory for the initialized data (.data) and reads that data from the file into that memory.

Then the loader allocates memory for the constant data (.rodata) and reads that data from the file into that memory.

The loader allocates memory for the .bss section. Since the .bss section is uninitialized, it only needs to be allocated.

The linker reads in intermediate object files (.o) and links them together to make the final executable. Each .o file may define variables that are accessed from other .o files, and may access variables that are defined in some other .o file. The linker fixes up the addresses in the final output so the code works as expected!

Permissions (Sections and Privileged Instructions)

The compiler/assembler/linker generally mark the code as executable and not writable. If you try to store to those addresses, you will get a segfault.

The .data and .bss sections are marked as read/write and the .rodata is marked as read-only.

The way words of the different sizes are stored in memory is determined by the "endianness" of the CPU. A CPU that is big endian stores the high byte first (at the lowest address), then the next highest byte, and so on, with the lowest byte last. A CPU that is little endian stores the low byte first and the high byte last. The x86 family is little endian.
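The two byte orders can be shown without depending on the machine the code runs on. In this C sketch, the function names are invented; each one writes a 32-bit value into a 4-byte array the way a CPU of that endianness would lay it out in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store value the way a little endian CPU does: low byte at mem[0]. */
void store_little_endian(uint8_t *mem, uint32_t value)
{
    assert(mem != NULL);
    for (int i = 0; i < 4; i++)
        mem[i] = (uint8_t)(value >> (8 * i));
}

/* Store value the way a big endian CPU does: high byte at mem[0]. */
void store_big_endian(uint8_t *mem, uint32_t value)
{
    assert(mem != NULL);
    for (int i = 0; i < 4; i++)
        mem[i] = (uint8_t)(value >> (8 * (3 - i)));
}
```

For the value 0x11223344, the little endian layout is 44 33 22 11 and the big endian layout is 11 22 33 44.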

The CPU has special features that enforce these permissions. If you try to defeat the permissions, a segfault exception is raised. The operating system sets up these features when the program is started. When a violation occurs, it kills the program and potentially generates a core dump file of the program. The core dump file can be used later to do forensic debugging/analysis of the failure.

MMU

In modern operating systems, the CPU uses an MMU (Memory Management Unit) to assign a subset of the system's memory to each program that you run. The MMU maps each logical address that the program sees and uses to an address in physical memory. This allows, for example, a CPU to split 1MB of RAM into two 512K address spaces to run two programs. The address translation makes it so each program thinks it has 512K of RAM starting at address 0 and ending at address 512K - 1.

The use of the MMU is much more clever than I just explained, but the end result is the same. When a program is launched, it is allocated a small amount of RAM, enough for the program’s code and variables and stack and a minimal heap. As the program needs more stack or more heap, the OS adds physical memory to the program’s address space using the MMU. The program grows on demand.

For our purposes, we can assume we’re the only program running on the machine. It matters not if there’s an OS using the MMU or not, the programming effort and techniques are the same either way.

Paging and Swapping

The operating system only needs to set up the MMU for enough physical memory for the program to execute. Memory is allocated for the MMU in 4096 byte chunks (pages); this is required by the MMU implementation (hardware).

This scheme is quite efficient, as a small assembly program might only need a couple of megabytes of RAM (2MB for stack is default in the OS!), and your computer might have 16 Gigabytes of RAM. This efficient allocation of the CPU’s memory allows you to load and run many programs at the same time.

When your program tries to access an address in memory that isn’t mapped by the OS using the MMU, a page fault exception is raised. The OS sees this and might map in an additional page so that the access can succeed.
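A toy model of the translation may help. This C sketch is a big simplification of what real hardware does: it assumes a flat table (here called frame_of_page, a made-up name) that gives the physical page number for each logical page, and it uses an assert where a real MMU would raise a fault for an unmapped page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* page size used throughout this section */

/* Translate a logical address to a physical one using a flat table
 * of physical page numbers. Hypothetical helper for illustration only. */
uint64_t translate_address(const uint64_t *frame_of_page, size_t page_count,
                           uint64_t logical)
{
    assert(frame_of_page != NULL);

    uint64_t page = logical / PAGE_BYTES;     /* which page the address is in */
    uint64_t offset = logical % PAGE_BYTES;   /* position inside that page */

    assert(page < page_count);                /* unmapped: a real MMU faults here */
    return frame_of_page[page] * PAGE_BYTES + offset;
}
```

If logical page 1 is mapped to physical page 3, then logical address 0x1234 becomes physical address 0x3234: the page number changes and the offset within the page stays the same.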

If the system is out of memory, the OS might compress programs and/or their data to make more RAM available. The OS has to decompress this memory when it’s those programs’ turn to execute, though. MacOS does this compression, and it’s very clever.

Another thing the OS can do when there is an out of memory (OOM) condition is to "page" one or more 4096 byte pages from memory to the system's swap file/partition. This frees up enough pages to handle the page fault. When a program that has memory paged to disk is scheduled to run (use the CPU), the code might cause further page faults to read the paged memory back in. It's possible the program never accesses that memory, and that's perfectly fine.

Yet another thing the OS can do is to swap out entire programs (and their data) to the swap file/partition. When those programs get to run, they have to be entirely read back into memory (and the MMU set up), perhaps after swapping another program out to disk. When the system is tight on free memory and is swapping heavily, it will become very unresponsive!

Finally, if the OS cannot resolve the OOM condition with one of those (or potentially other clever) strategies, it kills a running program, chosen by some heuristic. This seems evil, but what else can it do?

The stack grows down from high memory. If the stack overflows (grows below the memory allocated for it), a page fault occurs and the OS can add additional pages to the memory map so the stack has more room.

The heap initially has a small but reasonable amount of RAM allocated. It can be expanded using the sbrk() call. This is one of the mechanisms the malloc() function can use in C, though sbrk() can be called directly if you know what you're doing.

Other exceptions

Segfault

It should be noted that a program might just randomly access some address that is truly outside the bounds of the program’s memory map. Paging or swapping is not performed in this case. The MMU is set up so these addresses are simply not mapped into the program’s memory map. Instead of raising a page fault exception, the CPU/MMU raises a segfault exception.

This is a hard program crash, and the operating system will terminate the program.

Divide By Zero

If your program attempts to divide by zero, this exception is raised and the program is terminated.

Invalid Opcode

If your program somehow executes instructions that are not valid x64/amd64 instructions, this exception is raised and the program is terminated. This will occur, for example, if you push a random number on the stack and then return. Your program starts executing at that random address and who knows what data are there? If the random number/return causes the program to execute outside its address space, you get a Segfault instead.

General Protection

If your program attempts to execute a privileged instruction, this exception is raised and the program is terminated. There are quite a few privileged instructions, such as CLI/STI (disable/enable interrupts). An OS should not allow programs to disable interrupts, or your multitasking stops working!

ALU

The cost of having circuitry to add two arbitrary memory locations together is prohibitive. You have 1M x 1M add circuits required, and that’s just for addition!

The math (add) capability is, instead, implemented in the ALU (Arithmetic-Logic Unit) of the CPU. The CPU provides some (small) number of general purpose «registers» and the ALU implements the add circuitry just between those registers.

You can think of a register as a (temporary) variable that is on chip, usable by the ALU to do math and logic operations. You have to load your operand or operands into registers to perform math, then you can store the result to a variable in memory.

For example, suppose we have two registers named a and b, and we want to add the two numbers at memory locations (addresses) 0x100 and 0x200 and store the result at address 0x300:

  load value at 0x100 into a
  load value at 0x200 into b
  add a and b, leaving result in a
  store a at 0x300

I have just introduced something like a snippet of assembly language code! We need operations to be able to load memory into registers, add registers together, and store registers to memory. Each of these operations is a CPU «opcode.» The CPU reads the byte opcode from memory and executes it. Some opcodes, like the load and store ones require parameters like the address to load from or store to. These addresses are stored in the program immediately following the opcode. As we progress, we’re going to see that the instruction sizes (op code plus parameters) are different depending on the instruction (op code) and parameters.

In the simplest view of the CPU, the above program is 4 instructions. The load and store instructions each use 1 byte for the opcode and 2 more for the address. The add uses just the one byte for the opcode (add b to a).

Each instruction uses 1 or more "clock cycles," depending on the complexity of the operation. The load instruction requires 1 clock cycle to fetch the opcode, 2 more to fetch the two bytes of the address, and another 2 to load the value from RAM at that address, for 5 total clock cycles. The add instruction takes just 1 clock cycle. The store takes 5 as well.
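The simple CPU described here can be simulated in a few lines of C. Everything specific in this sketch is an assumption made for illustration: the opcode numbers, the halt opcode (which the text does not mention), and the choice to store the 2 address bytes low byte first. The cycle counts follow the text: 5 for a load or store, 1 for an add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbers, chosen only for this sketch. */
enum { OP_HALT = 0, OP_LOAD_A = 1, OP_LOAD_B = 2, OP_ADD = 3, OP_STORE_A = 4 };

/* Run the program stored in mem starting at pc until OP_HALT.
 * Returns the number of clock cycles used (the halt itself is not counted). */
unsigned run_toy_cpu(uint8_t *mem, size_t mem_size, size_t pc)
{
    uint8_t a = 0, b = 0;      /* the two registers */
    unsigned cycles = 0;

    assert(mem != NULL);
    for (;;) {
        assert(pc < mem_size);
        uint8_t opcode = mem[pc++];            /* fetch the opcode byte */
        if (opcode == OP_HALT)
            return cycles;
        if (opcode == OP_ADD) {                /* add b to a: opcode only */
            a = (uint8_t)(a + b);
            cycles += 1;
            continue;
        }
        assert(opcode == OP_LOAD_A || opcode == OP_LOAD_B || opcode == OP_STORE_A);
        assert(pc + 1 < mem_size);
        /* The 2 address bytes follow the opcode, low byte first. */
        size_t addr = (size_t)mem[pc] | ((size_t)mem[pc + 1] << 8);
        pc += 2;
        assert(addr < mem_size);
        if (opcode == OP_LOAD_A)
            a = mem[addr];
        else if (opcode == OP_LOAD_B)
            b = mem[addr];
        else
            mem[addr] = a;
        cycles += 5;
    }
}
```

Running the four-instruction program from the text (followed by a halt) with 100 at 0x100 and 50 at 0x200 leaves 150 at 0x300 and takes 5 + 5 + 1 + 5 = 16 cycles.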

x64/AMD64 Registers

For all intents and purposes, the Intel and AMD processors have the same registers until you get into exotic features (like hardware video decoding). I use the terms x64 and AMD64 interchangeably throughout this tutorial.

General Purpose Registers

You have 4 general purpose registers, A, B, C, and D, though we don’t use these specific names for the registers. The size of the register/contents matters. So for a byte value, we use AL or AH, or BL/BH, or CL/CH, or DL/DH. The L means «low order byte» and H means «high order byte.» For word values, we use AX, BX, CX, and DX. For 32 bit word values, we use EAX, EBX, ECX, and EDX. And for 64 bit word values, we use RAX, RBX, RCX, and RDX.

When we write to an 8-bit or 16-bit register name, the remaining bits in the full register are not affected. For example, if AX contains 0x0102 and we load 0x03 into AL, AX will contain 0x0103. The exception is the 32-bit names: in 64-bit mode, writing to EAX clears the upper 32 bits of RAX. This matters if you, in error, load bytes into registers and then add the word registers together. There might be tricks you play to take advantage of the nature of the register loads/stores.
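The effect of writing to part of a register can be modeled in C by treating the full 64-bit register as a uint64_t. The function name write_subregister is made up for this sketch. One detail is worth knowing when reading it: in 64-bit mode, a write to a 32-bit register name such as EAX zeroes the upper 32 bits of the full register, while 8-bit and 16-bit writes leave the other bits alone.

```c
#include <assert.h>
#include <stdint.h>

/* Return the new 64-bit register contents after writing `value`
 * through a name that is `width` bits wide (low bits only, so this
 * models AL/AX/EAX/RAX but not AH).
 * Hypothetical helper for illustration only. */
uint64_t write_subregister(uint64_t reg, unsigned width, uint64_t value)
{
    assert(width == 8 || width == 16 || width == 32 || width == 64);

    if (width == 64)
        return value;                           /* whole register replaced */

    uint64_t mask = (UINT64_C(1) << width) - 1;
    if (width == 32)
        return value & mask;                    /* upper 32 bits become zero */
    return (reg & ~mask) | (value & mask);      /* other bits are preserved */
}
```

Starting from 0x0102, an 8-bit write of 0x03 gives 0x0103, as in the AX/AL example above.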

AMD64 and x64 add 8 more general purpose registers, R8, R9, R10, R11, R12, R13, R14, and R15. These are accessed as 8, 16, 32, and 64 bit registers: R8B-R15B (8 bits), R8W-R15W (16 bits), R8D-R15D (32 bits), and R8-R15 (64 bits).

Special Purpose Registers

The RCX/ECX/CX register doubles as a counter for dedicated instructions. The AMD64 instruction set includes instructions to fill, copy, and compare memory, and loops that use this register as the number of bytes/words/dwords/qwords to fill/copy/compare. The special loop instructions use this register as the loop counter as well.

The RSI/ESI/SI and RDI/EDI/DI registers are general purpose "source" and "destination" registers for the fill, copy, and compare instructions.

The RBP register is a general purpose register that is typically used as a base address register or by high level language compilers to maintain function stack frames (arguments, return address, and local variables allocated on the stack).

CPU Control Registers

Stack

The RSP register contains the address of the last thing pushed on the processor stack. You can push registers on the stack to preserve their values, you can pop them to restore their values, address values already on the stack by index, etc.

Instruction Pointer

The RIP register contains the address of the next instruction to be executed. The CPU automatically adds the correct number to it as it executes instructions to keep it pointed at the correct next instruction. When you call a subroutine, the RIP is pushed on the RSP stack and RIP is loaded with the address of the subroutine. When the subroutine returns, the RIP that was pushed before the call is popped from the stack into RIP. Execution continues at the instruction after the call.

Flags

The FLAGS register is 64 bits containing information provided by the CPU to the program, and commands from the program to the CPU. Not all the bits are used. See https://en.wikipedia.org/wiki/FLAGS_register.

An example of the bits in FLAGS set by the CPU is the Carry Flag. It is set when you have a carry after an arithmetic operation. For example, if you add 1 to the AL register that contains 255, you will get AL=0, Carry = 1. If you add 1 to AL=254, the Carry will be 0.
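C has no direct access to the Carry Flag, but its behavior for an 8-bit add can be imitated by doing the sum in a wider type and looking at the 9th bit. The function name add_with_carry_out is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two bytes as the CPU would add to AL: return the 8-bit result
 * and report through *carry whether the sum needed a 9th bit.
 * Hypothetical helper for illustration only. */
uint8_t add_with_carry_out(uint8_t a, uint8_t b, int *carry)
{
    assert(carry != NULL);

    unsigned wide = (unsigned)a + b;   /* room for the 9th bit */
    *carry = wide > 0xFF;              /* 1 if the sum did not fit in a byte */
    return (uint8_t)wide;              /* what ends up in the 8-bit register */
}
```

Adding 1 to 255 returns 0 with carry 1; adding 1 to 254 returns 255 with carry 0.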

An example of the bits in the FLAGS set by the program is the Direction Flag. If this is 0, the fill/copy/etc. instructions work from start address forward (auto-increments SI and DI). If this is 1, the operations are done backward (auto-decrement).

The FLAGS register is there to use, but we might really only directly use the Carry bit and Direction bit. We might use the Carry bit to return a true/false result from a function. The CLC and STC instructions clear and set the Carry bit.

The various conditional branch instructions test flag bits such as Carry and Zero internally.

There are several instructions that set and clear these bits programmatically.

AMD64 Instruction Set

You will learn the instruction set as you go. The instruction set is documented as a reference manual, not a programming manual. That is, each instruction is documented as to what it does. But there is no particular "how to use this instruction" documentation.

You can find the instruction set documented on various Web Sites. The best source is the Intel Programmer’s Manual or the AMD64 Programmer’s Manual.

Here is a decent Web Page that lists the instructions in a table, one line per instruction with a short description.

https://www.felixcloutier.com/x86/

There are over 1500 instructions, from AAA to XTEST, that we can use. That is too many to document every one here. However, there are far fewer commonly used instructions that we use for most things.

The format of a line of source code in assembly is:

[optional label] instruction
or
[optional label] instruction operand
or
[optional label] instruction operand1, operand2

When assembled, the instructions are encoded as opcode and operands as a sequence of bytes. The CPU is able to execute these instructions.

Assembly source

In assembly source, the NASM assembler expects operands to be specified as destination, source (Intel syntax) while the gas assembler expects operands to be specified as source, destination (AT&T syntax). The assembler language for the various CPUs (e.g. MC68000, AMD64, ARM, etc.) each specify whether the left operand is source or destination. The gas assembler can be used to assemble source for various processors so it defaults to source, destination format, though you can tell it to use Intel (NASM) syntax.

In Intel syntax source programs, the semicolon (;) character introduces the start of a comment. All characters from that point on, to the end of the line, are ignored.

Before we look at some of these instructions, we need to look at addressing modes.

Addressing Modes

Addressing modes are the means by which operands to instructions are described and how they execute. For example, register operands indicate specific registers, but memory operands can be addressed through a variety of combinations of offsets and/or register contents.

To examine the addressing modes, we’ll use the MOV instruction, which copies a value in a register to memory or loads a value to a register from memory.

The source and/or destination operand is specified using one of the addressing modes.

The instruction-set/addressing.asm file contains example usage of the various addressing modes.

Register Operands

Rather than memory being the source or destination, the operand is a register. For example,

	mov rax, rbx ; moves contents of rbx register into the rax register.

Direct Memory Operands (better known as Immediate operands)

This mode moves a constant into a register. The constant is encoded in the instruction, after the opcode. For example,

        mov rax, 10 	; source operand is a constant

Indirect Operands

This mode uses a register as the address of a memory location to be operated on (e.g. load from, store to). For example,

        mov [rax], rbx   ; store contents of rbx to memory location contained in rax

Indirect with Displacement

This mode uses a register as the base address of a memory location, added to a fixed offset, to determine the address of a memory location to be operated on. For example,

        mov rax, [rbx+24]  ; access memory at 24 + contents of rbx

The purpose of this addressing mode is to facilitate accessing a structure and its members. Consider:

struct {
  char *name,
       *address,
       *phone;
} person;
person.name = nullptr;
person.address = nullptr;
person.phone = nullptr;

In assembly, we’d do something like this:

NAME equ 0
ADDRESS equ 8
PHONE equ 16     ; each pointer is 8 bytes in 64-bit code

mov rsi, person  ; load address of person into RSI
mov rax, 0       ; nullptr
mov [rsi + NAME], rax
mov [rsi + ADDRESS], rax
mov [rsi + PHONE], rax
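The displacements do not have to be worked out by hand: in C, the offsetof macro reports them. The sketch below uses a named struct, person_record, and a helper, store_pointer_at, both invented for this example. The helper stores through base + displacement, which is what a base register plus a displacement does in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the person struct above, with a name so we can ask for offsets. */
struct person_record {
    char *name;
    char *address;
    char *phone;
};

/* Store `value` at base + displacement, the way [rsi + displacement] does.
 * Hypothetical helper for illustration only. */
void store_pointer_at(void *base, size_t displacement, char *value)
{
    assert(base != NULL);
    assert(displacement % sizeof(char *) == 0);   /* keep the store aligned */
    *(char **)((char *)base + displacement) = value;
}
```

On a 64-bit target, where a pointer is 8 bytes, offsetof(struct person_record, phone) is 16, so store_pointer_at(&p, offsetof(struct person_record, phone), NULL) clears the phone member of p.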

Another use of this addressing mode is for stack frames for a language such as «C», especially for calling subroutines. A subroutine may have arguments passed to it on the stack, by value (like an int) or reference (like an address of a struct or string or whatever). A subroutine may need its own local variables. When a subroutine is called recursively, each recursive call must prepare the stack so it has arguments to pass, and allow for the next iteration’s local variables on the stack.

The RBP register is used for stack frames when stack conventions are used for calling functions in «C».

The calling function pushes arguments on the stack (right to left). That is, for foo(a, b, c);, the compiler will generate code to push c, then b, then a.

Upon entry to a function, RBP contains the stack frame pointer for the calling function. The compiler generates code to immediately push it. Then the RSP stack pointer is copied into RBP.

At this point, RBP points to the saved RBP of the caller, the return address sits just above it, and positive offsets from RBP beyond the return address reach the arguments to the function.

For local variables, the compiler generates a subtract from RSP to make the desired space on the stack. When the function calls another, RSP is below the allocated variables, so it all works. Negative offsets from RBP are used to access the local variables.

To return, the compiler generates code to copy RBP back into RSP (discarding the locals), pop rbp (restoring the caller's stack frame), and return. The calling code has to adjust RSP to remove the pushed arguments.

Note: AMD64/X64 use a register scheme for passing arguments to functions and uses the stack when there are too many arguments to pass (not enough registers). See https://en.wikipedia.org/wiki/X86_calling_conventions. I present this information because you will likely run across stack frames, especially when viewing GDB (command line debugger) backtraces.

Let’s see a little bit of example code and the assembly generated by the compiler. Note that this is in AT&T syntax, source, destination format. The register names are prefixed with %.

// source
void bar(int a, int b) {
    int x, y;

    x = 555;
    y = a+b;
}

void foo(void) {
    bar(111,222);
}

; compiles to:
bar:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $16, %esp
    movl    $555, -4(%ebp)
    movl    12(%ebp), %eax
    movl    8(%ebp), %edx
    addl    %edx, %eax
    movl    %eax, -8(%ebp)
    leave
    ret

foo:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $8, %esp
    movl    $222, 4(%esp)
    movl    $111, (%esp)
    call    bar
    leave
    ret

Note the use of indirect with offset addressing modes!

Indirect with displacement and scaled index

This addressing mode is used to access array elements. The element size depends on the type of the array:

in an array of bytes, each element is 1 byte
in an array of words, each element is 2 bytes
in an array of dwords, each element is 4 bytes
in an array of qwords, each element is 8 bytes

As you index the array, you have to "scale" the index before adding it to the base of the array. The scale operation ensures we are addressing byte, word, dword, or qword elements properly.

        mov [rsi + rbx*4 + member], eax   ; store the dword in eax at rsi + member (offset) + rbx x 4

The above example stores a dword into memory. We are accessing a struct member that is an array of dwords. The rbx register contains the index into the array, [0 … array.length-1]. The 4 is the scale factor, or size of the dword.

Note that member may be 0 — in this case, rsi simply contains the address of the array.

Commonly Used Instructions

Arithmetic

ADC - add with carry (adds the source operand plus the carry flag)
ADD - add two operands together
DEC - decrement by 1
DIV - unsigned divide
IDIV - signed divide
IMUL - signed multiply
INC - increment by 1
MUL - unsigned multiply
NEG - two's complement (multiply by -1)
SBB - subtract with borrow (carry flag)
SUB - subtract
LEA - load effective address (formed by some expression / addressing mode) into register

Boolean Algebra

AND - logical AND of two operands
NOT - one's complement (invert all the bits in the operand)
OR - logical OR
XOR - logical exclusive or
TEST - logical compare

Branching and Subroutines

CALL - call a subroutine/function/procedure
SYSCALL - call an OS function (Linux, Mac)
ENTER - make stack frame for procedure parameters
LEAVE - high level procedure exit
RET - return from subroutine
CMP - compare two operands
JA - jump if result of unsigned compare is above
JAE - jump if result of unsigned compare is above or equal
JB - jump if result of unsigned compare is below
JBE - jump if result of unsigned compare is below or equal
JC - jump if carry flag is set
JE - jump if equal
JG - jump if greater than (signed compare)
JGE - jump if greater than or equal (signed compare)
JNC - jump if carry not set
JMP - go to / jump (simply loads the RIP register with the address)

Bit Manipulation

BT - bit test (test a bit)
BTC - bit test and complement
BTR - bit test and reset
BTS - bit test and set
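RCL - rotate operand left count bits through the carry flag (for a byte operand, 9 bits rotate: carry flag plus 8 operand bits)
RCR - rotate operand right count bits through the carry flag (same 9-bit arrangement for a byte operand)
ROL - rotate operand left count bits (8 bits for a byte operand)
ROR - rotate operand right count bits (8 bits for a byte operand)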
SAL - arithmetic shift operand left count bits
SAR - arithmetic shift operand right count bits (maintains sign bit)
SHL - logical shift operand left count bits (same as SAL)
SHR - logical shift operand right count bits (does not maintain sign bit)

Register Manipulation, Casting/Conversions

MOV - move register to register, move register to memory, move memory to register
XCHG - exchange register/memory with register
CBW - convert byte to word (sign-extend AL into AX)
CWD/CDQ - convert word to double word (AX into DX:AX) / convert double word to quad word (EAX into EDX:EAX)

Flags Manipulation

CLC - clear carry flag/bit in flags register
CLD - clear direction bit in flags register
STC - set carry flag
STD - set direction flag

Stack Manipulation

POP - pop a register off the stack
POPF - pop stack into flags register
PUSH - push a register on the stack
PUSHF - push flags register on the stack

Assembler Source, Directives, and Macros

The assembler is a program that reads assembly source code and generates a binary output file or ELF .o file. The assembler reads a line at a time and writes the encoded program instructions for that line to the output file.

NASM is a great free assembler. The LLVM assembler and the GNU Assembler (as/gas, part of the binutils package) are two other assemblers used for Linux and MacOS assembly development. For most purposes, the LLVM and GNU assemblers accept the same source. There are other assemblers out there, but they are beyond the scope of this tutorial.

There are two styles of assembly source for x64: Intel and AT&T.

Intel syntax expects operands to be specified as destination, source.
AT&T syntax expects operands to be specified as source, destination.

The NASM assembler uses Intel syntax and the GNU/LLVM assemblers can use either Intel or AT&T; you choose which using an assembler directive.

Assembler Directives

Assembler directives are not machine instructions. Instead, they convey information to the assembler to affect code generation as you prefer. Assembler directives are specific to the assembler you are using, and source code that uses them is not portable between assemblers. The operand order of Intel and AT&T syntax also makes code written for one not portable to an assembler using the other.

The gas (gnu/llvm) assembler uses the .intel_syntax directive to tell the assembler that the source format of the file is Intel syntax. Otherwise AT&T syntax is assumed.

I’m not going to expand on all the directives for gas and NASM. There are basically similar directives for both assemblers. I prefer using NASM, though there is no reason you can’t use gas — whichever you prefer. I’ll document the common NASM directives here.

There are a lot of directives; I’m not covering all of them. For expanded information, see the NASM manual online at https://nasm.us. Hopefully, you find what is covered here to be enough to get you going.

section type [options]

The section directive specifies that the following instructions/directives apply to the specified section. Examples:

section .text
section .bss exec
section .rodata

These types were defined earlier in this document. The exec option gives this .bss section execute permission in addition to read/write.

bits 16, bits 32, and bits 64, use16, use32, use64

These directives tell the assembler to generate instructions for the CPU running in the specified mode.

When the system first boots, the CPU is in 16 bit mode. The instructions it executes at that point must be bits 16 or use16. You probably won’t be writing code for 16 bit mode.

A 32-bit operating system sets the CPU into 32 bit mode. The instructions it executes at that point must be bits 32 or use32.

This document assumes 64-bit mode, so we use bits 64. In 64-bit mode, the assembler can generate either 64-bit or 32-bit instructions, whichever is appropriate.

Comments

In a NASM source program, the semicolon (;) character introduces the start of a comment. All characters from that point on, to the end of the line, are ignored.

Note that gas supports a couple of comment styles, including /* */ C-style multiline comments, or pound sign # to introduce the start of a comment.

Constants

NASM supports constants of the form:

0x10 ; base 16
010h ; base 16
011100b ; base 2

Program Variables and Strings

Programming is useless if you can’t create variables and create and operate on strings. The assembler provides directives to reserve space for variables or to define initialized memory.

Reserving space examples:

    resb 1  ; reserve 1 byte
    resw 1  ; reserve 1 word (2 bytes)
    resd 1  ; reserve 1 dword (4 bytes)
    resq 1  ; reserve 1 qword (8 bytes)
    resb 16 ; reserve 16 bytes
    ...

Initializing memory examples:

     db 10  ; reserve 1 byte with the value 10 at the memory location
     dw 11  ; reserve 1 word with the value 11 at the memory location
     dd 10  ; reserve 1 dword with the value 10 at the memory location
     dq 10  ; reserve 1 qword with the value 10 at the memory location
     db 10, 11, 12 ; reserve 3 bytes with values 10, 11, and 12
     ...

You can use the memory initializer directives for strings:

     ; create a null terminated string
     db 'now is the time for all good men to come to the aid of their country!', 0
     ; create a null terminated string with carriage return/linefeed at the end
     db 'now is the time for all good men to come to the aid of their country!', 13, 10, 0

Assembler Variables and Labels

A label is a type of variable, and is the first thing on a line of source code. The value of the label is the current program counter as viewed by the assembler and when the program is actually running. You typically use a label to define a variable to access from assembly code or the address for jumps or subroutines.

You use the global directive to make a label’s scope visible to other .o files at link time. If you want to reference a label defined in a different .o file, you use the extern directive.

            section .text
            extern external_message ; defined in a different .o file
            ...
; find length of message
            mov rsi, message    ; load address of message into rsi
            call length
            ; print rcx, it has the length of the string
            ...
            mov rsi, external_message
            call length
            ; print rcx, it has the length of the string
            ...
length:
            xor rcx, rcx        ; fast way to set rcx to 0
.next:                          ; "loop" is an instruction name, so use another label
            mov al, [rsi]       ; get character from string
            test al, al         ; is it the terminating 0?
            je .done            ; yes: stop without counting it
            inc rsi             ; point to next character
            inc rcx             ; increment length counter
            jmp .next
.done:
            ; rcx has the length of the string
            ret
            ...

            section .rodata
            global message
message:    db 'hello, world!', 13, 10, 0

A variable is a name that refers to any numeric value you like, with a few exceptions. A common use is to define constants/expressions, as you would use #define in C. You use the EQU directive to specify the variable’s value.

Examples:

ANSWER  equ 42
CR      equ 13
NEWLINE equ 10
STDIN   equ 0
STDOUT  equ 1
STDERR  equ 2

The $ character can be used in these expressions, too. It represents the current value of the program counter as the assembler sees it.

            section .text
            mov rax, message      ; load address of message into rax
            mov rcx, message_len  ; load length of message into rcx

            section .rodata
message:    db 'hello, world!', 13, 10
message_len equ $ - message ; length of message string in bytes

You can also use the %assign directive to create and update a variable. If you try to use EQU twice on the same variable name, it is an error.

%assign count 0
%assign count count+1

There is a directive to assign a string to a variable, too. This is similar to the C #define preprocessor directive; the string is substituted in the source code when the variable is encountered.

%define hello 'hello, world!', 13, 10
            section .text
            mov rax, message      ; load address of message into rax
            mov rcx, message_len  ; load length of message into rcx

            section .rodata
message:    db hello
message_len equ $ - message ; length of message string in bytes

You can undefine one of these variables created with %define using %undef.

You can use local labels so you don’t have to keep track of every label/variable you have defined to avoid collisions. A local label begins with a period. Its scope is valid only between two true labels.

; subroutines to return address of string in RSI
get_string1:
            mov rsi, .string
			ret
.string:    db 'string1'

get_string2:
            mov rsi, .string
			ret
.string:    db 'string2'

Creating a variable or label does not generate any code!

Repetition

The times directive is used to repeat an initialization:

        section .data
stars:  times 32 db '*' ; creates 32 bytes containing * at memory location "stars".

Macros

A macro is similar to a subroutine, but is substituted inline and has powerful text processing/substitution facilities.

A macro is defined using the %macro and %endmacro directives. Everything between these two directives is the content of the macro, or the text to be substituted. The %macro directive requires the number of parameters to the macro.

; two handy macros that save me a lot of typing.
%macro pushg 0
    push rax
	push rbx
	push rcx
	push rdx
%endmacro

; note these have to be popped in the reverse order they are pushed!
%macro popg 0
    pop rdx
	pop rcx
	pop rbx
	pop rax
%endmacro

    ...
	; short and convenient
	pushg
	; use registers rax, rbx, rcx, rdx
	popg

If you want to pass arguments to your macro, you specify a non-zero number on the %macro directive. Within the macro body, you can access the parameters using %1, %2 and so on. Here’s a macro definition that demonstrates some of the power of macros.

%macro print 1
    mov rsi, .message
	call print_message
	jmp .over
.message: db %1, 0   ; %1 expands to the quoted string argument
.over:
%endmacro

   ...
   print "hello, world!"

The problem with our print macro is that it generates .message and .over local labels and you might use the macro more than once between real labels:

   print "hello, world!"
   print "goodbye cruel world!"

What happens is we have duplicate local labels and the assembler generates an error. Local labels are incredibly useful in macros, so there has to be a way, and there is. Local labels within macros are defined using the form %%label. The assembler generates a unique label name each time it expands the macro. This is the working print macro:

%macro print 1
    mov rsi, %%message
	call print_message
	jmp %%over
%%message: db %1, 0   ; %1 expands to the quoted string argument
    align 8
%%over:
%endmacro

Conditional Assembly

NASM provides %if, %elif, %else, and %endif directives that allow for conditional assembly.

; a totally contrived useless example, for illustrative purposes
%assign foo 1
%if foo=1
   mov rax, 32
%else
   mov rax, 42
%endif

NASM also provides the %ifdef directive, which works with %elif and the other conditional assembly directives. Instead of testing a condition as %if does, it tests whether a name has been defined with %define.

; comment out the undef to enable the LINUX "do linux things" code
%define LINUX
%undef LINUX
%ifdef LINUX
; do linux things
%else
; do mac things
%endif

NASM provides the %ifidn directive, which works with %elif and the other conditional assembly directives. NASM also predefines variables that you can test with %ifidn. A particularly useful one is __?OUTPUT_FORMAT?__, which you can test to determine whether to generate code for Linux or MacOS (or other):

%ifidn __?OUTPUT_FORMAT?__, macho64
  ; do macos stuff
%else
  ; do linux stuff
%endif

See: https://nasm.us/xdoc/2.15.03rc8/html/nasmdoc5.html for all the predefined variables.

Alignment

As you are writing your code, you may want instructions or data aligned on a word, dword, qword, or other size boundary. You can get a performance boost by having the target of a branching instruction such as jmp or call aligned on such a boundary.

    align 8 ; align next code/data generated at next 8 byte boundary/address
    align 16 ; align next code/data at next 16 byte boundary

    db 'hello'
    align 8
my_code_is_aligned:

Alignment is also useful for data structure definitions so your assembly structs can match up with ones defined in C.

Structures

You can define high-level-like structures using the struc and endstruc directives. The struc directive takes one parameter, the name of the structure. The structure members are defined using the resb/resw/resd/resq space allocation directives. The alignb directive is used to align structure members on the desired boundaries.

struc Contact
.company: resb 1 ; true for company, false for individual
   alignb 4      ; a C compiler aligns a dword member on 4 bytes
.company_id: resd 1 ; identifier
.name: resb 64 ; max 64 characters for name
.address: resb 64 ; also 64 for address
.phone: resb 16 ; 16 characters for phone number
endstruc

Using a structure is straightforward:

   mov rsi, [person] ; fetch address of Contact struct into RSI
   mov al, [rsi+Contact.company]
   test al,al
   jne .is_company
   ; is an individual
   print "Person"
   push rsi
   lea rsi, [rsi+Contact.name] ; address of the name characters inside the struct
   call printit
   pop rsi
   ...
.is_company:
   ; is a company
   print "Company"
   push rsi
   lea rsi, [rsi+Contact.name] ; address of the name characters inside the struct
   call printit
   pop rsi
   ...

You use the istruc, at, and iend directives to declare instances of structures.

a_company: istruc Contact
  at Contact.company, db 1
  at Contact.company_id, dd 100
  at Contact.name, db 'Engulf and Devour Corp', 0
  at Contact.address, db '1 Main Street, Anytown USA', 0
  at Contact.phone, db '1-800-devour!', 0
iend

Includes

NASM provides two commonly used ways to pull in other files:

    %include "path/to/file"
    incbin "path/to/file"

The %include directive works like the C #include directive: it simply reads the specified file in place and assembles it as if it were part of the current file. You can arbitrarily nest these includes, like you do in C.

The incbin pseudo-instruction includes a raw binary file, verbatim, in the output file at the current position. You can use it, for example, to include a .gif file in your code:

my_gif:
   incbin '/path/to/my/picture.gif'
my_gif_size equ $-my_gif

Hello, World

MacOS Version

See hello-world/ directory for a build script and this assembly source.

; Use the build-macos.sh script to assemble and link this.

        bits 64

		section .text

		global start
start:
		mov     rax, 0x2000004 ; write
		mov     rdi, 1 ; stdout
		mov     rsi, msg
		mov     rdx, msg.len
		syscall

		mov     rax, 0x2000001 ; exit
		mov     rdi, 0
		syscall


		section .data

msg:    db      "Hello, world!", 10
.len:   equ     $ - msg

It works. Here’s the output:

# ./build-mac.sh
Run it via ./hello-macos
# ./hello-macos
Hello, world!
#

Linux version

Linux has different (from MacOS) syscall numbers passed in rax. The entry point for Linux programs is "_start" vs "start" on MacOS.

Otherwise, the program is the same.

; use the build-linux.sh script to assemble and link this
        bits 64

		section .text

        global _start
_start:
		mov     rax, 1 ; write
		mov     rdi, 1 ; stdout
		mov     rsi, msg
		mov     rdx, msg.len
		syscall

		mov     rax, 60 ; exit
		mov     rdi, 0
		syscall


		section .data

msg:    db      "Hello, world!", 10
.len:   equ     $ - msg

# ./build-linux.sh
Run it via ./hello-linux
# ./hello-linux
Hello, world!
#

How it works

MacOS and Linux provide quite a few syscalls each, or operating system calls that we can call from any language. There are quite a few syscalls in common between the two, but they are different flavors of Unix (linux vs. BSD-ish/MacOS). The two flavors have several syscalls that are provided in one OS but not the other. The syscall numbers (passed in rax) are also different between the operating systems.

The C libraries contain code similar to our code above, to write strings to a file. For our purposes we use the file number for stdout to write to the console.

For most C calls that are not implemented entirely by the standard C/C++ libraries, there is a syscall. For example, malloc and free are implemented by libc, so there are no syscalls for them. The request for more memory from the OS, however, is a syscall: on Linux, for example, it is brk, which the library function sbrk() is built on.

The syscalls take arguments in the CPU registers. RAX contains the syscall number (one for write, one for exit in the above).

Linux Syscalls

Linux syscalls are documented here:
https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md
The syscalls for Linux are defined in:
/usr/include/sys/syscall.h

MacOS Syscalls

The syscalls for MacOS are defined in:
./Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/syscall.h
These syscall numbers are subject to change, so you should, at least, use the defines in your syscall.h and realize that when you update your OS, you need to verify the numbers haven’t changed.

Alternatively, you can programmatically scan the syscall.h file and generate assembly EQU for each syscall and always have the correct syscall numbers in your program.

If the parameters to the OS syscalls somehow change, your program will crash. It’s not likely every syscall is going to have these changes, but you will need to fix your code when this does happen.


Nowadays there is rarely a need to write in pure assembly, but I definitely recommend it to anyone interested in programming. You will see things from a different angle, and the skills will come in handy when debugging code in other languages.

In this article we will write a reverse Polish notation (RPN) calculator from scratch in pure x86 assembly. When we are done, we will be able to use it like this:

$ ./calc "32+6*" # "(3+2)*6" in infix notation
30

We will start by writing a basic Hello world! program to check the environment setup. Then we move on to system calls, the call stack, stack frames, and the x86 calling convention. After that, for practice, we will write some basic functions in x86 assembly and start writing the RPN calculator.

It is assumed that the reader has some experience programming in C and basic knowledge of computer architecture (for example, what a CPU register is). Since we will be using Linux, you should also be comfortable with the Linux command line.

Setting up the environment

As already mentioned, we are using Linux (64-bit or 32-bit). The code shown does not work on Windows or Mac OS X.

On Ubuntu and Debian you can install the GNU linker ld (from binutils) and the NASM assembler with one command:

$ sudo apt-get install binutils nasm

I would also recommend keeping an ASCII table at hand.

Hello, world!

To check the environment, save the following code in a file named calc.asm:

; The linker looks for the _start symbol and begins executing the program
; from there.
global _start

; The .rodata section holds constants (read-only data)
; The order of sections does not matter, but I like to put it first
section .rodata
    ; Declare a few bytes as hello_world. The NASM db pseudo-instruction
    ; accepts a single-byte value, a string constant, or a combination of them,
    ; as here. 0xA = newline, 0x0 = string-terminating null
    hello_world: db "Hello world!", 0xA, 0x0

; Start of the .text section, where the program code lives
section .text
_start:
    mov eax, 0x04           ; store the number 4 in eax (0x04 = write())
    mov ebx, 0x1            ; file descriptor (1 = standard output, 2 = standard error)
    mov ecx, hello_world    ; pointer to the string to print
    mov edx, 14             ; length of the string
    int 0x80                ; raise interrupt 0x80, which the OS
                            ;   interprets as a system call

    mov eax, 0x01           ; 0x01 = exit()
    mov ebx, 0              ; 0 = no errors
    int 0x80

The following commands assemble the source file into an object file and then link the executable:

$ nasm -f elf32 calc.asm -o calc.o
$ ld -m elf_i386 calc.o -o calc

After running it you should see:

$ ./calc
Hello world!

Makefile

CFLAGS= -f elf32
LFLAGS= -m elf_i386

all: calc

calc: calc.o
	ld $(LFLAGS) calc.o -o calc

calc.o: calc.asm
	nasm $(CFLAGS) calc.asm -o calc.o

clean:
	rm -f calc.o calc
        
.INTERMEDIATE: calc.o

Then, instead of the commands above, simply run make.

System calls

syscall 0x01: exit(int error_code)
  error_code - use 0 to exit without errors and any other value (such as 1) for errors
syscall 0x04: write(int fd, char *string, int length)
  fd - use 1 for standard output, 2 for standard error
  string - pointer to the first character of the string
  length - length of the string in bytes

eax	ebx	ecx	edx
Syscall number	arg1	arg2	arg3

Call stack

Calling convention for the x86 architecture

Caller rules

Before calling a function, the caller must:

Save the caller-saved registers on the stack. The called function may modify some registers; to avoid losing data, the caller must push them onto the stack beforehand. These are eax, ecx, and edx. If you are not using some of them, you do not need to save them.
Push the function arguments onto the stack in reverse order (last argument first, first argument last). This order guarantees that the called function finds its arguments on the stack in the correct order.
Call the subroutine.

The function stores its result, if any, in eax. Immediately after the call, the caller must:

Remove the function arguments from the stack. This is usually done by simply adding the number of bytes to esp. Remember that the stack grows downward, so removing from the stack means adding bytes.
Restore the saved registers by popping them off the stack in reverse order with the pop instruction. The called function does not change any other registers.

_mysubroutine:
    ; ...
    ; some code here
    ; ...
    push ecx       ; save registers (I chose not to save eax)
    push edx
    push 2         ; second rule: push the arguments in reverse order
    push 10
    call _subtract ; eax now equals 10-2=8
    add esp, 8     ; remove 8 bytes from the stack (two 4-byte arguments)
    pop edx        ; restore the saved registers
    pop ecx
    ; ...
    ; more code, where I use the wonderfully useful value in eax
    ; ...
Callee rules

When called, the subroutine must:

Save the previous frame's base pointer ebp by pushing it onto the stack.
Adjust ebp from the previous frame to the current one (the current value of esp).
Allocate space on the stack for local variables, moving the esp pointer if necessary. Since the stack grows downward, subtract the required amount of memory from esp.
Save the callee-saved registers on the stack. These are ebx, edi, and esi. Registers you do not plan to modify need not be saved.

[Figure: call stack after step 1]

[Figure: call stack after step 2]

[Figure: call stack after step 4]

Before returning, the subroutine must:

Restore the saved registers by popping them off the stack in reverse order.
Free the stack space allocated for local variables in step 3, if necessary; this is done simply by setting esp to ebp.
Restore the previous frame's base pointer ebp by popping it off the stack.
Return with ret.

Now let's implement the _subtract function from our example:

_subtract:
    push ebp            ; save the previous frame's base pointer
    mov ebp, esp        ; set up ebp
    ; Here I would allocate stack space for local variables, but I don't need any
    ; Here I would save the callee-saved registers, but I am not going to
    ; modify anything
    ; The function starts here
    mov eax, [ebp+8]    ; copy the function's first argument into eax. The brackets
                        ; mean a memory access at address ebp+8
    sub eax, [ebp+12]   ; subtract the second argument at address ebp+12 from the first
                        ; argument
    ; The function ends here, eax holds its return value
    ; Here I would restore registers, but none were saved
    ; Here I would free the stack space for variables, but none was allocated
    pop ebp             ; restore the previous frame's base pointer
    ret

Enter and leave

_subtract:
    enter 0, 0          ; save the previous frame's base pointer and set up ebp
    ; Here I would save the callee-saved registers, but I am not going to
    ; modify anything
    ; The function starts here
    mov eax, [ebp+8]    ; copy the function's first argument into eax. The brackets
                        ; mean a memory access at address ebp+8
    sub eax, [ebp+12]   ; subtract the second argument at address ebp+12 from
                        ; the first argument
    ; The function ends here, eax holds its return value
    ; Here I would restore registers, but none were saved
    leave               ; restore the previous frame's base pointer
    ret

Writing some basic functions

Here we will need another function, _strlen, to compute the length of a string. In C it might look like this:

size_t strlen(char *s) {
    size_t length = 0;
    while (*s != 0)
    {           // start of loop
        length++;
        s++;
    }           // end of loop
    return length;
}

_strlen:
    enter 0, 0          ; save the previous frame's base pointer and set up ebp
    ; Here I would save the callee-saved registers, but I am not going to
    ; modify anything
    ; The function starts here
    mov eax, 0          ; length = 0
    mov ecx, [ebp+8]    ; the function's first argument (pointer to the first
                        ; character of the string) is copied into ecx (it is
                        ; caller-saved, so we do not need to save it)
_strlen_loop_start:     ; this is a label we can jump to
    cmp byte [ecx], 0   ; dereference the pointer and compare the value with zero.
                        ; With an immediate operand NASM cannot infer the
                        ; operand size, so it must be given explicitly. Here
                        ; byte makes it read a single byte (one character)
    je _strlen_loop_end ; leave the loop when the zero is found
    inc eax             ; we are inside the loop now, add 1 to the return value
    add ecx, 1          ; move to the next character in the string
    jmp _strlen_loop_start  ; jump back to the start of the loop
_strlen_loop_end:
    ; The function ends here, eax holds the return value
    ; Here I would restore registers, but none were saved
    leave               ; restore the previous frame's base pointer
    ret

_print_msg:
    enter 0, 0
    ; The function starts here
    mov eax, 0x04       ; 0x04 = the write() system call
    mov ebx, 0x1        ; 0x1 = standard output
    mov ecx, [ebp+8]    ; we want to print this function's first argument,
    ; first set edx to the length of the string. Time to call _strlen
    push eax            ; save the caller-saved registers (I chose not to save edx)
    push ecx
    push dword [ebp+8]  ; push the argument for _strlen, taken from _print_msg's
                        ; own argument. NASM needs the size here because a memory
                        ; operand alone does not imply one. A pointer is a dword (4 bytes, 32 bits)
    call _strlen        ; eax now equals the length of the string
    mov edx, eax        ; move the string length into edx, where we need it
    add esp, 4          ; remove 4 bytes from the stack (one 4-byte char* argument)
    pop ecx             ; restore the caller-saved registers
    pop eax
    ; we are done with _strlen, so we can make the system call
    int 0x80
    leave
    ret

And let's see the fruits of our hard work by using this function in a complete "Hello, world!" program.

_start:
    enter 0, 0
    ; save the caller-saved registers (I chose not to save any)
    push hello_world    ; push the argument for _print_msg
    call _print_msg
    mov eax, 0x01           ; 0x01 = exit()
    mov ebx, 0              ; 0 = no errors
    int 0x80

Creating the stack

section .data
    stack_size: dd 0        ; create a dword (4-byte) variable with the value 0
    stack: times 256 dd 0   ; fill the stack with zeros

Now we can implement the _push and _pop functions:

_push:
    enter 0, 0
    ; Save the callee-saved registers that we are going to use
    push eax
    push edx
    mov eax, [stack_size]
    mov edx, [ebp+8]
    mov [stack + 4*eax], edx    ; Store the argument on the stack. The index is scaled
                                ; by four bytes, the size of a dword
    inc dword [stack_size]      ; Add 1 to stack_size
    ; Restore the callee-saved registers
    pop edx
    pop eax
    leave
    ret

_pop:
    enter 0, 0
    ; Save the callee-saved registers (there are none to save here)
    dec dword [stack_size]      ; First subtract 1 from stack_size
    mov eax, [stack_size]
    mov eax, [stack + 4*eax]    ; Load the number on top of the stack into eax
    ; I would restore registers here, but none were saved
    leave
    ret
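
For comparison, here is a minimal C sketch of what _push and _pop do with the stack array and stack_size. The names stack_push and stack_pop and the bounds assertions are assumptions added for illustration; the assembly versions above perform no bounds checks.

```c
#include <assert.h>

#define STACK_CAPACITY 256          /* matches "times 256 dd 0" */

static int stack[STACK_CAPACITY];
static int stack_size = 0;

/* Mirrors _push: store at index stack_size, then increment it. */
static void stack_push(int value) {
    assert(stack_size < STACK_CAPACITY);   /* assumption: not checked in the assembly */
    stack[stack_size] = value;
    stack_size++;
}

/* Mirrors _pop: decrement stack_size first, then read that slot. */
static int stack_pop(void) {
    assert(stack_size > 0);                /* assumption: not checked in the assembly */
    stack_size--;
    return stack[stack_size];
}
```

Pushing 3 and then 2 and popping twice returns 2 first and 3 second, which is the order the calculator relies on when it pops b before a.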

Printing numbers

_pow_10:
    enter 0, 0
    mov ecx, [ebp+8]    ; set ecx (caller-saved) to the function's
                        ; argument
    mov eax, 1          ; first power of 10 (10**0 = 1)
_pow_10_loop_start:     ; multiply eax by 10 while ecx is not 0
    cmp ecx, 0
    je _pow_10_loop_end
    imul eax, 10
    sub ecx, 1
    jmp _pow_10_loop_start
_pow_10_loop_end:
    leave
    ret

_mod:
    enter 0, 0
    push ebx
    mov edx, 0          ; explained below
    mov eax, [ebp+8]
    mov ebx, [ebp+12]
    idiv ebx            ; divides the 64-bit integer [edx:eax] by ebx. We only want to
                        ; divide the 32-bit integer in eax, so we set edx to
                        ; zero (this assumes eax is non-negative).
                        ; The quotient is stored in eax, the remainder in edx. As usual,
                        ; details on any instruction can be found in the references
                        ; listed at the end of the article.
    mov eax, edx        ; return the remainder of the division (the modulus)
    pop ebx
    leave
    ret
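
The reason edx has to be set before idiv can be modelled in C with 64-bit arithmetic. The sketch below is only a model, with hypothetical function names: it builds the edx:eax dividend once with edx zeroed (as _mod does) and once with edx holding the sign extension of eax (what the cdq instruction produces). The two agree for non-negative values and differ for negative ones.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "mov edx, 0" followed by "idiv": the dividend edx:eax is the
   32-bit value zero-extended to 64 bits. */
static int64_t quotient_edx_zeroed(int32_t eax, int32_t divisor) {
    assert(divisor != 0);
    int64_t dividend = (int64_t)(uint32_t)eax;
    return dividend / divisor;
}

/* Model of "cdq" followed by "idiv": the dividend is sign-extended. */
static int64_t quotient_edx_sign_extended(int32_t eax, int32_t divisor) {
    assert(divisor != 0);
    int64_t dividend = (int64_t)eax;
    return dividend / divisor;
}
```

For 7 / 2 both models give 3. For -7 / 2 the sign-extended model gives -3, while the zeroed-edx model divides 4294967289 by 2 and gives 2147483644. The real instruction additionally faults if the quotient does not fit in 32 bits, which this model ignores.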

_putc:
    enter 0, 0
    mov eax, 0x04       ; write()
    mov ebx, 1          ; standard output
    lea ecx, [ebp+8]    ; address of the input character
    mov edx, 1          ; print only 1 character
    int 0x80
    leave
    ret

In C, the program would look something like this:

#define MAX_DIGITS 10
void print_answer(int a) {
    if (a < 0) { // if the number is negative
        putc('-'); // print a minus sign
        a = -a; // convert to a positive number
    }
    int started = 0;
    for (int i = MAX_DIGITS; i > 0; i--) {
        int digit = (a % pow_10(i)) / pow_10(i-1);
        if (digit == 0 && started == 0) continue; // do not print leading zeros
        started = 1;
        putc(digit + '0');
    }
}
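
The C outline above can be turned into something compilable to check the digit arithmetic. The sketch below is a variant under stated assumptions rather than the article's code: it writes into a caller-supplied buffer (format_answer and its parameters are hypothetical names) instead of calling putc, and it uses 64-bit integers for the powers of ten because pow_10(10) does not fit in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DIGITS 10

/* Returns 10 raised to the power n, using 64-bit arithmetic. */
static int64_t pow_10(int n) {
    int64_t result = 1;
    while (n-- > 0) result *= 10;
    return result;
}

/* Writes the decimal form of a into out (NUL-terminated) and returns its
   length. out must have room for at least MAX_DIGITS + 2 bytes. */
static size_t format_answer(int32_t a, char *out) {
    size_t len = 0;
    int64_t value = a;              /* widened so that negation cannot overflow */
    assert(out != NULL);
    if (value < 0) {
        out[len++] = '-';
        value = -value;
    }
    int started = 0;
    for (int i = MAX_DIGITS; i > 0; i--) {
        int digit = (int)((value % pow_10(i)) / pow_10(i - 1));
        if (digit == 0 && started == 0) continue;   /* skip leading zeros */
        started = 1;
        out[len++] = (char)(digit + '0');
    }
    out[len] = '\0';
    return len;
}
```

Like the outline, this produces an empty string for 0, because every digit is then treated as a leading zero.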

Now you can see why we needed those three functions. Let's implement this in assembly:

%define MAX_DIGITS 10

_print_answer:
    enter 4, 0              ; reserve 4 bytes for the local "started" from the C code;
                            ; it is accessed as the byte at [ebp-4]
    push ebx
    push edi
    push esi
    mov eax, [ebp+8]        ; our argument "a"
    cmp eax, 0              ; if the number is not negative, skip this
                            ; conditional block
    jge _print_answer_negate_end
    ; call putc for '-'
    push eax
    push 0x2d               ; the '-' character
    call _putc
    add esp, 4
    pop eax
    neg eax                 ; convert to a positive number
_print_answer_negate_end:
    mov byte [ebp-4], 0     ; started = 0
    mov ecx, MAX_DIGITS     ; the variable i
_print_answer_loop_start:
    cmp ecx, 0
    je _print_answer_loop_end
    ; call pow_10 for ecx. We will use ebx as the "digit" variable from the C code.
    ; For now, set edx = pow_10(i-1) and ebx = pow_10(i)
    push eax
    push ecx
    dec ecx             ; i-1
    push ecx            ; first argument for _pow_10
    call _pow_10
    mov edx, eax        ; edx = pow_10(i-1)
    add esp, 4
    pop ecx             ; restore the value of i in ecx
    pop eax
    ; end pow_10 call
    mov ebx, edx        ; digit = ebx = pow_10(i-1)
    imul ebx, 10        ; digit = ebx = pow_10(i)
    ; call _mod for (a % pow_10(i)), that is (eax mod ebx)
    push eax
    push ecx
    push edx
    push ebx            ; arg2, ebx = digit = pow_10(i)
    push eax            ; arg1, eax = a
    call _mod
    mov ebx, eax        ; digit = ebx = a % pow_10(i), almost there
    add esp, 8
    pop edx
    pop ecx
    pop eax
    ; end mod call
    ; divide ebx (the "digit" variable) by pow_10(i-1) (edx). We have to save a couple of
    ; registers because idiv uses both edx and eax for the division. Since
    ; edx holds our divisor, we move it into some
    ; other register
    push esi
    mov esi, edx
    push eax
    mov eax, ebx
    mov edx, 0
    idiv esi            ; eax holds the result (the digit)
    mov ebx, eax        ; ebx = (a % pow_10(i)) / pow_10(i-1), the "digit" variable from the C code
    pop eax
    pop esi
    ; end division
    cmp ebx, 0                        ; if digit == 0
    jne _print_answer_trailing_zeroes_check_end
    cmp byte [ebp-4], 0               ; if started == 0
    jne _print_answer_trailing_zeroes_check_end
    jmp _print_answer_loop_continue   ; continue
_print_answer_trailing_zeroes_check_end:
    mov byte [ebp-4], 1     ; started = 1
    add ebx, 0x30           ; digit + '0'
    ; call putc
    push eax
    push ecx
    push edx
    push ebx
    call _putc
    add esp, 4
    pop edx
    pop ecx
    pop eax
    ; end putc call
_print_answer_loop_continue:
    sub ecx, 1
    jmp _print_answer_loop_start
_print_answer_loop_end:
    pop esi
    pop edi
    pop ebx
    leave
    ret

Now we have all the functions we need; all that remains is to implement the main logic in _start, and that's it!

Evaluating reverse Polish notation

Step	Symbol	Stack before	Stack after
1	`8`	`[]`	`[8]`
2	`4`	`[8]`	`[8, 4]`
3	`/`	`[8, 4]`	`[2]`
4	`3`	`[2]`	`[2, 3]`
5	`+`	`[2, 3]`	`[5]`
6	`6`	`[5]`	`[5, 6]`
7	`*`	`[5, 6]`	`[30]`

In assembly we need to implement something like this C code:

int stack[256];         // 256 is probably more than our stack needs
int stack_size = 0;

int main(int argc, char *argv[]) {
    char *input = argv[1];
    size_t input_length = strlen(input);
    
    for (int i = 0; i < input_length; i++) {
        char c = input[i];
        if (c >= '0' && c <= '9') { // if the character is a digit
            push(c - '0'); // convert the character to an integer and push it onto the stack
        } else {
            int b = pop();
            int a = pop();
            if (c == '+') {
                push(a+b);
            } else if (c == '-') {
                push(a-b);
            } else if (c == '*') {
                push(a*b);
            } else if (c == '/') {
                push(a/b);
            } else {
                error("Invalid input\n");
                exit(1);
            }
        }
    }
    
    if (stack_size != 1) {
        error("Invalid input\n");
        exit(1);
    }
    
    print_answer(stack[0]);
    exit(0);
}
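
To check the logic of the C outline independently of the assembly, it can be reshaped into a function that returns the result instead of printing it. This is a sketch under assumptions: rpn_eval, its return convention (0 on success, -1 on invalid input) and the extra underflow, overflow and division-by-zero checks are additions for illustration and are not part of the article's program.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_CAPACITY 256

/* Evaluates a single-digit RPN expression such as "84/3+6*".
   Returns 0 and stores the value in *result on success, -1 on invalid input. */
static int rpn_eval(const char *input, int *result) {
    int stack[STACK_CAPACITY];
    size_t size = 0;

    assert(input != NULL && result != NULL);
    for (size_t i = 0; input[i] != '\0'; i++) {
        char c = input[i];
        if (c >= '0' && c <= '9') {
            if (size == STACK_CAPACITY) return -1;    /* added check: stack is full */
            stack[size++] = c - '0';
            continue;
        }
        if (size < 2) return -1;                      /* added check: too few operands */
        int b = stack[--size];                        /* b is popped first, then a */
        int a = stack[--size];
        switch (c) {
        case '+': stack[size++] = a + b; break;
        case '-': stack[size++] = a - b; break;
        case '*': stack[size++] = a * b; break;
        case '/':
            if (b == 0) return -1;                    /* added check: division by zero */
            stack[size++] = a / b;
            break;
        default:
            return -1;                                /* unknown character */
        }
    }
    if (size != 1) return -1;                         /* exactly one value must remain */
    *result = stack[0];
    return 0;
}
```

With the input traced in the table, rpn_eval("84/3+6*", &r) returns 0 and sets r to 30.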

Now we have all the functions needed to implement this, so let's get started.

_start:
    ; _start does not receive its arguments the way other functions do.
    ; Instead, esp points directly at argc (the argument count), and
    ; argv starts at esp+4. So [esp+4] holds a pointer to the program
    ; name, [esp+8] a pointer to the first argument, and so on
    mov esi, [esp+8]         ; esi = "input" = argv[1]
    ; call _strlen to find the length of the input
    push esi
    call _strlen
    mov ebx, eax             ; ebx = input_length
    add esp, 4
    ; end _strlen call
    mov ecx, 0               ; ecx = "i"
_main_loop_start:
    cmp ecx, ebx             ; if (i >= input_length), leave the loop
    jge _main_loop_end
    mov edx, 0
    mov dl, [esi + ecx]      ; load one byte from memory into the low byte of
                             ; edx. The rest of edx was zeroed just above.
                             ; edx = the variable c = input[i]
    cmp edx, '0'
    jl _check_operator
    cmp edx, '9'
    jg _print_error
    sub edx, '0'
    mov eax, edx             ; eax = c - '0' (the digit's value, not the character)
    jmp _push_eax_and_continue
_check_operator:
    ; call _pop twice to get b into edi and a into eax
    push ecx
    push ebx
    call _pop
    mov edi, eax             ; edi = b
    call _pop                ; eax = a
    pop ebx
    pop ecx
    ; end call _pop
    cmp edx, '+'
    jne _subtract
    add eax, edi                 ; eax = a+b
    jmp _push_eax_and_continue
_subtract:
    cmp edx, '-'
    jne _multiply
    sub eax, edi                 ; eax = a-b
    jmp _push_eax_and_continue
_multiply:
    cmp edx, '*'
    jne _divide
    imul eax, edi                ; eax = a*b
    jmp _push_eax_and_continue
_divide:
    cmp edx, '/'
    jne _print_error
    push edx                     ; save edx, because cdq overwrites it for idiv
    cdq                          ; sign-extend eax into edx:eax so a negative a divides correctly
    idiv edi                     ; eax = a/b
    pop edx
    ; now push eax onto the stack and continue
_push_eax_and_continue:
    ; call _push
    push eax
    push ecx
    push edx
    push eax          ; first argument
    call _push
    add esp, 4
    pop edx
    pop ecx
    pop eax
    ; end call _push
    inc ecx
    jmp _main_loop_start
_main_loop_end:
    cmp dword [stack_size], 1     ; if (stack_size != 1), print an error
    jne _print_error
    mov eax, [stack]
    push eax
    call _print_answer
    ; print a final newline
    push 0xA
    call _putc
    ; exit successfully
    mov eax, 0x01           ; 0x01 = exit()
    mov ebx, 0              ; 0 = no error
    int 0x80                ; execution ends here
_print_error:
    push error_msg
    call _print_msg
    mov eax, 0x01
    mov ebx, 1
    int 0x80

We also need to add the error_msg string to the .rodata section:

section .rodata
    ; Label some bytes as error_msg. The db pseudo-instruction in NASM
    ; accepts a single-byte value, a string constant, or a
    ; combination of both. 0xA = newline, 0x0 = string-terminating null
    error_msg: db "Invalid input", 0xA, 0x0

All the code is here. Thanks for reading! I can continue if you are interested.

Next steps

You can practise by implementing a few additional features:

Print an error message instead of segfaulting when the program receives no argument.
Support extra spaces between operands and operators in the input.
Support multi-digit operands.
Allow negative numbers in the input.
Replace _strlen with a function from the C standard library, and replace _print_answer with a call to printf.

Further reading

"University of Virginia's Guide to x86 Assembly": a more detailed treatment of many of the topics covered here, including more information on all the popular x86 instructions.
"The Art of Picking Intel Registers". Although most x86 registers are general-purpose, many have a historical meaning. Following these conventions can improve code readability and, as an interesting side effect, even slightly reduce the size of the binaries.
NASM: Intel x86 Instruction Reference: a complete guide to all the obscure x86 instructions.

Author: m1rko
