An interactive guide to x86-64 assembly


It’s often said that assembly language is complex. Most people are scared of it, and nearly everyone avoids it. After all, there’s a reason why high-level languages and compilers were invented, right?
But while it’s true that you would have a hard time writing a large project in assembly, the language itself is surprisingly simple. That’s because assembly is the native language of the processor, and at its essence, all the processor does is move data.

This guide is not about writing assembly; it’s about understanding the way data moves behind the scenes when you execute a program. We’ll use concrete examples for the x86-64 architecture, but the concepts apply to other architectures too, and they are fundamental knowledge for reverse engineering, binary exploitation, or just writing better code.

This is the first part of a series of interactive articles.

what is data?

Data is just bits, representing information. A sequence of bits can encode any kind of information, from simple numbers to audio and images. This article, however, focuses only on the simple fundamentals: text and integers.

But before we talk about any kind of encoding, we have to introduce a new notation. The issue is that while circuits understand sequences of bits very well, humans don’t. For example, can you tell the difference between 1101010101111110 and 1101010101111110?

Ok, the two sequences are identical, but I bet you couldn’t immediately see that.

In order to visualize binary data in a more human-friendly way, we use hexadecimal numbers, which associate a digit from 0 to 9 or a letter from A to F with each group of 4 bits.
A long sequence of bits can be represented in this way:

0010 0101 0111 1101 1111
   2    5    7    d    f

Note that in order to avoid confusion with decimal numbers, it’s common to prefix hexadecimal numbers with 0x. For example, 0x1234 is not the same thing as the decimal number 1234.
I’m not going to explain how conversions between decimal, binary, and hexadecimal numbers work; the only assumption I’m making in this article is that you already know how to do them.
If you have a Python terminal, you can perform these conversions very easily:

al@thinkpad:~/$ python
>>> 0b0010
2
>>> 0x1234
4660
>>> hex(0b00100101011111011111)
'0x257df'
>>> hex(4660)
'0x1234'
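If you want to see the grouping spelled out, here is a minimal Python sketch that converts a string of bits to hexadecimal one group of 4 bits at a time. The name bits_to_hex is made up for this example, and the sketch assumes the number of bits is a multiple of 4.

```python
HEX_DIGITS = "0123456789abcdef"

def bits_to_hex(bits: str) -> str:
    """Convert a string of 0s and 1s to hex, one 4-bit group at a time."""
    bits = bits.replace(" ", "")  # allow spaces between groups
    if len(bits) % 4 != 0:
        raise ValueError("expected a multiple of 4 bits")
    digits = []
    for i in range(0, len(bits), 4):
        group = bits[i:i + 4]                      # e.g. "0101"
        digits.append(HEX_DIGITS[int(group, 2)])   # value 0..15 -> one hex digit
    return "0x" + "".join(digits)

print(bits_to_hex("0010 0101 0111 1101 1111"))  # 0x257df
```

In practice the built-in hex() does the same job; the loop is only there to make the 4-bits-per-digit rule visible.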

In the rest of this article you will see a lot of hexadecimal numbers, mostly used to represent long sequences of bits.

It’s common to encounter sequences of specific lengths. For example, you probably already know that a sequence of 8 bits is called a byte. This is an example of a byte in binary: 00100001, and this is the same byte represented in hexadecimal: 0x21.
In the x86-64 architecture, which is the focus of this article, other sequences of specific lengths are used extensively. You can see them summarized in this table:

N. of bits   example hex value      name
4            0xf                    nibble
8            0xff                   byte
16           0xffff                 word
32           0xffffffff             dword (double word)
64           0xffffffffffffffff     qword (quadruple word)
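Each size in the table can hold values from 0 up to a number with every bit set to 1. As a quick illustration, this Python sketch prints that largest value for each size. The SIZES dictionary and the max_value function are names invented for this example.

```python
SIZES = {"nibble": 4, "byte": 8, "word": 16, "dword": 32, "qword": 64}

def max_value(bits: int) -> int:
    """Largest unsigned value that fits in the given number of bits."""
    return (1 << bits) - 1

for name, bits in SIZES.items():
    print(f"{bits:>2} bits  {hex(max_value(bits)):<18}  {name}")
```

Every extra group of 4 bits adds exactly one hexadecimal digit, which is why a qword takes 16 digits to write out.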

text

There are a lot of different ways to encode text, and I recommend that you read up on the bare minimum fundamentals; it’s a very interesting topic in itself. In this article, however, we’ll only focus on ASCII encoding, which is extremely simple:

All you need to know is that text is stored as a sequence of bytes. Every byte represents a character, and ASCII defines 128 of them (codes 0 to 127), covering digits, English letters, punctuation, and a few control characters. You can find a table of all the ASCII characters in the Linux man pages (man ascii).

For example, the letter ‘c’ is stored as the byte 0x63 and the letter ‘o’ as 0x6f. The text ciao is stored as the sequence of bytes 63 69 61 6f.
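You can check the ciao example yourself from a Python terminal. This is just a small sketch using built-in string and bytes methods; note that passing a separator to hex() needs Python 3.8 or newer.

```python
text = "ciao"
encoded = text.encode("ascii")        # str -> bytes, one byte per character

print(encoded.hex(" "))               # 63 69 61 6f
print(hex(ord("c")), hex(ord("o")))   # 0x63 0x6f

# Going the other way: bytes -> text
print(bytes.fromhex("63 69 61 6f").decode("ascii"))  # ciao
```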

where is data?

Now that we know how to represent text and numbers, we need some place to store them. Like all kinds of data, they can be stored in only two places:

  • in memory, which means in your RAM
  • in registers, which are special containers inside your CPU

memory

Memory is just a very long list of contiguous cells, each containing 8 bits of information, and reachable by a numeric address.
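If it helps to have a mental model, you can think of memory as a Python bytearray, where the index plays the role of the address. This is only an analogy, and the size and the address 0x10 below are arbitrary values picked for the example.

```python
memory = bytearray(32)                   # 32 cells, all initially 0; addresses 0..31

address = 0x10                           # arbitrary address chosen for the example
memory[address:address + 4] = b"ciao"    # store 4 bytes starting at 0x10

print(hex(memory[0x10]))   # 0x63, the cell at address 0x10 holds 'c'
print(hex(memory[0x13]))   # 0x6f, the cell at address 0x13 holds 'o'
print(memory[0x14])        # 0, untouched cells are still zero
```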

Since printing a long list of bytes would take a lot of space, when visualizing memory we usually group bytes in rows of 8 or 16. It’s also common to include a column to the side that shows the ASCII character associated with each byte.

The memory dump below was taken from a program that was running on my computer.

00000000  65 78 61 6d 70 6c 65 20 61 73 63 69 69 20 74 65  |example ascii te|
00000010  78 74 00 00 00 00 00 00 e9 51 55 55 55 55 00 00  |xt.......QUUUU..|
00000020  40 dc ff ff 01 00 00 00 58 dc ff ff ff 7f 00 00  |@.......X.......|
00000030  00 00 00 00 00 00 00 00 e8 04 be 12 78 e9 6f e0  |............x.o.|
00000040  58 dc ff ff ff 7f 00 00 e9 51 55 55 55 55 00 00  |X........QUUUU..|
00000050  98 7d 55 55 55 55 00 00 40 d0 ff f7 ff 7f 00 00  |.}UUUU..@.......|
00000060  e8 04 1c a4 87 16 90 1f e8 04 34 28 fd 06 90 1f  |..........4(....|
00000070  00 00 00 00 ff 7f 00 00 00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00 00 42 9e 87 5d ca 2f 7e  |.........B..]./~|
000000a0  00 00 00 00 00 00 00 00 40 9e c2 f7 ff 7f 00 00  |........@.......|
000000b0  68 dc ff ff ff 7f 00 00 98 7d 55 55 55 55 00 00  |h........}UUUU..|
000000c0  e0 e2 ff f7 ff 7f 00 00 00 00 00 00 00 00 00 00  |................|
000000d0  00 00 00 00 00 00 00 00 00 51 55 55 55 55 00 00  |.........QUUUU..|
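If you want to produce this kind of view yourself, here is a minimal Python sketch of a dump function. The name hexdump and its width parameter are choices made for this example, not part of any standard tool, and the input is the first 32 bytes of the dump above.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of: address, hex bytes, ASCII column."""
    pad = width * 3 - 1  # width of the hex column for a full row
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is 0x20..0x7e; everything else is shown as a dot.
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in chunk)
        # Left-align the hex column so a short last row still lines up.
        rows.append(f"{offset:08x}  {hex_part:<{pad}}  |{ascii_part}|")
    return "\n".join(rows)

data = bytes.fromhex(
    "6578616d706c65206173636969207465"
    "7874000000000000e951555555550000"
)
print(hexdump(data))           # two rows of 16 bytes
print(hexdump(data, width=8))  # the same bytes in four rows of 8
```

Changing width only changes how the bytes are laid out on screen; the bytes and their addresses stay the same.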


registers

Registers are containers for data, located inside your CPU. The x86-64 architecture has a lot of registers, each with an associated name. Some of them have a specific purpose; others are generic containers we can use in our programs. We mostly interact with these:


In order to understand these tables, we’ll look at the register rax, displayed in the first row. rax is a generic register that contains 8 bytes of data: from byte 0 to byte 7 as indicated by the byte numbers at the top of the table.

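The register eax gives you access to the lower 4 bytes of rax: reading eax is the same as reading bytes 0 to 3 of rax. Writing to eax sets those same bytes, but on x86-64 it also clears the upper 4 bytes of rax to zero.
Similarly, ax gives you access to the lower 2 bytes, and al to the lowest byte; writing to ax or al leaves the other bytes of rax unchanged.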
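One way to picture these overlapping registers is with bit masks. The Python sketch below is only a model of the idea, not how the CPU is implemented; the starting value of rax and the helper write_al are invented for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF   # keeps results within 8 bytes

rax = 0x1122334455667788      # arbitrary example value

eax = rax & 0xFFFFFFFF        # lower 4 bytes (bytes 0 to 3)
ax = rax & 0xFFFF             # lower 2 bytes (bytes 0 and 1)
al = rax & 0xFF               # lowest byte (byte 0)

print(hex(eax), hex(ax), hex(al))   # 0x55667788 0x7788 0x88

def write_al(rax: int, value: int) -> int:
    """Model a write to al: replace only the lowest byte of rax."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)

print(hex(write_al(rax, 0x42)))     # 0x1122334455667742
```

Notice that eax, ax, and al are not separate storage: they are just narrower views of the same 8 bytes.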

Finally, some code

We are assuming that you are familiar with some programming language; it doesn’t matter which one. Assembly code is structured like the programs you already know: a sequence of instructions, usually one per line, that are executed in order.

The x86-64 assembly syntax has two different dialects: AT&T and Intel. All the code snippets in this series of articles use the Intel syntax. The following snippet is an example of what the syntax looks like; don’t worry about what it does for now.

push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
add eax, 0x42
pop rbp
ret

A good way to familiarize yourself with the syntax is to look at the assembly generated from small snippets of code. The Compiler Explorer website is designed exactly for this use case: you can type snippets of code in any compiled language you know, and observe the generated assembly. If you hover the mouse over an assembly instruction, you can even see a description of what it does.

In the next article we are going to see in detail how each of the instructions in the previous example works.

Further Reading

This article is still under development, and it’s improving over time.
