It’s often said that assembly language is complex. Most people are scared of it, and everyone avoids it.
After all, there’s a reason why high-level languages and compilers were invented, right?
But while it’s true that you would have a hard time writing a large project in assembly,
the language itself is surprisingly simple.
That’s because assembly is the native language of the processor,
and at its core, all the processor does is move data.
This guide is not about writing assembly; it’s about understanding the way data moves behind the scenes when you execute a program. We’ll use concrete examples for the x86-64 architecture, but the same ideas apply to other architectures and are fundamental knowledge for reverse engineering, binary exploitation, or just writing better code.
This is the first part of a series of interactive articles:
- introduction (you are here)
- moving data
- stack frames
what is data?
Data is just bits, representing information. A sequence of bits can encode any kind of information, from simple numbers to audio and images. This article, however, will focus only on simple fundamentals: text and integers.
But before we talk about any kind of encoding, we have to introduce a new notation. The issue is that while circuits understand sequences of bits very well, humans don’t. For example, can you tell the difference between 1101010101111110 and 1101010101111110?
Ok, the two sequences are identical, but I bet you couldn’t immediately see that.
In order to visualize binary data in a more human-friendly way, we use
hexadecimal numbers, which represent each group of 4 bits with a single digit:
a number from 0 to 9 or a letter from A to F.
A long sequence of bits can be represented in this way:
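As a minimal sketch of that grouping, the Python snippet below splits the 16-bit sequence from the question above into groups of 4 bits and maps each group to one hexadecimal digit. The helper name `bits_to_hex` is made up for this illustration.

```python
def bits_to_hex(bits: str) -> str:
    """Translate a string of 0s and 1s into hexadecimal digits, 4 bits at a time."""
    if len(bits) % 4 != 0:
        raise ValueError("expected a multiple of 4 bits")
    digits = "0123456789abcdef"
    # Split the sequence into groups of 4 bits.
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    # Each group is a number from 0 to 15, which selects one hex digit.
    return "".join(digits[int(group, 2)] for group in groups)


print(bits_to_hex("1101010101111110"))  # 1101 0101 0111 1110 -> d57e
```

Four characters are much easier to compare at a glance than sixteen.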
Note that in order to avoid confusion with decimal numbers, it’s common to prefix
hexadecimal numbers with 0x.
For example, 0x1234 is not the same
thing as the decimal number 1234.
I’m not going to explain how conversions between decimal, binary, and hexadecimal numbers work.
The only assumption I’m making in this article is that you already know how to do them.
If you have a Python terminal, you can perform these conversions very easily:
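For instance, the built-in functions `int`, `hex`, and `bin` cover all the directions. This is only a sketch, written as a small script rather than a terminal session, and the values are arbitrary examples.

```python
# Parse text written in a given base into an integer.
from_hex = int("0x1234", 16)    # hexadecimal -> integer
from_bin = int("00100001", 2)   # binary -> integer

# Format an integer back into hexadecimal or binary text.
print(from_hex)        # 4660, not 1234
print(hex(1234))       # 0x4d2
print(bin(0x21))       # 0b100001
print(hex(from_bin))   # 0x21

# A format specifier pads the output to a fixed width.
print(f"{0x21:08b}")   # 00100001
```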
In the rest of this article you will see a lot of hexadecimal numbers, mostly used to represent long sequences of bits.
It’s common to encounter sequences of specific lengths.
For example, you probably already know that a sequence of 8 bits is called a byte.
This is an example of a byte in binary: 00100001, and this is the same byte represented in hexadecimal: 0x21.
In the x86-64 architecture, which is the focus of this article, there are other sequences of specific
lengths that are used extensively.
You can see them summarized in this table:
bits | max value (hex) | name |
4 | 0xf | nibble |
8 | 0xff | byte |
16 | 0xffff | word |
32 | 0xffffffff | dword (double word) |
64 | 0xffffffffffffffff | qword (quadruple word) |
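The largest value of each size is the one with all bits set to 1, that is 2^bits - 1, and every hexadecimal digit accounts for 4 bits. A quick sketch that computes these values in Python (the dictionary is just a convenient way to list the sizes):

```python
# Name of each size and its length in bits.
sizes = {"nibble": 4, "byte": 8, "word": 16, "dword": 32, "qword": 64}

for name, bits in sizes.items():
    max_value = (1 << bits) - 1   # all bits set to 1
    hex_digits = bits // 4        # one hex digit for every 4 bits
    print(f"{bits:>2} bits | {max_value:#x} | {name} ({hex_digits} hex digits)")
```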
text
There are a lot of different ways to encode text, and I recommend that you read up on the bare minimum fundamentals; it’s a very interesting topic in itself. In this article, however, we’ll only focus on ASCII encoding, which is extremely simple.
All you need to know is that text is stored as a sequence of bytes. Every byte represents a character, and ASCII defines 128 of them (codes 0 to 127): digits, English letters, punctuation, and a few control characters. You can find a table of all the ASCII characters in the Linux man pages (man ascii).
For example, the letter ‘c’ is stored as the byte 0x63 and the letter ‘o’ as 0x6f. The text ciao is stored as the sequence of bytes 63 69 61 6f.
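The same example can be reproduced in Python. This is a small sketch using only built-in methods; the variable names are arbitrary.

```python
text = "ciao"

# Encode the text as ASCII: one byte per character.
data = text.encode("ascii")
print(data.hex(" "))          # 63 69 61 6f

# ord() gives the number behind a single character, chr() goes the other way.
print(hex(ord("c")))          # 0x63
print(chr(0x6f))              # o

# Decoding the same bytes gives the text back.
print(bytes.fromhex("63 69 61 6f").decode("ascii"))   # ciao
```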
where is data?
Now that we know how to represent text and numbers, we need some place to store them. Like all kinds of data, while a program is working on them they can live in only two places:
- in memory, which means in your RAM
- in registers, which are special containers inside your CPU
memory
Memory is just a very long list of contiguous cells, each containing 8 bits of information, and reachable by a numeric address.
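As a rough mental model (not a description of how the hardware is built), memory behaves like a Python `bytearray`: a list of cells that each hold one byte, where the index plays the role of the address. The size and the addresses below are arbitrary.

```python
# A tiny "memory" of 16 cells, all initialized to zero.
memory = bytearray(16)

# Store one byte at address 0x3.
memory[0x3] = 0x63

# Store several contiguous bytes starting at address 0x8.
memory[0x8:0x8 + 4] = b"ciao"

# Read back a single cell and a range of cells.
print(hex(memory[0x3]))        # 0x63
print(memory[0x8:0xc])         # bytearray(b'ciao')

# Each cell holds exactly 8 bits, so values above 0xff do not fit.
try:
    memory[0] = 0x100
except ValueError as error:
    print("error:", error)
```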
Since printing a long list of bytes would take a lot of space, when visualizing memory we usually group bytes in rows of 8 or 16. It’s also common to include a column to the side that shows the ASCII character associated with each byte.
The memory dump below was taken from a program that was running on my computer.
00000000  65 78 61 6d 70 6c 65 20 61 73 63 69 69 20 74 65  example ascii te
00000010  78 74 00 00 00 00 00 00 e9 51 55 55 55 55 00 00  xt.......QUUUU..
00000020  40 dc ff ff 01 00 00 00 58 dc ff ff ff 7f 00 00  @.......X.......
00000030  00 00 00 00 00 00 00 00 e8 04 be 12 78 e9 6f e0  ............x.o.
00000040  58 dc ff ff ff 7f 00 00 e9 51 55 55 55 55 00 00  X........QUUUU..
00000050  98 7d 55 55 55 55 00 00 40 d0 ff f7 ff 7f 00 00  .}UUUU..@.......
00000060  e8 04 1c a4 87 16 90 1f e8 04 34 28 fd 06 90 1f  ..........4(....
00000070  00 00 00 00 ff 7f 00 00 00 00 00 00 00 00 00 00  ................
00000080  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000090  00 00 00 00 00 00 00 00 00 42 9e 87 5d ca 2f 7e  .........B..]./~
000000a0  00 00 00 00 00 00 00 00 40 9e c2 f7 ff 7f 00 00  ........@.......
000000b0  68 dc ff ff ff 7f 00 00 98 7d 55 55 55 55 00 00  h........}UUUU..
000000c0  e0 e2 ff f7 ff 7f 00 00 00 00 00 00 00 00 00 00  ................
000000d0  00 00 00 00 00 00 00 00 00 51 55 55 55 55 00 00  .........QUUUU..
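A dump like the one above can be produced with a few lines of Python. This is a minimal sketch: the function name `hexdump` and its `width` parameter are made up for this example, with `width` being the number of bytes shown in a row, and the sample data is only the first 32 bytes of the dump.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of: address, hex bytes, printable ASCII."""
    hex_width = width * 3 - 1   # two digits per byte plus separating spaces
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Bytes outside the printable ASCII range are shown as a dot.
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on a short last row.
        rows.append(f"{offset:08x}  {hex_part:<{hex_width}}  {ascii_part}")
    return "\n".join(rows)


data = b"example ascii text" + bytes(6) + bytes.fromhex("e951555555550000")
print(hexdump(data, width=8))
```

Changing `width` only changes how the same bytes are laid out on screen; the addresses and the content stay the same.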
registers
Registers are containers for data, located inside your CPU. The x86-64 architecture has a lot of registers, each with an associated name. Some of them have a specific purpose; others are generic containers we can use in our program. We mostly interact with these:
In order to understand these tables, we’ll look at the register rax, displayed in the first row. rax is a generic register that contains 8 bytes of data: from byte 0 to byte 7 as indicated by the byte numbers at the top of the table.
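The register eax gives you access to the
lower 4 bytes of rax; reading eax is the same as reading
the bytes from 0 to 3 of rax. Writing to eax replaces those bytes and, on x86-64, also sets the upper 4 bytes of rax to zero.
Similarly, ax gives you access to the lower 2 bytes, and al to the lowest byte; writing to them leaves the rest of rax untouched.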
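These overlapping views can be mimicked with bit masks. The sketch below treats the content of rax as a plain Python integer (the value is arbitrary); the variable names simply mirror the register names, and this is only a model, not how the CPU implements it.

```python
# An arbitrary 8-byte value standing in for the content of rax.
rax = 0x1122334455667788

eax = rax & 0xffffffff   # lower 4 bytes
ax = rax & 0xffff        # lower 2 bytes
al = rax & 0xff          # lowest byte

print(f"rax = {rax:#018x}")   # rax = 0x1122334455667788
print(f"eax = {eax:#010x}")   # eax = 0x55667788
print(f"ax  = {ax:#06x}")     # ax  = 0x7788
print(f"al  = {al:#04x}")     # al  = 0x88

# Writing 0x42 into al replaces only the lowest byte of rax.
rax = (rax & ~0xff) | 0x42
print(f"rax = {rax:#018x}")   # rax = 0x1122334455667742
```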
Finally, some code
We are assuming that you are familiar with some programming language; it doesn’t matter which one. Assembly code looks similar to what you already know: a sequence of instructions, usually one per line, that are executed in order.
The x86-64 assembly syntax has two different dialects: AT&T and Intel. All the code snippets in this series of articles use the Intel syntax. The following snippet is an example of what the syntax looks like; don’t worry about what it does for now.
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    add eax, 0x42
    pop rbp
    ret

A good way to familiarize yourself with the syntax is to look at the assembly generated from small snippets of code. The compiler explorer website is designed exactly for this use case: you can type snippets of code in any compiled language you know, and observe the generated assembly. If you hover the mouse over an assembly instruction you can even see a description of what it does.
In the next article we are going to see in detail how each of the instructions in the previous example works.
Further Reading
This article is still under development, and it’s improving over time.
If you reached this point, you might be interested in the next articles:
- introduction (you are here)
- moving data
- stack frames
Additional resources:
- pwn.college’s assembly module and lectures https://pwn.college/fundamentals/assembly-crash-course
- the compiler explorer website https://godbolt.org/z/c6brc1df9
- the official x86_64 reference
- unofficial x86_64 instructions reference https://www.felixcloutier.com/x86/
- the best Linux syscall table reference https://syscalls.mebeim.net/?table=x86/64/x64/latest