Language Models Can Disassemble Code
OpenAI Codex seems to have some understanding of assembly language. It can even work backwards from an assembly listing to plausible C source.
Assembly
I prompted OpenAI Codex to invent a simple test C program, gave it instructions for compiling the program to assembly code, and then had it autocomplete what it believed the output would be. Here are the results.
You can use the -S option to gcc to produce assembly code.
gcc -S source.c
This will produce a file called source.s.
For example, suppose that source.c contains the following code:
#include <stdio.h>
int main(void)
{
    int a = 1;
    int b = 2;
    printf("%d\n", a + b);
    return 0;
}
If you compile this with gcc -S source.c, you will get the following assembly code in the file source.s:
.file "source.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $1, -4(%rbp)
movl $2, -8(%rbp)
movl -4(%rbp), %eax
addl -8(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
Let’s explain roughly how this works, and some of the confusing notation and labels. Note: I wrote this line, and then it autocompleted the following explanation:
First, we see that the main function is defined:
.globl main
.type main, @function
main:
The .globl directive tells the assembler that the main symbol is global, i.e. that it can be accessed from other files.
The .type directive tells the assembler that the symbol main is a function.
The main: label tells the assembler that the next few lines of code are the definition of the main function.
Next, we see the following:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
This is the function prologue.
The function prologue is code that is executed before the function body.
The .LFB0: label is a local label that is used to mark the beginning of the function prologue.
The .cfi_startproc directive tells the assembler that the next few lines are the function prologue.
The pushq %rbp instruction pushes the value of the rbp register onto the stack.
The .cfi_def_cfa_offset 16 directive tells the assembler that the stack pointer is 16 bytes away from the current frame pointer.
The .cfi_offset 6, -16 directive tells the assembler that the current frame pointer is 16 bytes away from the previous frame pointer.
The movq %rsp, %rbp instruction moves the value of the stack pointer into the frame pointer.
The .cfi_def_cfa_register 6 directive tells the assembler that the current frame pointer is rbp.
Next, we see the following:
movl $1, -4(%rbp)
movl $2, -8(%rbp)
These instructions move the values 1 and 2 into the variables a and b respectively.
Next, we see the following:
movl -4(%rbp), %eax
addl -8(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
These instructions move the value of a into the eax register, add the value of b to the eax register, move the result into the esi register, move the address of the string "%d\n" into the edi register, move the value 0 into the eax register, and call the printf function.
Next, we see the following:
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
This is the function epilogue.
The function epilogue is code that is executed after the function body.
The movl $0, %eax instruction moves the value 0 into the eax register.
The popq %rbp instruction pops the value of the rbp register off of the stack.
The .cfi_def_cfa 7, 8 directive tells the assembler that the stack pointer is 8 bytes away from the current frame pointer.
The ret instruction returns from the function, and the .cfi_endproc directive tells the assembler that the function epilogue ends here.
Finally, we see the following:
.LFE0:
.size main, .-main
The .LFE0: label is a local label that is used to mark the end of the function.
The .size directive tells the assembler that the size of the main function is the difference between the main label and the .LFE0 label.
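As a hand-written cross-check (not part of the Codex output), the body of main in the listing above can be replayed in C with one variable per stack slot or register. This is only a sketch: the function name replay_first_listing and the variable names are made up for illustration, and the function returns the printed value (instead of 0, as main does) purely so that it is easy to check.

```c
#include <stdio.h>

/* Hypothetical helper, not from the original post: replays the body of main
   from the listing above, one C statement per instruction. */
int replay_first_listing(void)
{
    int slot_a = 1;       /* movl $1, -4(%rbp)    : a is stored at rbp-4 */
    int slot_b = 2;       /* movl $2, -8(%rbp)    : b is stored at rbp-8 */
    int eax = slot_a;     /* movl -4(%rbp), %eax */
    eax += slot_b;        /* addl -8(%rbp), %eax */
    int esi = eax;        /* movl %eax, %esi      : second argument to printf */
    /* movl $.LC0, %edi   : first argument, the address of the format string
       movl $0, %eax      : no vector registers are used for this variadic call
       call printf */
    printf("%d\n", esi);
    return esi;           /* returned only for checking; main itself returns 0 */
}
```

Calling replay_first_listing() prints 3, the same value the original C program prints for a + b.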
Disassembly
Human writing.
Let’s talk about disassembly. For this one, I wrote (by hand) the following source.c file:
#include <stdio.h>
int main(void)
{
    int a = 1;
    int b = 1;
    for (int i = 0; i < 10; ++i) {
        int c = a + b;
        a = b;
        b = c;
    }
    printf("%d\n", a + b);
    return 0;
}
We compile it to assembly with gcc -S source.c, yielding source.s:
.file "source.c"
.text
.section .rodata
.LC0:
.string "%d\n"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $1, -16(%rbp)
movl $1, -12(%rbp)
movl $0, -8(%rbp)
jmp .L2
.L3:
movl -16(%rbp), %edx
movl -12(%rbp), %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -12(%rbp), %eax
movl %eax, -16(%rbp)
movl -4(%rbp), %eax
movl %eax, -12(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $9, -8(%rbp)
jle .L3
movl -16(%rbp), %edx
movl -12(%rbp), %eax
addl %edx, %eax
movl %eax, %esi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
Note: to clarify, the above listing was produced by gcc, not by OpenAI Codex. We are going to have Codex try to reverse-engineer it. Strictly speaking, going from assembly back to C is decompilation, though this post calls it disassembly. To this end I prompted Codex with the compilation instruction, the above assembly listing, and the next two paragraphs, and you can see the results:
By inspecting this and understanding what is happening, we can reconstruct a possible C program source.c. This is called disassembly.
Here is one possible value for the file source.c:
#include <stdio.h>
int main() {
    int a = 1;
    int b = 1;
    for (int i = 0; i < 9; i++) {
        int temp = a;
        a = b;
        b = temp + b;
    }
    printf("%d\n", a + b);
}
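Looks like it made an off-by-one error: in the assembly, cmpl $9, -8(%rbp) followed by jle .L3 keeps looping while i <= 9, which is ten iterations, but Codex wrote i < 9, which is only nine. Stupid computer!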
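To see the difference concretely, here is a small sketch that puts the two loops side by side. The first function follows the control flow of the gcc listing literally, using goto for the jmp and jle instructions; the second is the loop as Codex reconstructed it. The function names sum_from_assembly and sum_from_codex are my own, chosen for illustration.

```c
/* Follows the gcc listing's control flow: jump to the test at .L2 first,
   then run the body at .L3 for as long as i <= 9. */
int sum_from_assembly(void)
{
    int a = 1, b = 1, i = 0, c;
    goto L2;              /* jmp .L2 */
L3:
    c = a + b;            /* loop body: one Fibonacci step */
    a = b;
    b = c;
    i += 1;               /* addl $1, -8(%rbp) */
L2:
    if (i <= 9)           /* cmpl $9, -8(%rbp) ; jle .L3 */
        goto L3;
    return a + b;         /* the value passed to printf */
}

/* The loop as Codex reconstructed it, with the i < 9 bound. */
int sum_from_codex(void)
{
    int a = 1, b = 1;
    for (int i = 0; i < 9; i++) {
        int temp = a;
        a = b;
        b = temp + b;
    }
    return a + b;         /* the value passed to printf */
}
```

sum_from_assembly() returns 233, matching the hand-written program, while sum_from_codex() returns 144, one Fibonacci step short.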