Language Models Can Disassemble Code
OpenAI Codex seems to have some understanding of assembly language. It can even disassemble.
Assembly
I prompted OpenAI Codex to invent a simple C test program, gave it instructions for compiling the program to assembly code, and then had it autocomplete what it believed the result would be. Here are the results.
You can use the -S option to gcc to produce assembly code.
gcc -S source.c
This will produce a file called source.s.
For example, suppose that source.c contains the following code:
#include <stdio.h>
int main(void)
{
    int a = 1;
    int b = 2;
    printf("%d\n", a + b);
    return 0;
}
If you compile this with gcc -S source.c, you will get the following assembly code in the file source.s:
.file "source.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $1, -4(%rbp)
movl $2, -8(%rbp)
movl -4(%rbp), %eax
addl -8(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
Let’s explain roughly how this works, and some of the confusing notation and labels. Note: I wrote this line, and then it autocompleted the following explanation:
First, we see that the main function is defined:
.globl main
.type main, @function
main:
The .globl directive tells the assembler that the main symbol is global, i.e. that it can be accessed from other files.
The .type directive tells the assembler that the symbol main is a function.
The main: label tells the assembler that the next few lines of code are the definition of the main function.
Next, we see the following:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
This is the function prologue.
The function prologue is code that is executed before the function body.
The .LFB0: label is a local label that is used to mark the beginning of the function prologue.
The .cfi_startproc directive tells the assembler that the next few lines are the function prologue.
The pushq %rbp instruction pushes the value of the rbp register onto the stack.
The .cfi_def_cfa_offset 16 directive tells the assembler that the stack pointer is 16 bytes away from the current frame pointer.
The .cfi_offset 6, -16 directive tells the assembler that the current frame pointer is 16 bytes away from the previous frame pointer.
The movq %rsp, %rbp instruction moves the value of the stack pointer into the frame pointer.
The .cfi_def_cfa_register 6 directive tells the assembler that the current frame pointer is rbp.
Next, we see the following:
movl $1, -4(%rbp)
movl $2, -8(%rbp)
These instructions move the values 1 and 2 into the variables a and b respectively.
Next, we see the following:
movl -4(%rbp), %eax
addl -8(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
These instructions move the value of a into the eax register, add the value of b to the eax register, move the result into the esi register, move the address of the string "%d\n" into the edi register, move the value 0 into the eax register, and call the printf function.
Next, we see the following:
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
This is the function epilogue.
The function epilogue is code that is executed after the function body.
The movl $0, %eax instruction moves the value 0 into the eax register.
The popq %rbp instruction pops the value of the rbp register off of the stack.
The .cfi_def_cfa 7, 8 directive tells the assembler that the stack pointer is 8 bytes away from the current frame pointer.
The ret instruction returns from the function, and the .cfi_endproc directive tells the assembler that the function epilogue ends here.
Finally, we see the following:
.LFE0:
.size main, .-main
The .LFE0: label is a local label that is used to mark the end of the function.
The .size directive tells the assembler that the size of the main function is the difference between the main label and the .LFE0 label.
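To sanity-check the walkthrough of the function body, here is a small hand-written sketch (mine, not Codex output) that replays the data-moving instructions one statement at a time. The function name and the variables standing in for stack slots and registers are hypothetical; the only thing assumed beyond the listing is the usual x86-64 System V convention that esi carries the second integer argument.

```c
/* Hand-written sketch: each C statement stands in for one instruction
   from the listing above. The names are hypothetical stand-ins for
   stack slots and registers, not anything gcc or Codex produced. */
int emulate_main_body(void)
{
    int slot_a;  /* -4(%rbp), holds a */
    int slot_b;  /* -8(%rbp), holds b */
    int eax;
    int esi;

    slot_a = 1;      /* movl $1, -4(%rbp)   */
    slot_b = 2;      /* movl $2, -8(%rbp)   */
    eax = slot_a;    /* movl -4(%rbp), %eax */
    eax += slot_b;   /* addl -8(%rbp), %eax */
    esi = eax;       /* movl %eax, %esi     */
    return esi;      /* the value printf would receive as its second argument */
}
```

Printing the return value with "%d\n" gives 3, which is what the C program itself prints.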
Disassembly
Human writing.
Let’s talk about disassembly. For this one, I wrote (by hand) the following source.c file:
#include <stdio.h>
int main(void)
{
    int a = 1;
    int b = 1;
    for (int i = 0; i < 10; ++i) {
        int c = a + b;
        a = b;
        b = c;
    }
    printf("%d\n", a + b);
    return 0;
}
We compile it to assembly with gcc -S source.c, yielding source.s:
.file "source.c"
.text
.section .rodata
.LC0:
.string "%d\n"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $1, -16(%rbp)
movl $1, -12(%rbp)
movl $0, -8(%rbp)
jmp .L2
.L3:
movl -16(%rbp), %edx
movl -12(%rbp), %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -12(%rbp), %eax
movl %eax, -16(%rbp)
movl -4(%rbp), %eax
movl %eax, -12(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $9, -8(%rbp)
jle .L3
movl -16(%rbp), %edx
movl -12(%rbp), %eax
addl %edx, %eax
movl %eax, %esi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
Note: to clarify, the above listing came from gcc, not OpenAI Codex. We are going to have Codex try to reverse-engineer it. I call this disassembly here, although recovering C source from assembly is more precisely called decompilation. To this end, I prompted OpenAI Codex with the compilation instruction, the above assembly code listing, and the next two paragraphs; its completion follows them:
By inspecting this and understanding what is happening, we can reconstruct a possible C program source.c. This is called disassembly.
Here is one possible value for the file source.c:
#include <stdio.h>
int main() {
    int a = 1;
    int b = 1;
    for (int i = 0; i < 9; i++) {
        int temp = a;
        a = b;
        b = temp + b;
    }
    printf("%d\n", a + b);
}
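Looks like it made an off-by-one error. In the gcc listing, cmpl $9, -8(%rbp) followed by jle .L3 keeps looping while i <= 9, which is ten iterations, but Codex wrote i < 9, which is only nine. The loop body itself is equivalent to mine. Stupid computer!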
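To make the difference in loop bounds concrete, here is a sketch that puts the two versions side by side. The first function is my own hand translation of the gcc listing that keeps its jump structure (the labels L2 and L3 mirror .L2 and .L3); the second is Codex's loop copied from above. Both function names are made up for this illustration, and both return the value that would be passed to printf instead of printing it.

```c
/* Hand translation of the gcc listing, keeping its control flow:
   jump straight to the test at .L2, then loop back to .L3 while i <= 9. */
int sum_from_assembly(void)
{
    int a = 1;   /* -16(%rbp) */
    int b = 1;   /* -12(%rbp) */
    int i = 0;   /* -8(%rbp)  */
    int c;       /* -4(%rbp)  */

    goto L2;             /* jmp .L2 */
L3:
    c = a + b;           /* load a and b, addl, store to -4(%rbp) */
    a = b;               /* copy -12(%rbp) into -16(%rbp) */
    b = c;               /* copy -4(%rbp) into -12(%rbp)  */
    i += 1;              /* addl $1, -8(%rbp) */
L2:
    if (i <= 9)          /* cmpl $9, -8(%rbp) then jle .L3 */
        goto L3;
    return a + b;        /* the value moved into %esi for printf */
}

/* Codex's reconstruction, with its loop bound of 9. */
int sum_from_codex(void)
{
    int a = 1;
    int b = 1;
    for (int i = 0; i < 9; i++) {
        int temp = a;
        a = b;
        b = temp + b;
    }
    return a + b;
}
```

The hand translation runs the body ten times and returns 233, while Codex's version runs it nine times and returns 144, so the two programs print different numbers.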