
OpenAI Codex seems to have some understanding of assembly language. It can even attempt to go the other way, from assembly back to C.

Assembly

I prompted OpenAI Codex to invent a simple C test program, gave it the instructions for compiling that program to assembly, and then had it autocomplete what it believed the output would be. Here are the results.

You can use the -S option to gcc to produce assembly code.

gcc -S source.c

This will produce a file called source.s.

For example, suppose that source.c contains the following code:

#include <stdio.h>

int main(void)
{
    int a = 1;
    int b = 2;

    printf("%d\n", a + b);
    return 0;
}

If you compile this with gcc -S source.c, you will get the following assembly code in the file source.s:

    .file   "source.c"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $1, -4(%rbp)
    movl    $2, -8(%rbp)
    movl    -4(%rbp), %eax
    addl    -8(%rbp), %eax
    movl    %eax, %esi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    printf
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

Let’s explain roughly how this works, and some of the confusing notation and labels. Note: I wrote this line, and then it autocompleted the following explanation:

First, we see that the main function is defined:

    .globl  main
    .type   main, @function
main:

The .globl directive tells the assembler that the main symbol is global, i.e. that it can be accessed from other files.

The .type directive tells the assembler that the symbol main is a function.

The main: label tells the assembler that the next few lines of code are the definition of the main function.

Next, we see the following:

.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6

This is the function prologue.

The function prologue is code that is executed before the function body.

The .LFB0: label is a local label that is used to mark the beginning of the function prologue.

The .cfi_startproc directive tells the assembler that the next few lines are the function prologue.

The pushq %rbp instruction pushes the value of the rbp register onto the stack.

The .cfi_def_cfa_offset 16 directive tells the assembler that the stack pointer is 16 bytes away from the current frame pointer.

The .cfi_offset 6, -16 directive tells the assembler that the current frame pointer is 16 bytes away from the previous frame pointer.

The movq %rsp, %rbp instruction moves the value of the stack pointer into the frame pointer.

The .cfi_def_cfa_register 6 directive tells the assembler that the current frame pointer is rbp.

Next, we see the following:

    movl    $1, -4(%rbp)
    movl    $2, -8(%rbp)

These instructions move the values 1 and 2 into the variables a and b respectively.

Next, we see the following:

    movl    -4(%rbp), %eax
    addl    -8(%rbp), %eax
    movl    %eax, %esi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    printf

These instructions move the value of a into the eax register, add the value of b to the eax register, move the result into the esi register, move the address of the string "%d\n" into the edi register, move the value 0 into the eax register, and call the printf function.
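
As an aside, the sequence above can be replayed one C statement at a time. This is only a sketch: trace_sum and the slot_* names are made up here, and the C variables eax and esi merely stand in for the registers. Under the System V AMD64 calling convention that gcc uses on Linux, %edi carries the first argument (the address of the format string) and %esi the second, and the zero in %eax tells the variadic printf that no vector registers hold arguments.

```c
#include <stdint.h>

/* Hypothetical helper: replays the listing's arithmetic one instruction per statement. */
int32_t trace_sum(void)
{
    int32_t slot_a = 1;   /* movl $1, -4(%rbp)     : a */
    int32_t slot_b = 2;   /* movl $2, -8(%rbp)     : b */
    int32_t eax;          /* stands in for %eax */
    int32_t esi;          /* stands in for %esi */

    eax = slot_a;         /* movl -4(%rbp), %eax */
    eax += slot_b;        /* addl -8(%rbp), %eax */
    esi = eax;            /* movl %eax, %esi       : second printf argument */

    return esi;           /* the value printf would format with %d */
}
```

Calling trace_sum() gives the a + b that the original program passes to printf.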

Next, we see the following:

    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

This is the function epilogue.

The function epilogue is code that is executed after the function body.

The movl $0, %eax instruction moves the value 0 into the eax register.

The popq %rbp instruction pops the value of the rbp register off of the stack.

The .cfi_def_cfa 7, 8 directive tells the assembler that the stack pointer is 8 bytes away from the current frame pointer.

The ret instruction returns from the function, and the .cfi_endproc directive tells the assembler that the function epilogue ends here.
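
A side note on the .cfi_* directives, since the generated explanation is loose about them. As far as I understand, they emit no instructions; they tell debuggers and unwinders how to compute the canonical frame address (CFA), which is the value of the stack pointer just before the call. The sketch below models that rule as "CFA = register + offset" and replays the directives from the listing. The struct and function names are made up for illustration, and the register numbers assume the x86-64 DWARF numbering (6 is %rbp, 7 is %rsp). The .cfi_offset 6, -16 directive is not modeled; it records that the caller's %rbp is saved at CFA - 16.

```c
/* DWARF register numbers on x86-64, as used in the listing. */
enum { REG_RBP = 6, REG_RSP = 7 };

/* Hypothetical model of the unwinder's rule: CFA = value_of(reg) + offset. */
struct cfa_rule {
    int reg;
    int offset;
};

/* .cfi_startproc: on entry only the return address is on the stack,
   so the rule starts out as CFA = %rsp + 8. */
struct cfa_rule cfi_startproc(void)
{
    struct cfa_rule r = { REG_RSP, 8 };
    return r;
}

/* .cfi_def_cfa_offset N: keep the register, change the offset. */
struct cfa_rule cfi_def_cfa_offset(struct cfa_rule r, int offset)
{
    r.offset = offset;
    return r;
}

/* .cfi_def_cfa_register R: keep the offset, change the register. */
struct cfa_rule cfi_def_cfa_register(struct cfa_rule r, int reg)
{
    r.reg = reg;
    return r;
}

/* .cfi_def_cfa R, N: set register and offset at once. */
struct cfa_rule cfi_def_cfa(int reg, int offset)
{
    struct cfa_rule r = { reg, offset };
    return r;
}

/* Replays the directives of main in listing order and returns the final rule. */
struct cfa_rule replay_main(void)
{
    struct cfa_rule r = cfi_startproc();   /* CFA = %rsp + 8                         */
    r = cfi_def_cfa_offset(r, 16);         /* after pushq %rbp:      CFA = %rsp + 16 */
    r = cfi_def_cfa_register(r, REG_RBP);  /* after movq %rsp, %rbp: CFA = %rbp + 16 */
    r = cfi_def_cfa(REG_RSP, 8);           /* after popq %rbp:       CFA = %rsp + 8  */
    return r;
}
```

The rule ends where it started, at %rsp + 8, which is what an unwinder needs to find the return address when ret executes.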

Finally, we see the following:

.LFE0:
    .size   main, .-main

The .LFE0: label is a local label that is used to mark the end of the function.

The .size directive tells the assembler that the size of the main function is the difference between the main label and the .LFE0 label.

Disassembly

Human writing from here.

Let’s talk about disassembly. For this one, I wrote (by hand) the following source.c file:

#include <stdio.h>

int main(void)
{
    int a = 1;
    int b = 1;
    for (int i = 0; i < 10; ++i) {
        int c = a + b;
        a = b;
        b = c;
    }
    printf("%d\n", a + b);
    return 0;
}

We compile it to assembly with gcc -S source.c, yielding source.s:

	.file	"source.c"
	.text
	.section	.rodata
.LC0:
	.string	"%d\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$16, %rsp
	movl	$1, -16(%rbp)
	movl	$1, -12(%rbp)
	movl	$0, -8(%rbp)
	jmp	.L2
.L3:
	movl	-16(%rbp), %edx
	movl	-12(%rbp), %eax
	addl	%edx, %eax
	movl	%eax, -4(%rbp)
	movl	-12(%rbp), %eax
	movl	%eax, -16(%rbp)
	movl	-4(%rbp), %eax
	movl	%eax, -12(%rbp)
	addl	$1, -8(%rbp)
.L2:
	cmpl	$9, -8(%rbp)
	jle	.L3
	movl	-16(%rbp), %edx
	movl	-12(%rbp), %eax
	addl	%edx, %eax
	movl	%eax, %esi
	leaq	.LC0(%rip), %rdi
	movl	$0, %eax
	call	printf@PLT
	movl	$0, %eax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	 1f - 0f
	.long	 4f - 1f
	.long	 5
0:
	.string	 "GNU"
1:
	.align 8
	.long	 0xc0000002
	.long	 3f - 2f
2:
	.long	 0x3
3:
	.align 8
4:
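
One way to read the assembly listing is to mirror its control flow directly in C with goto. The sketch below is my own hand translation, not output from gcc or Codex; fib_sum_goto is a made-up name, and the C labels L2 and L3 stand for .L2 and .L3. Note how gcc jumps to the loop test first and places the loop body before it.

```c
/* Hypothetical hand translation of main from the gcc listing.
   It returns the value that the listing passes to printf. */
int fib_sum_goto(void)
{
    int a = 1;      /* movl $1, -16(%rbp) */
    int b = 1;      /* movl $1, -12(%rbp) */
    int i = 0;      /* movl $0, -8(%rbp)  */
    int c;          /* stack slot at -4(%rbp) */

    goto L2;        /* jmp .L2 */
L3:
    c = a + b;      /* load a and b, addl, store to -4(%rbp) */
    a = b;          /* copy -12(%rbp) to -16(%rbp) */
    b = c;          /* copy -4(%rbp) to -12(%rbp)  */
    i += 1;         /* addl $1, -8(%rbp) */
L2:
    if (i <= 9)     /* cmpl $9, -8(%rbp) followed by jle .L3 */
        goto L3;

    return a + b;   /* addl %edx, %eax ; movl %eax, %esi */
}
```

The test i <= 9 is the compiler's way of writing i < 10, so the body runs for i = 0 through 9.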

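Note: to clarify, the assembly listing above came from gcc, not OpenAI Codex. We are going to have Codex try to reverse-engineer it into C. I call this disassembly here, although strictly speaking going from assembly back to C is usually called decompilation, and disassembly means going from machine code to assembly. To this end I prompted Codex with the compilation instruction, the assembly listing above, and the next two paragraphs. Everything after those two paragraphs is its output: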

By inspecting this and understanding what is happening, we can reconstruct a possible C program source.c. This is called disassembly.

Here is one possible value for the file source.c:

#include <stdio.h>

int main() {
    int a = 1;
    int b = 1;
    for (int i = 0; i < 9; i++) {
        int temp = a;
        a = b;
        b = temp + b;
    }
    printf("%d\n", a + b);
}

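Looks like it made an off-by-one error: its loop runs 9 times, but the cmpl $9 / jle pair in the listing keeps looping while i <= 9, which is 10 iterations. Stupid computer!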
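
To see how much the loop bound matters, here is a small sketch that runs the same update a configurable number of times. The function fib_sum and its iterations parameter are made up for illustration. The body follows my source.c; Codex's temp-based version performs the same update per iteration, so only the count differs.

```c
/* Hypothetical helper: applies the a/b update from source.c
   `iterations` times and returns the value the program would print. */
int fib_sum(int iterations)
{
    int a = 1;
    int b = 1;

    for (int i = 0; i < iterations; ++i) {
        int c = a + b;
        a = b;
        b = c;
    }
    return a + b;
}
```

With 10 iterations, as in my source.c, fib_sum returns 233. With 9 iterations, as in the reconstruction, it returns 144, so the two programs print different numbers.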
