Owning the Image Object File Format, the Compiler Toolchain, and the Operating System: Solving Intractable Performance Problems Through Vertical Engineering

Closing Down Another Attack Vector

As the Windows kernel continues its quest for ever-stronger security features and exploit mitigations, the existence of fixed addresses in memory continues to undermine the advances in this area, as attackers can combine data corruption vulnerabilities with stack and instruction pointer control in order to bypass SMEP, DEP, and countless other architectural defense-in-depth techniques. In some cases, entire mitigations (such as CFG) are undone due to their reliance on a single, well-known static address.

In the latest builds of Windows 10 Redstone 1, aka “Anniversary Update”, the kernel takes a much stronger stance toward Kernel Address Space Layout Randomization (KASLR), employing an arsenal of tools that is only available to an operating system developer that also happens to own the world’s most commercially successful compiler, and the world’s most pervasive executable object image format.

The Page Table Entry Array

One of the more distinctive aspects of the Windows kernel is the reliance on a fixed kernel address to represent the virtual base address of an array of page table entries that describes the entire virtual address space, and the usage of a self-referencing entry which acts as a pivot describing the page directory for the space itself (and, on x64 systems, describing the page directory pointer table itself, and the page map level 4 itself).

This elegant solution allows instant O(1) translation of any virtual address to its corresponding PTE, and with the correct shifts and base addresses, a conversion into the corresponding PDE (and PPE/PXE on x64 systems). For example, the function MmGetPhysicalAddress only needs to work as follows:

PHYSICAL_ADDRESS
MmGetPhysicalAddress (
    _In_ PVOID Address
    )
{
    MMPTE TempPte;

    /* Check if the PXE/PPE/PDE is valid */
    if (
#if (_MI_PAGING_LEVELS == 4)
       (MiAddressToPxe(Address)->u.Hard.Valid) &&
#endif
#if (_MI_PAGING_LEVELS >= 3)
       (MiAddressToPpe(Address)->u.Hard.Valid) &&
#endif
       (MiAddressToPde(Address)->u.Hard.Valid))
   {
       /* Check if the PTE is valid */
       TempPte = *MiAddressToPte(Address);
       ...
   }

Each iteration of the MMU table walk uses simple MiAddressTo macros such as the one below, which in turn rely on hard-coded static addresses.

/* Convert an address to a corresponding PTE */
#define MiAddressToPte(x) \
   ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + PTE_BASE))
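The macro above is the 32-bit form. On x64 the same idea carries over with 8-byte PTEs and 48 translated address bits. What follows is a minimal sketch rather than the kernel's code: every name is hypothetical, and the base 0xFFFFF680`00000000 is assumed to be the historical fixed x64 PTE base. It also illustrates the self-referencing entry: looking up the PDE is simply the PTE lookup applied twice.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the historical fixed x64 PTE base (before randomization). */
#define SKETCH_PTE_BASE 0xFFFFF68000000000ULL

/* Keep only the 48 bits that the MMU actually translates. */
#define SKETCH_VA_MASK  0x0000FFFFFFFFFFFFULL

/* A canonical x64 address has bits 63..47 all equal. */
int sketch_is_canonical(uint64_t va)
{
    uint64_t top = va >> 47;
    return top == 0 || top == 0x1FFFF;
}

/* Address of the PTE that maps 'va': one 8-byte entry per 4 KB page. */
uint64_t sketch_address_to_pte(uint64_t va)
{
    assert(sketch_is_canonical(va));
    return (((va & SKETCH_VA_MASK) >> 12) << 3) + SKETCH_PTE_BASE;
}

/* The PDE for 'va' is the PTE that maps the page table holding va's PTE,
   so the self-map lets us reuse the same computation. */
uint64_t sketch_address_to_pde(uint64_t va)
{
    return sketch_address_to_pte(sketch_address_to_pte(va));
}
```

Under these assumptions, the PDE of address 0 comes out as 0xFFFFF6FB`40000000, which is the PTE array base run through its own translation.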

As attackers have figured out, however, this “elegance” has notable security implications. For example, if a write-what-where is mitigated by the existence of a read-only page (which, in Linux, would often imply requiring the WP bit to be disabled in CR0), a Windows attacker can simply direct the write-what-where attack toward the pre-computed PTE address in order to disable the WriteProtect bit, and then follow that by the actual write-what-where on the data.

Similarly, if an exploit is countered by SMEP, which causes an access violation when a Ring 0 Code Segment’s Instruction Pointer (CS:RIP) points to a Ring 3 PTE, the exploit can simply use a write-what-where (if one exists), or ROP (if the stack can be controlled), in order to mark the target user-mode PTE containing malicious code as a Ring 0 page.

Other PTE-based attacks are also possible, such as by using write-what-where vulnerabilities to redirect a PTE to a different physical address which is controlled by the attacker (undocumented APIs available to Administrators will leak the physical address space of the OS, and some physical addresses are also leaked in the registry or CPU registers).

Ultimately, the list goes on and on, and many excellent papers exist on the topic. It’s clear that Microsoft needed to address this limitation of the operating system (or clever optimization, as some would call it). Unfortunately, a number of obstacles exist:

  • Using virtual-mapped tables based on the EPROCESS structure (as Linux and OS X do) causes significant performance impact, as pointer chasing the different tables now causes cache misses and page translations. This becomes even worse when thinking about multi-processor systems, and the cache waste that this causes (where the TLB may end up getting filled with the various global (locked) pages corresponding to the page tables of various processes, instead of only the current process).
  • Changing the address of the PTE array has a number of compatibility concerns, as PTE_BASE is actually documented in ntddk.h, part of the Windows Driver Kit. Additionally, once the new address is discovered, attackers can simply adjust their exploits to use the appropriate static address based on the version of the operating system.
  • Randomizing the address of the PTE array means that Windows memory manager functions can no longer use a static constant or preprocessor definition for the address, but must instead access a global address which contains it. Forcing every processor to dereference a single global address every single time a virtual memory operation (allocation, protection, page walk, fault, etc…) is performed is a significant performance hit, especially on multi-socket, NUMA systems.
  • Dealing with the global variable problem above by creating cache-aligned copies of the address in a per-processor structure causes a waste of precious kernel storage space (for example, assuming a 64-byte cache line and 640 processors, 40KB of physical memory are used to replicate the variable most efficiently). However, on NUMA systems, one would also want the page containing this data to be local to the node, so we might imagine an overhead of 4KB per socket. In practice, this wouldn’t be quite as bad, as Windows already has a per-NUMA-node-allocated, per-processor, cache-aligned list of critical kernel variables: the Kernel Processor Control Block (KPRCB).
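To make the cost in the last bullet concrete, here is a small sketch of per-processor, cache-line-padded copies of a single variable. The type and function names are invented for illustration, and the 64-byte line and 640-processor figures are simply the ones used in the text; this is not how the KPRCB is actually laid out.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64   /* assumed cache-line size, as in the text */
#define SKETCH_MAX_CPUS   640  /* processor count from the text's example */

/* One copy of the base per processor, padded to a full cache line so that
   no two processors ever share a line. */
typedef struct {
    alignas(SKETCH_CACHE_LINE) uint64_t pte_base;
} sketch_per_cpu_base;

static sketch_per_cpu_base sketch_bases[SKETCH_MAX_CPUS];

/* Replicate the value chosen at boot into every processor's slot. */
void sketch_publish_pte_base(uint64_t base)
{
    for (size_t cpu = 0; cpu < SKETCH_MAX_CPUS; cpu++)
        sketch_bases[cpu].pte_base = base;
}

/* Each processor reads only its own slot. */
uint64_t sketch_read_pte_base(size_t cpu)
{
    assert(cpu < SKETCH_MAX_CPUS);
    return sketch_bases[cpu].pte_base;
}

/* Total memory spent on the replicas: 640 * 64 bytes = 40 KB. */
size_t sketch_replica_bytes(void)
{
    return sizeof(sketch_bases);
}
```

Only 8 of every 64 bytes carry data; the rest is padding, which is the waste the bullet describes.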

In a normal world, the final bullet would probably be the most efficient solution: sacrificing what is today a modest amount of physical memory (or re-using such a structure) for dealing with effects of global access. Yet, locating this per-processor data would still not be cheap: most operating systems access such a structure by relying on a segment register such as FS or GS on x86 and x64 systems, or use special CPU registers such as those located on CP15 inside of ARM processors. At the very least, this causes more pointer dereferences and potentially complex microcode accesses. But if we own the compiler and the output format, can’t we think outside the box?

Dynamic Relocation Generation

When the Portable Executable (PE) file format was created, its designers realized an important issue: if compiled code made absolute references to data or functions, these hardcoded pointer values might become invalid if the operating system loaded the executable binary at a different base address than its preferred address. Originally a corner case, the advent of user-mode ASLR made this a common occurrence and new reality.

In order to deal with such rebasing operations, the PE format includes the definition of a special data directory entry called the Relocation Table Directory (IMAGE_DIRECTORY_ENTRY_BASERELOC). In turn, this directory includes a number of tables, each of which is an array of entries. Each entry ultimately describes the offset of a piece of code that is accessing an absolute virtual address, and the adjustment that is needed to fix up the address. On a modern x64 binary, the only possible fixup is an absolute delta (increment or decrement), but more exotic architectures such as MIPS and ARM had different adjustments based on how absolute addresses were encoded on such processors.
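As a rough illustration of how a loader consumes one such table, the sketch below applies a single relocation block to an in-memory image. The block layout (a page RVA and a block size, followed by 16-bit entries with a 4-bit type and a 12-bit offset) and the type values 0 and 10 come from the PE specification; the function name and the bounds handling are made up for this example, and only the x64 absolute-delta fixup is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Relocation types from the PE specification. */
#define SKETCH_REL_BASED_ABSOLUTE 0   /* padding entry, skipped */
#define SKETCH_REL_BASED_DIR64    10  /* add the delta to a 64-bit value */

/* Applies one relocation block and returns the number of fixups applied.
   'block' points at an 8-byte header (page RVA, block size in bytes)
   followed by 16-bit entries. Assumes a little-endian host, like the
   platforms PE images target. */
size_t sketch_apply_reloc_block(uint8_t *image, size_t image_size,
                                const uint8_t *block, int64_t delta)
{
    uint32_t page_rva, block_size;
    size_t applied = 0;

    memcpy(&page_rva, block, sizeof(page_rva));
    memcpy(&block_size, block + 4, sizeof(block_size));
    assert(block_size >= 8);

    for (uint32_t pos = 8; pos + 2 <= block_size; pos += 2) {
        uint16_t entry;
        memcpy(&entry, block + pos, sizeof(entry));

        if ((entry >> 12) != SKETCH_REL_BASED_DIR64)
            continue;   /* padding, or a type this sketch ignores */

        size_t site = (size_t)page_rva + (entry & 0x0FFF);
        assert(site + 8 <= image_size);

        /* Read, adjust, and write back the absolute address. */
        uint64_t value;
        memcpy(&value, image + site, sizeof(value));
        value += (uint64_t)delta;
        memcpy(image + site, &value, sizeof(value));
        applied++;
    }
    return applied;
}
```

A real loader would repeat this for every block in the directory, with the delta being the difference between the actual and preferred image base.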

These relocations work great to adjust hardcoded virtual addresses that correspond to code or data within the image itself – but if there is a hard-coded access to 0xC0000000, an address which the compiler has no understanding of, and which is not part of the image, it can’t possibly emit relocations for it – this is a meaningless data dereference. But what if it could?

In such an implementation, all accesses to a particular magic hardcoded address could be described as such to the compiler, which could then work with the linker to generate a similar relocation table – but instead of describing addresses within the image, it would describe addresses outside of the image, which, if known and understood by the PE parser, could be adjusted to the new location of the hard-coded external data address. By doing so, compiled code would continue to access what appears to be a single literal value, and no global variable would ever be needed, cancelling out any disadvantages associated with the randomization of this address.

Indeed, the new build of the Microsoft C Compiler, which is expected to ship with Visual Studio 15 (now in preview), supports a special annotation that can be associated with constant values that correspond to external virtual addresses. Upon usage of such a constant, the compiler will ensure that accesses are done in a way that does not “break up” the address, but rather causes its absolute value to be expressed in code (i.e.: “mov rax, 0xC0000000”). Then, the linker collects the RVAs of such locations and builds structures of type IMAGE_DYNAMIC_RELOCATION, as shown below:

typedef struct _IMAGE_DYNAMIC_RELOCATION {
   PVOID Symbol;
   DWORD BaseRelocSize;
// IMAGE_BASE_RELOCATION BaseRelocations[0];
} IMAGE_DYNAMIC_RELOCATION, *PIMAGE_DYNAMIC_RELOCATION;

When all entries have been written in the image, an IMAGE_DYNAMIC_RELOCATION_TABLE structure is written, with the type below:

typedef struct _IMAGE_DYNAMIC_RELOCATION_TABLE {
   DWORD Version;
   DWORD Size;
// IMAGE_DYNAMIC_RELOCATION DynamicRelocations[0];
} IMAGE_DYNAMIC_RELOCATION_TABLE, *PIMAGE_DYNAMIC_RELOCATION_TABLE;

The RVA of this table is then written into the IMAGE_LOAD_CONFIG_DIRECTORY, which has been extended with the new field DynamicValueRelocTable and whose size has now been increased:

   ULONGLONG DynamicValueRelocTable;         // VA
} IMAGE_LOAD_CONFIG_DIRECTORY64, *PIMAGE_LOAD_CONFIG_DIRECTORY64;

Now that we know how the compiler and linker work together to generate the data, the next question is: who processes it?

Runtime Dynamic Relocation Processing

In the Windows boot architecture, the kernel is a standard PE file loaded by the boot loader, so it is the boot loader’s responsibility to process the kernel’s import table, load the other required dependencies, generate the security cookie, and process the static (standard) relocation table. However, the boot loader does not have all the information required by the memory manager in order to randomize the address space as Windows 10 Redstone 1 now does – this remains the purview of the memory manager. Therefore, as far as the boot loader is concerned, the static PTE_BASE address is still the one to use, and indeed, early phases of boot still use this address (and the associated PDE/PPE/PXE base addresses and self-referencing entry).

This clearly implies that it is not considered part of a PE loader’s job to process the dynamic relocation table, but rather the job of the component that creates the dynamic address space map, which has now been enlightened with this knowledge. In the most recent builds, this is done by MiRebaseDynamicRelocationRegions, which eventually calls MiPerformDynamicFixups. This routine locates the PE file’s Load Configuration Directory, gets the RVA (now a VA, thanks to relocations done by the boot loader) of the Dynamic Relocation Table, and begins parsing it. At this moment, it only supports version 1 of the table. Then, it loops through each entry, adjusting the absolute address with the required delta to point to the new PTE_BASE address.
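The fixup loop described above can be pictured roughly as follows. This is only a sketch under stated assumptions, not a reconstruction of MiPerformDynamicFixups: the function name, the flat array of site RVAs (standing in for the already-decoded table entries), and the range check are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rewrites every recorded 64-bit constant that points into the old region
   so that it points at the same offset inside the new one. Returns the
   number of sites patched. */
size_t sketch_perform_dynamic_fixups(uint8_t *image, size_t image_size,
                                     const uint32_t *site_rvas, size_t count,
                                     uint64_t old_base, uint64_t new_base,
                                     uint64_t region_size)
{
    size_t patched = 0;

    for (size_t i = 0; i < count; i++) {
        uint64_t value;

        assert((size_t)site_rvas[i] + sizeof(value) <= image_size);
        memcpy(&value, image + site_rvas[i], sizeof(value));

        /* Leave anything that does not point into the old region alone.
           Unsigned wrap-around makes values below old_base fail this too. */
        if (value - old_base >= region_size)
            continue;

        value = new_base + (value - old_base);
        memcpy(image + site_rvas[i], &value, sizeof(value));
        patched++;
    }
    return patched;
}
```

For a “mov rax, imm64” instruction, the recorded site would be the RVA of the 8-byte immediate, so the instruction keeps its shape and only the literal changes.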

It is important to note that the memory manager only calls MiPerformDynamicFixups on the binaries that it knows require such fixups due to the use of PTE_BASE: the kernel (ntoskrnl.exe) and the HAL (hal.dll). As such, this is not (yet) intended as a generic mechanism for all PE files to allow dynamic relocations of hard-coded addresses toward ASLRed regions of memory – but rather a highly vertically integrated feature specifically designed for dealing with the randomization of the PTE array, and the components that have hardcoded dependencies on it.

As such, even if one were to discover the undocumented annotation which allows the new version of the compiler to generate such tables, no component would currently parse such a table.

Sneaky Side Effects

A few interesting details are of note in the implementation. The first is that the initial version of the implementation, which shipped in build 14316, contained a static address in the loader block, which corresponded to the PTE base address that the loader had selected, and was then overwritten by a new fixed PTE base address (0xFFFFFA00`00000000 on x64).

The WDK, which contains the PTE_BASE address for developers to see (but apparently not use!) also contained this new address, and the debugger was updated to support it. This was presumably done to gauge the impact of changing the address in any way – and indeed we can see release notes referring to certain AV products breaking around the time this build was released. I personally noticed this change by disassembling MmGetPhysicalAddress to see if the PTE base had been changed (a normal part of my build analysis).

The next build, 14332, seemingly contained no changes: reverse engineering of the function showed usage of the same address once again. However, as I was playing around with the !pte extension in the debugger, I noticed that a new address was now being used – and that on a separate machine, this address was different again. Staring at IDA/Hex-Rays, I could not understand how this was possible, as MmGetPhysicalAddress was clearly using the same new base as 14316!

It was only once I unassembled the function in WinDBG that I noticed something strange – the base address had been modified to a different value. This led me to the hunt for the dynamic relocation table mechanism. But this is an important point about this implementation – it offers a small amount of “security through obscurity” as a side-effect: attackers or developers attempting to ‘dynamically discover’ the value of the PTE base by analyzing the kernel file on disk will hit a roadblock – they must look at the kernel file in memory, once relocations have been made. Spooky!

Conclusion

It is often said that all software engineering decisions and features lie somewhere between the four quadrants of security, performance, compatibility and functionality. As such, as an example, the only way to increase security without affecting functionality is to impact compatibility and performance. Although randomizing the PTE_BASE does indeed cause potential compatibility issues, we’ve seen here how control of the compiler (and the underlying linked object file) can allow implementers to “cheat” and violate the security quadrant, in a similar way that silicon vendors can often work with operating system vendors in order to create overhead-free security solutions (one major advantage that Apple has, for example).

How Control Flow Guard Drastically Caused Windows 8.1 Address Space and Behavior Changes

Windows 8.1 radically changes the address space layout of the system by finally removing the 44-bit limitation which I described in one of the earliest blog posts on this website (and which Wikipedia even links to!). This is a little-known detail about the operating system, and an odd thing for Microsoft not to emphasize more, especially given that 8.1 is considered a “patch” of Windows 8.

Now, you may think that going from 16 TB to 256 TB is a meaningless change since no applications currently use even a fraction of that space, but the main benefit of this change is not the ability to allocate additional memory, but rather the increased entropy available for Address Space Layout Randomization (ASLR), especially given that Windows 8 introduced High Entropy ASLR (HEASLR), Top-down Randomization and Anonymous Memory Randomization.

Additionally, another key change was done in Windows 8.1 that is not mentioned anywhere. As Pavel Lebedinsky, one of the lead SDETs on the Memory Manager and an extremely helpful individual indicated on one of the blog posts from Mark Russinovich:

1. Reserved memory does contribute to commit charge, because the memory manager charges commit for pagetable space necessary to map the entire reserved range. On 64 bit this can be a significant number (reserving 1 TB of memory will consume approximately 2 GB of commit).

This means that attempting to reserve the full 8 TB of memory on Windows 7 results in 16 GB of commit, which is beyond most people’s commit limit, especially at the time. In Windows 8.1, reserving the full 128 TB at the same rate would result in 256 GB of commit being used, which only a beefy server would tolerate. While such large memory reservations are unusual, they do have usefulness in certain scenarios related to security and low-level testing. This Windows behavior prevented such reservations from reliably working, but in Windows 8.1, the limitation has been removed!
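The 2 GB per terabyte figure falls out of simple arithmetic: one 8-byte PTE is needed for every 4 KB page. The helper below is a hypothetical sketch of that calculation, counting only the leaf page tables.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL  /* x64 small page */
#define SKETCH_PTE_SIZE  8ULL     /* bytes per x64 PTE */

/* Bytes of leaf page tables needed to map 'reserved_bytes' of address
   space. Upper-level tables add well under 1% and are ignored here. */
uint64_t sketch_pagetable_commit(uint64_t reserved_bytes)
{
    assert(reserved_bytes % SKETCH_PAGE_SIZE == 0);
    return (reserved_bytes / SKETCH_PAGE_SIZE) * SKETCH_PTE_SIZE;
}
```

Feeding it 1 TB yields 2 GB, and 8 TB yields 16 GB, matching the numbers quoted above.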

Indeed, you can easily test this by using the TestLimit tool from the Windows Internals Book, and run it with the -r option (and preferably with a large enough block size). Here’s a screenshot of hitting the 128 TB reservation:

[Screenshot: testlimit]

And here’s the resulting view in VMMap, which does not show the expected page table commit charge, but rather a much smaller size (256 MB).

[Screenshot: memvm]

So why did Microsoft change this behavior in Windows 8.1? Well, Windows 10, as well as Windows 8.1 Update 3 (November Update) make this clear. As I previously tweeted, these OS versions enable Control Flow Guard (CFG), a feature that lay dormant in the first versions of Windows 8.1. In order to function, CFG requires the use of optimized bitmaps in order to determine the validity of indirect calls, and on 64-bit Windows, this bitmap requires 2 TB of space. Not only would this cut the Windows 8 address space by 25%, it would’ve also resulted in 4 GB of per-process commit!

Here’s a screenshot of Process Hacker showing how all CFG-enabled processes now use 2 TB of virtual address space:

[Screenshot: 2tb]

The final effect of this change from 8 TB to 128 TB is that the kernel address space layout has significantly changed. And sadly, the !address extension in WinDBG is broken and continues to show the Windows 8 address space layout (which I expanded on during my Blackhat 2013 talk), while the Windows Internals book is stuck on Windows 7 and doesn’t even cover Windows 8 or higher.

Therefore, I publish below what I believe to be the only public source of information on the Windows 8.1 x64 memory layout. One of the benefits of this new layout is that the first 5 or 6 nibbles of an address now make it extremely easy to determine where it’s coming from. For example, 0xFFFFD… is a kernel stack, 0xFFFFC… is paged pool, 0xFFFFF8… is a loaded image (driver or kernel), and 0xFFFFE… is nonpaged pool.

Start              End                Size    Description
FFFF0000`00000000  FFFF07FF`FFFFFFFF  8 TB    Memory Hole
FFFF0800`00000000  FFFFAFFF`FFFFFFFF  168 TB  Unused Space
FFFFB000`00000000  FFFFBFFF`FFFFFFFF  16 TB   System Cache
FFFFC000`00000000  FFFFCFFF`FFFFFFFF  16 TB   Paged Pool
FFFFD000`00000000  FFFFDFFF`FFFFFFFF  16 TB   System PTEs
FFFFE000`00000000  FFFFEFFF`FFFFFFFF  16 TB   Nonpaged Pool
FFFFF000`00000000  FFFFF67F`FFFFFFFF  6.5 TB  Unused Space
FFFFF680`00000000  FFFFF6FF`FFFFFFFF  512 GB  PTE Space
FFFFF700`00000000  FFFFF77F`FFFFFFFF  512 GB  HyperSpace
FFFFF780`00000000  FFFFF780`00000FFF  4 KB    Shared User Data
FFFFF780`00001000  FFFFF780`BFFFFFFF  ~3 GB   System PTE WS
FFFFF780`C0000000  FFFFF780`FFFFFFFF  1 GB    WS Hash Table
FFFFF781`00000000  FFFFF791`3FFFFFFF  65 GB   Paged Pool WS
FFFFF791`40000000  FFFFF799`3FFFFFFF  32 GB   WS Hash Table
FFFFF799`40000000  FFFFF7A9`7FFFFFFF  65 GB   System Cache WS
FFFFF7A9`80000000  FFFFF7B1`7FFFFFFF  32 GB   WS Hash Table
FFFFF7B1`80000000  FFFFF7FF`FFFFFFFF  314 GB  Unused Space
FFFFF800`00000000  FFFFF8FF`FFFFFFFF  1 TB    System View PTEs
FFFFF900`00000000  FFFFF97F`FFFFFFFF  512 GB  Session Space
FFFFF980`00000000  FFFFFA70`FFFFFFFF  1 TB    Dynamic VA Space
FFFFFA80`00000000  FFFFFAFF`FFFFFFFF  512 GB  PFN Database
FFFFFFFF`FFC00000  FFFFFFFF`FFFFFFFF  4 MB    HAL Heap
Table describing the various 64-bit memory ranges in Windows 8.1
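The nibble-prefix rule of thumb mentioned before the table can be written down as a tiny classifier. This is a sketch only: the enum and function names are invented, it covers just the prefixes called out in the text, and it expects a kernel-mode (upper-half) address.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    SKETCH_REGION_UNKNOWN,
    SKETCH_REGION_SYSTEM_CACHE,   /* 0xFFFFB... */
    SKETCH_REGION_PAGED_POOL,     /* 0xFFFFC... */
    SKETCH_REGION_SYSTEM_PTES,    /* 0xFFFFD..., where kernel stacks live */
    SKETCH_REGION_NONPAGED_POOL,  /* 0xFFFFE... */
    SKETCH_REGION_LOADED_IMAGE    /* 0xFFFFF8... */
} sketch_region;

/* Classifies a kernel-mode address by its leading nibbles. */
sketch_region sketch_classify(uint64_t va)
{
    assert((va >> 47) == 0x1FFFF);  /* upper-half canonical address */

    /* The six-nibble prefix is more specific, so test it first. */
    if ((va >> 40) == 0xFFFFF8)
        return SKETCH_REGION_LOADED_IMAGE;

    switch (va >> 44) {             /* five-nibble prefixes */
    case 0xFFFFB: return SKETCH_REGION_SYSTEM_CACHE;
    case 0xFFFFC: return SKETCH_REGION_PAGED_POOL;
    case 0xFFFFD: return SKETCH_REGION_SYSTEM_PTES;
    case 0xFFFFE: return SKETCH_REGION_NONPAGED_POOL;
    default:      return SKETCH_REGION_UNKNOWN;
    }
}
```

Regions under the 0xFFFFF… prefix other than loaded images (PTE space, HyperSpace, and so on) fall through to the unknown case here and would need the finer-grained ranges from the table.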

PE Trick #1: A Codeless PE Binary File That Runs

Introduction

One of the annoying things about my Windows Internals/Security research is that every single component and mechanism I’ve looked at in the last six months has ultimately resulted in me finding very interesting design bugs, which I must now wait on Microsoft to fix before being able to talk further about them. As such, I have to take a small break from kernel-specific research (although I hope to lift the veil over at least one issue at the No Such Conference in Paris this year). And so, in the next few blog posts, probably inspired by having spent too much time talking with my friend Ange Albertini, I’ll be going over some neat PE tricks.

Challenge

Write a portable executable (PE/EXE) file which can be spawned through a standard CreateProcess call and will result in STATUS_SUCCESS being returned as well as a valid Process Handle, but will not:

  • Contain any actual x86/x64 assembly code section (i.e.: the whole PE should be read-only, no +X section)
  • Run a single instruction of what could be construed as x86 assembly code, which is part of the file itself (i.e.: random R/O data should not somehow be forced into being executed as machine code)
  • Crash or make any sort of interactive/visible notice to the user, event log entry, or other error condition.

Interestingly, this was actually a real-world situation that I was asked to provide a solution for, not a mere mental exercise. The idea was to be able to prove, in a court of law, that no “foreign” machine code had executed as a result of this executable file having been launched (i.e.: obviously the kernel ran some code, and the loader ran too, but all this is pre-existing Microsoft OS code). Yet, the PE file had to not only be valid, but to also return a valid process handle to the caller.

Solution

HEADER:00000000 ; IMAGE_DOS_HEADER
HEADER:00000000
HEADER:00000000 .686p
HEADER:00000000 .mmx
HEADER:00000000 .model flat
HEADER:00000000
HEADER:00000000 ; Segment type: Pure data
HEADER:00000000 HEADER segment page public 'DATA' use32
HEADER:00000000 assume cs:HEADER
HEADER:00000000 __ImageBase dw 5A4Dh ; PE magic number
HEADER:00000002 dw 0 ; Bytes on last page of file
HEADER:00000004 ; IMAGE_NT_HEADERS
HEADER:00000004 dd 4550h ; Signature
HEADER:00000008 ; IMAGE_FILE_HEADER
HEADER:00000008 dw 14Ch ; Machine
HEADER:0000000A dw 0 ; Number of sections
HEADER:0000000C dd 0 ; Time stamp
HEADER:00000010 dd 0 ; Pointer to symbol table
HEADER:00000014 dd 0 ; Number of symbols
HEADER:00000018 dw 0 ; Size of optional header
HEADER:0000001A dw 2 ; Characteristics
HEADER:0000001C ; IMAGE_OPTIONAL_HEADER
HEADER:0000001C dw 10Bh ; Magic number
HEADER:0000001E db 0 ; Major linker version
HEADER:0000001F db 0 ; Minor linker version
HEADER:00000020 dd 0 ; Size of code
HEADER:00000024 dd 0 ; Size of initialized data
HEADER:00000028 dd 0 ; Size of uninitialized data
HEADER:0000002C dd 7FBE02F8h ; Address of entry point
HEADER:00000030 dd 0 ; Base of code
HEADER:00000034 dd 0 ; Base of data
HEADER:00000038 dd 400000h ; Image base
HEADER:0000003C dd 4 ; Section alignment
HEADER:00000040 dd 4 ; File alignment
HEADER:00000044 dw 0 ; Major operating system version
HEADER:00000046 dw 0 ; Minor operating system version
HEADER:00000048 dw 0 ; Major image version
HEADER:0000004A dw 0 ; Minor image version
HEADER:0000004C dw 4 ; Major subsystem version
HEADER:0000004E dw 0 ; Minor subsystem version
HEADER:00000050 dd 0 ; Reserved 1
HEADER:00000054 dd 40h ; Size of image
HEADER:00000058 dd 0 ; Size of headers
HEADER:0000005C dd 0 ; Checksum
HEADER:00000060 dw 2 ; Subsystem
HEADER:00000062 dw 0 ; Dll characteristics
HEADER:00000064 dd 0 ; Size of stack reserve
HEADER:00000068 dd 0 ; Size of stack commit
HEADER:0000006C dd 0 ; Size of heap reserve
HEADER:00000070 dd 0 ; Size of heap commit
HEADER:00000074 dd 0 ; Loader flag
HEADER:00000078 dd 0 ; Number of data directories
HEADER:0000007C HEADER ends
HEADER:0000007C end

As per Corkami, in Windows 7 and higher, you’ll want to make sure that the PE is at least 252 bytes on x86, or 268 bytes on x64.

Here’s a 64 byte Base64 representation of a .gz file containing the 64-bit compatible (268 byte) executable:

H4sICPwJKlQCAHguZXhlAPONYmAIcGVg8GFkQANMDNxoYj+Y9tUjeA4MLECSBc5HsB1QTBk6AAB
e6Mo9DAEAAA==

Caveat

There is one non-standard machine configuration in which this code will actually still crash (while still returning STATUS_SUCCESS from CreateProcess). This is left as an exercise to the reader.

Conclusion

The application executes and exits successfully. But as you can see, no code is present in the binary. How does it work? Do you have any other solutions which satisfy the challenge?

The Case Of The Bloated Reference Count: Handle Table Entry Changes in Windows 8.1

Introduction

As part of my daily reverse engineering and peering into Windows Internals, I started noticing a strange effect in Windows 8.1 whenever looking at the reference counts of various objects with tools such as WinDBG, Process Explorer, and Process Hacker: seemingly gigantic values on x64 Windows, and smaller, yet still incredibly large values on x86.

For the uninitiated, reference counts (internally called pointer counts), and their cousin handle counts, are the Windows kernel’s way of keeping track of open instances to a certain object (such as a file, registry key, or mutex) in order to implement automatic cleanup and garbage collection. Windows system tools such as Process Explorer or Process Hacker often have handy interfaces for looking at the objects to which a process currently has references, by analyzing the process handle table.

Looking at Opened Handles and their Properties

In the screenshot below, you can see me looking at the first few handles of the Windows shell, Explorer.exe. Particularly, I am interested in the “DBWinMutex” mutex, at handle 0x44.

What this mutex does is gate access to Windows’ debug buffer, used by the OutputDebugString API, so it’s likely that you’ll see it used in many other processes as well. Since Explorer has at least one component using that API, it has a handle opened to it. Let’s go find out how many other components have a handle to it, by double-clicking and looking at its properties.

Pretty striking, isn’t it? While the handle count, which keeps track of actual handles to the object (implying that (Zw)OpenEvent was used to obtain the reference) is 14 and makes sense given the large number of processes that use the debug buffer to print various trace messages, the reference count, which is meant to include those handles plus any other additional internal kernel component references (which can bypass handles altogether and use the ObReferenceObject family of APIs to safely reference an object), is actually 491351! While it’s technically possible for such a large number of kernel references to exist to the object, it’s highly unlikely, and if one checks the reference counts on other objects, similarly large numbers appear. What’s going on?

Using the Windows Debugger to Dump Object Information

First, let’s make sure this isn’t a bug in Process Explorer. Tools that peer into undocumented structures are often prone to breaking on subtle changes in the kernel, so I like to use the Windows Kernel Debugger (WinDBG) to validate what user-mode tools are showing. After all, the debugger dumps the raw memory of the object, which is the ground truth. As you can see below, we can use the handy !object extension to go find the object.

32767 Shades of Reference Bias

As you can see, we’re not really getting anywhere here – WinDBG shows an equally large value (458,584) although it’s not quite the same as Process Explorer’s. In fact, it’s exactly:

491351 – 458584 = 32767 (0x7FFF)

This can’t be a coincidence, can it? In fact, looking at other objects in Process Explorer, and comparing the reference count with WinDBG shows a similar pattern – not only are the numbers huge, but Process Explorer is always off by 0x7FFF. I also noticed a second pattern – the more handles that the object had, the bigger the reference count was, and always by a factor of around, or almost, 32767. In this case, dividing 458584 references by 14 handle counts gives us 32756 references-per-handle – close enough. Doing the opposite math on 491351 references gives us 14.995 handles.

Having worked on Process Explorer previously, I knew that as part of the code which handles the properties dialog and queries information on the object, the tool opens its own handle to the object, temporarily creating 15 handles. Something became clear: there is now a bias in the reference count of objects, based on the number of handles. However, this bias is not exactly 32767, so something else must be going on.

Globally Searching for Opened Handles with Process Explorer

On a hunch, I decided to take a look at what would happen if I used Process Explorer’s “Find Handle or DLL” functionality, which searches all handles, system-wide, in order to find any which contain the name that the user entered. Because Windows only returns a list of PIDs and Handle Values, Process Explorer then has to attach to the process associated with the PID (since handles are local to each process) and then open the handle so that it can query its name. Let’s see what the search returned:

Fourteen processes have handles open to the DBWinMutex object. Let’s see what happened to the reference count…

The reference count went down to 491337. Which happens to be – wait for it – exactly 14 references less than what we had before. Repeating the exercise a few more times perfectly reproduces this behavior. Each time a new search is done, 14 processes are found (with 1 handle each), and the reference count goes down by 14 again.

The Per-Handle Reference Bias Revealed

At this point, we can infer the following two patterns:

  • Each time a new handle is opened to an object, the reference count goes up by 0x7FFF, or 32767, on x64 Windows. On x86 Windows, the same behavior is seen by the way, but with 0x1F instead.
  • Each time an existing handle to an object is used, the reference count goes down by 1.
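These two observations can be captured in a toy model. The sketch below is not kernel code: the structure and function names are hypothetical, and it models only the two patterns inferred above, ignoring any references held by kernel components.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HANDLE_BIAS 0x7FFF  /* per-handle bias observed on x64 */

typedef struct {
    int64_t  pointer_count;  /* what the debugger reports as the reference count */
    uint32_t handle_count;   /* number of open handles */
} sketch_object;

/* Opening a handle raises the reference count by the full bias. */
void sketch_open_handle(sketch_object *obj)
{
    obj->handle_count++;
    obj->pointer_count += SKETCH_HANDLE_BIAS;
}

/* Each use of an existing handle lowers the reference count by one. */
void sketch_use_handle(sketch_object *obj)
{
    assert(obj->handle_count > 0);
    obj->pointer_count--;
}
```

In this model, 14 open handles give 14 × 32767 = 458738, and 154 subsequent uses bring the count down to 458584, the value WinDBG displayed. Whether that is the actual breakdown on the machine in question is an assumption, since other kernel references would shift the numbers.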

The last part in this exercise was trying to understand where this data is coming from. The last bullet point above suggests that there is some sort of per-handle reference count, so I used the !handle extension in WinDBG to locate the handle entry for Explorer’s (PID 4440 as seen earlier) handle to DBWinMutex (handle 0x44 as seen earlier). I used flag 2 to request the object information as well. As you’ll see below, this gave me the pointer to the handle table entry, which I’ve highlighted in green. We can then use WinDBG’s symbol information to dump the entry using the dt command with the _HANDLE_TABLE_ENTRY type inside the nt module.

As someone who has often dumped handle table entries in the debugger, the structure was striking to me, as it was very different from anything I had seen before. In fact, handle table entries only really stored two things before – the pointer to the object, and the granted access mask to the object. Yes, a few flags were used, but definitely nothing like we see above in Windows 8.1.

The New Handle Table Entry Format

Here are the big changes from previous versions of Windows, on x64:

  • Instead of storing the full 64-bit pointer to the object header, Windows now only stores a 44-bit pointer. The bottom four bits are inferred to be all zeroes as all 64-bit allocations, code, and stack locations are 16-byte aligned, while the top sixteen bits are inferred to be all ones, as defined by the amd64 architecture per the rules of canonical addresses (there must now be a dozen algorithms in Windows which rely on these bits having pre-defined, unchanging values!).
  • Three of the assumed bits are re-used to store the three handle attributes (inherited, audited, protected), while a fourth is used to store the lock bit for the handle entry.
  • Finally, the remaining 16 bits are now used to store an inverted reference count which keeps track of the number of times that a handle has been used by a process. This reference count begins at 0x7FFF and counts down toward zero with each additional reference made on the handle. The object’s reference count (i.e., the pointer count field in the object header) is biased by the inverted reference counts held in each handle to the object.
  • Because the access mask is only 25 bits if you ignore the generic access rights (which are always translated into specific rights), additional bits can be used for flags. One such bit is used, the others are spare.
  • This leaves an unused 32-bit value that was wasted for alignment purposes on earlier versions of Windows. In Windows 8.1, this is now used to store the TypeInfo field, which is the Object Type Index in the Object Type Index Table (nt!ObTypeIndexTable). Dereferencing this index quickly reveals the object type for this handle, without having to even look at the object header.
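
To make the arithmetic above concrete, here is a minimal C sketch of how a debugger script might decode such an entry. The bit positions (lock bit at bit 0, the count in bits 1-16, attributes in bits 17-19, the 44 pointer bits on top) and every name in it are assumptions made for illustration; check them against the dt output on your own build before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the low 64 bits of an x64 handle table entry
 * (illustrative only, verify with dt nt!_HANDLE_TABLE_ENTRY):
 *   bit  0      lock bit
 *   bits 1-16   inverted reference count (starts at 0x7FFF)
 *   bits 17-19  handle attributes
 *   bits 20-63  the 44 stored bits of the object header pointer */
#define HTE_REFCNT_SHIFT   1
#define HTE_REFCNT_MASK    0xFFFFu
#define HTE_POINTER_SHIFT  20
#define HTE_INITIAL_REFCNT 0x7FFFu

/* Rebuild the full pointer: restore the four zero alignment bits and
 * the sixteen one bits at the top of a canonical kernel address. */
static uint64_t hte_object_header(uint64_t entry)
{
    uint64_t bits44 = entry >> HTE_POINTER_SHIFT;
    return (bits44 << 4) | 0xFFFF000000000000ull;
}

/* Number of times the handle has been used since it was opened. */
static unsigned hte_use_count(uint64_t entry)
{
    unsigned remaining =
        (unsigned)((entry >> HTE_REFCNT_SHIFT) & HTE_REFCNT_MASK);
    return HTE_INITIAL_REFCNT - remaining;
}

/* Hypothetical helper: build an entry from a header address and the
 * remaining inverted count, so the decoders can be exercised. */
static uint64_t hte_make(uint64_t header, unsigned remaining)
{
    assert((header & 0xF) == 0);  /* object headers are 16-byte aligned */
    uint64_t bits44 = (header >> 4) & ((1ull << 44) - 1);
    return (bits44 << HTE_POINTER_SHIFT) |
           ((uint64_t)(remaining & HTE_REFCNT_MASK) << HTE_REFCNT_SHIFT) |
           1u;                    /* lock bit set, meaning "unlocked" here */
}
```

With these helpers, an entry built with a remaining count of 0x7FFF - 3 decodes to a use count of 3, which matches the "goes down by 1 on each use" pattern observed earlier.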

On x86 Windows, the structure is different, but the changes are semantically similar:

  • No assumptions can be made on the top bits, so the entry continues to store a pointer to the object header, in which the bottom 3 bits are re-used to store the lock bit and 2 of the handle attributes (inherited, audited) as all x86 allocations are 8 byte aligned.
  • Because the granted access mask is only 25 bits, the remaining 7 bits can now be used to store the missing attribute flag (protected), leaving 6 bits to store the reference count. As such, the reference count starts at 0x1F instead, on x86 systems.
  • There is no additional space lost due to alignment, so there is no space to store the TypeInfo field.

Conclusion

As you can see, Windows 8.1 not only introduces a major rewrite of the handle table entry format, but these seemingly internal data structure changes also have a visible side effect when using the Windows Debugger or other tools to analyze reference counts on objects, something which driver developers often have to do (as do support professionals when troubleshooting leaks).

Additionally, for forensic analysts, the fact that there is now a per-handle “reference count”, which Microsoft should’ve really called an inverted access count, allows one to get a very detailed understanding of the number of times a handle has been used (and thus perhaps glean insight into unusual uses of the handle).

On a final note, this is a really good example of the type of Windows Internals analysis that one can do without doing any actual “black room” reverse engineering – I didn’t have to open IDA a single time or look at a single line of assembly code to discover and understand this functionality. By merely interacting with the system, deducing logic, and looking at state changes, the behavior became clear. If you ever note any other interesting Windows functionality or behavior that you’ve never been able to explain, feel free to leave a comment!

New Security Assertions in “Windows 8”

Anyone reversing “Windows 8” will now find an unfamiliar piece of code whenever a list insertion operation is performed on a LIST_ENTRY:

.text:00401B65                 mov     edx, [eax]
.text:00401B67                 mov     ecx, [eax+4]
.text:00401B6A                 cmp     [edx+4], eax
.text:00401B6D                 jnz     SecurityAssertion
.text:00401B73                 cmp     [ecx], eax
.text:00401B75                 jnz     SecurityAssertion
....
.text:00401C55 SecurityAssertion:               
.text:00401C55
.text:00401C55                 push    3
.text:00401C57                 pop     ecx
.text:00401C58                 int     29h

Or, seen from Hex-Rays:

if ( ListEntry->Flink->Blink != ListEntry ||
     Blink->Flink != ListEntry )
{
  __asm { int     29h   } // Note that the "push 3" is lost
}

Dumping the IDT reveals just what exactly “INT 29h” is:

lkd> !idt 29

Dumping IDT:

29: 80d5409c nt!_KiRaiseSecurityCheckFailure

Which would indicate that Win8 now has a new kind of “ASSERT” statement that is present in retail builds, designed for checking against certain common security issues, such as corrupted/dangling list pointers.

Thankfully, Microsoft was nice enough to document where this is coming from, and I’ve even been told they want to encourage its use externally. Starting in “Windows 8”, if you leave NO_KERNEL_LIST_ENTRY_CHECKS undefined, the new LIST_ENTRY macros will add a line RtlpCheckListEntry(Entry); to verify the lists between operations. This expands to:

FORCEINLINE
VOID
RtlpCheckListEntry(
    _In_ PLIST_ENTRY Entry
    )
{
    if ((((Entry->Flink)->Blink) != Entry) ||
        (((Entry->Blink)->Flink) != Entry))
    {
        FatalListEntryError(
            (PVOID)(Entry),
            (PVOID)((Entry->Flink)->Blink),
            (PVOID)((Entry->Blink)->Flink));
    }
}
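
To see what this check buys you, here is a small portable sketch of the same idea outside the kernel. The list_entry type and function names are stand-ins I made up for illustration, and instead of failing fast the insert simply reports the corruption to its caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the Windows LIST_ENTRY structure. */
typedef struct list_entry {
    struct list_entry *flink;  /* next entry */
    struct list_entry *blink;  /* previous entry */
} list_entry;

static void list_init(list_entry *head)
{
    head->flink = head->blink = head;
}

/* Same test as RtlpCheckListEntry: both neighbours must point back. */
static bool list_entry_is_consistent(const list_entry *entry)
{
    return entry->flink->blink == entry && entry->blink->flink == entry;
}

/* Insert at the head; returns false where the real macros would fail fast. */
static bool list_insert_head_checked(list_entry *head, list_entry *entry)
{
    assert(head != NULL && entry != NULL);
    if (!list_entry_is_consistent(head))
        return false;
    list_entry *first = head->flink;
    entry->flink = first;
    entry->blink = head;
    first->blink = entry;
    head->flink = entry;
    return true;
}

/* Demo: simulate an overwrite of head.blink and confirm that the next
 * insertion is refused. Returns 1 when the corruption is detected. */
static int list_demo_detects_corruption(void)
{
    list_entry head, a, b, c, rogue;
    list_init(&head);
    if (!list_insert_head_checked(&head, &a)) return 0;
    if (!list_insert_head_checked(&head, &b)) return 0;
    list_init(&rogue);
    head.blink = &rogue;       /* attacker-controlled back link */
    return !list_insert_head_checked(&head, &c);
}
```

The check is cheap (two loads and two compares) because a healthy doubly-linked list is always redundant: each link can be validated from its neighbour.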

So what is FatalListEntryError?

FORCEINLINE
VOID
FatalListEntryError(
    _In_ PVOID p1,
    _In_ PVOID p2,
    _In_ PVOID p3
    )
{
    UNREFERENCED_PARAMETER(p1);
    UNREFERENCED_PARAMETER(p2);
    UNREFERENCED_PARAMETER(p3);

    RtlFailFast(FAST_FAIL_CORRUPT_LIST_ENTRY);
}

At last, we can see where the INT 29H (push 3) seems to be stemming from. In fact, RtlFailFast is then defined as:

//++
//VOID
//RtlFailFast (
//    _In_ ULONG Code
//    );
//
// Routine Description:
//
//    This routine brings down the caller immediately in the
//    event that critical corruption has been detected.
//    No exception handlers are invoked.
//
//    The routine may be used in libraries shared with user
//    mode and kernel mode.  In user mode, the process is
//    terminated, whereas in kernel mode, a
//    KERNEL_SECURITY_CHECK_FAILURE bug check is raised.
//
// Arguments
//
//    Code - Supplies the reason code describing what type
//           of corruption was detected.
//
// Return Value:
//
//     None.  There is no return from this routine.
//
//--
DECLSPEC_NORETURN
FORCEINLINE
VOID
RtlFailFast(
    _In_ ULONG Code
    )
{
    __fastfail(Code);
}

And finally, to complete the picture:

//
// Fast fail failure codes.
//
#define FAST_FAIL_RANGE_CHECK_FAILURE         0
#define FAST_FAIL_VTGUARD_CHECK_FAILURE       1
#define FAST_FAIL_STACK_COOKIE_CHECK_FAILURE  2
#define FAST_FAIL_CORRUPT_LIST_ENTRY          3
#define FAST_FAIL_INCORRECT_STACK             4
#define FAST_FAIL_INVALID_ARG                 5
#define FAST_FAIL_GS_COOKIE_INIT              6
#define FAST_FAIL_FATAL_APP_EXIT              7

#if _MSC_VER >= 1610
DECLSPEC_NORETURN
VOID
__fastfail(
    _In_ unsigned int Code
    );
#pragma intrinsic(__fastfail)
#endif

So there you have it, the new __fastfail intrinsic generates an INT 29H, at least on x86, and the preceding 8 security failures are registered by Windows — I assume driver developers and user application developers could define their own internal security codes as well, preferably starting with a high enough ID not to interfere with future codes Microsoft may choose to add.

The bugcheck, by the way, is defined as:

//
// MessageId: KERNEL_SECURITY_CHECK_FAILURE
//
// MessageText:
//
// A kernel component has corrupted a critical data structure.
// The corruption could potentially allow a malicious user to
// gain control of this machine.
//
#define KERNEL_SECURITY_CHECK_FAILURE ((ULONG)0x00000139L)

This is a great mechanism that should make security issues much more “visible” to users, even if it means taking the system down. Hopefully the new and improved blue screen of death — the Sad Face Of Sorrow (SFOS) — will give users more indication as to why their system had to be taken down, as the current implementation lacks the details needed to differentiate between a crash, and a security failure such as this.

Building the Lego Millennium Falcon: A Lesson in Security?

Not all of a reverse engineer’s life has to be about undoing; sometimes it is just as fun to build something from scratch, whether that means a new tool… or the Star Wars 30 Year Anniversary Lego Ultimate Collector’s Millennium Falcon! Over the course of the last three weeks, my best friend and I have spent countless hours building this magnificent model, which has over 5000 pieces, 91 “major” construction steps (with each step taking up to 30 sub-steps, sometimes 2x’d or 4x’d) and a landscape, 8×14″, 310 page instruction manual.

Late last night, we completed the final pieces of the hull, the radar dish, and the commemorative plaque (itself made of Lego). We had previously built the Imperial Star Destroyer (ISD) last year, but nothing in our Lego-building lives had ever quite come close to the work we put into this set. Complete full-size pictures after the entry.

Along the way, we both learnt some important facts about the Lego manufacturing process. For example, we had already noticed that our ISD set had some extra pieces, and that other people’s sets had different extra pieces, but we weren’t too sure what to make of it. This year, however, because we were missing exactly half the number of lever pieces and 1×2 “zit” pieces, we did some extra digging.

The first thing that boggles many Lego builders of such large sets is the arrangement of pieces within the bags. These pieces are not arranged in construction order. If you break all the bags and sort the pieces, there is nothing wrong with doing so! Whether or not that will save you time, however, is up for discussion. We did do a small sorting, mostly to separate hull pieces from thick pieces from greebling pieces, and that did seem to help a lot. However, I wouldn’t recommend spending 5 hours sorting the pieces as was done on Gizmodo.

By step 21 unfortunately, we noticed we were missing a piece, something that Lego says should never happen (more on this shortly). Step 22 also required that piece, and the next few steps required even more. I came up with an interesting idea: a missing bag. We counted the number of pieces we’d still need, and it came out to exactly the number of pieces we had used. In other words, half of the pieces were missing. Those pieces were in the same bag as the lever-looking pieces, and those too, after calculations, were missing half their number. By this point, we were sure that we were missing a bag, and went to check Lego’s site for help.

It turns out that Lego’s site does have a facility available for ordering missing set pieces, and for free! After putting in the appropriate information, including the set number and piece number, I filled out my address… and received a “Thank you”. Unfortunately, we never got any further confirmation, and the missing pieces have yet to arrive. Most interesting, however, was a notice on Lego’s website claiming that missing pieces are quite rare, because each set is precision-weighed, so sets with missing pieces get flagged for extra, human QA. At first sight, this makes it very unlikely for a set to be missing a piece. So how did we end up missing nearly 60 pieces? (Note to readers: the missing pieces are all hull greebling (decorative, not structural), and we duly marked down the steps at which they were required, so we can add them once we receive them).

The answer to that question only came to us once we completed construction of the Falcon. There it was, on our workshop table (my kitchen table): an opened bag, full of Lego pieces… which we had already used up, and which weren’t required! It then became pretty obvious to us that Lego has a major flaw in their weight-based reasoning: replacement! We couldn’t scientifically verify it, but the extra bag we had was of similar size and weight (being small pieces) to the second bag of levers and 1×2 pieces we needed. Evidently, a machine error (most probably) or human error caused an incorrect bag of pieces to be added to the set. At the QA phase, the set passed the weight tests, because this bag was of the same (or nearly the same) weight as the missing bag!

Furthermore, due to the fact both the ISD and the Falcon had pieces that were not included in the Appendix of piece counts (again, probably due to machine error while composing the set), their combined weight may have even pushed the weight of our set past the expected weight. It is unlikely that Lego would flag heavier sets for QA — at worst, the customer would get some free pieces of Lego. However, when that weight helps offset the weight of missing pieces, it can certainly become a problem. And when a bag of pieces is accidentally replaced by a similar bag, then weight measuring doesn’t do much at all.

Granted, a lot of our analysis is based on assumptions, but they certainly do check out. Lego says their primary method of checking sets is to weigh them, and we have an extra bag, and a missing bag, both of similar size and composition. The hypothesis seems valid, and perhaps a phone call to Lego will confirm it (if I don’t get the pieces soon, I certainly plan on doing that).

Ironically, this breakage of a QA test through replacement was similar to an interesting security question I received by mail recently: why is it a bad idea to use the Owner SID of an object as a way to authenticate the object, or its creator? It turns out this can leave you vulnerable to rename operations, such as on a file with write and delete access, allowing someone to impersonate the Owner SID but have their own data in the object. Breaking a security test by renaming an object, or breaking a quality assurance test by replacing a bag — the two stem from the same problem: bad design and simplistic assumptions.

And now, without further ado, here’s a link to our construction pictures.

Behind Windows x64’s 44-bit Virtual Memory Addressing Limit

The era of 64-bit computing is finally upon the consumer market, and what was once a rare hardware architecture has become the latest commodity in today’s processors. 64-bit processors promise not only a larger number of registers and internal optimizations, but, perhaps most importantly, access to a full 64-bit address space, increasing the maximum amount of addressable memory from 32 bits to 64 bits, or from 4GB to 16EB (Exabytes, about 17 billion GBs). Although previous solutions such as PAE enlarged the physically addressable limit to 36 bits, they were architectural “patches” and not real solutions for increasing the memory capabilities of hungry workloads or applications.

Although 16EB is a copious amount of memory, today’s computers, as well as tomorrow’s foreseeable machines (at least in the consumer market) are not yet close to requiring support for that much memory. For these reasons, as well as to simplify current chip architecture, the AMD64 specification (which Intel used for its own implementation of x64 processors, but not Itanium) currently only supports 48 bits of virtual address space — requiring all other 16 bits to be set to the same value as the “valid” or “implemented” bits, resulting in canonical addresses: the bottom half of the address space starts at 0x0000000000000000 with only 12 of those zeroes being part of an actual address (resulting in an end at 0x00007FFFFFFFFFFF), while the top half of the address space starts at 0xFFFF800000000000, ending at 0xFFFFFFFFFFFFFFFF.
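
The canonical-address rule described above is easy to express in code. The following is a small sketch assuming the 48 implemented bits mentioned in the text; the macro and function names are mine, not anything from a Windows header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of implemented virtual address bits assumed by this sketch. */
#define IMPLEMENTED_VA_BITS 48

static_assert(IMPLEMENTED_VA_BITS > 1 && IMPLEMENTED_VA_BITS < 64,
              "the check below needs at least one unimplemented bit");

/* An address is canonical when every bit from the highest implemented
 * bit (bit 47 here) up to bit 63 has the same value. */
static bool is_canonical(uint64_t address)
{
    uint64_t upper = address >> (IMPLEMENTED_VA_BITS - 1);
    uint64_t all_ones = (1ull << (64 - IMPLEMENTED_VA_BITS + 1)) - 1;
    return upper == 0 || upper == all_ones;
}
```

The last address of the lower half (0x00007FFFFFFFFFFF) and the first address of the upper half (0xFFFF800000000000) both pass, while anything in the gap between them, such as 0x0000800000000000, does not.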

As newer processors support more of the addressing bits, the lower half of memory will expand upward, toward 0x7FFFFFFFFFFFFFFF, while the upper half of memory will expand downward, toward 0x8000000000000000 (a similar split to today’s memory space, but with 32 more bits). Anyone working with 64-bit code ought to be very familiar with this implementation, since it can have subtle effects on code when the number of implemented bits grows. Even in the 32-bit world, a number of Windows applications (including system code in Windows itself) assume the most significant bit is zero and use it as a flag; since a set bit would turn the value into a kernel-mode address, the application masks this bit off when using it as an address. Now developers get a shot at 16 bits to abuse as flags, sequence numbers and other optimizations that the CPU won’t even know about (on current x64 processors), on top of the usual bits that can be assumed due to alignment or user vs kernel-mode code location. Compiling the 64-bit application for Itanium and testing it would reveal such bugs, but this is beyond the testing capabilities of most developers.
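
As an illustration of the kind of trick being described (and of why it is fragile), here is a hedged sketch of stashing a 16-bit tag in the unimplemented upper bits of a pointer-sized value. It assumes exactly 48 implemented bits, and all names are hypothetical; code like this silently breaks on any CPU that implements more address bits.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 48                        /* assumes 48 implemented bits */
#define VA_MASK   ((1ull << TAG_SHIFT) - 1)

/* Replace the 16 redundant upper bits of a canonical address with a tag. */
static uint64_t tag_pointer(uint64_t canonical, uint16_t tag)
{
    uint64_t top = canonical >> TAG_SHIFT;
    assert(top == 0 || top == 0xFFFF);      /* only canonical input makes sense */
    return (canonical & VA_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

static uint16_t pointer_tag(uint64_t tagged)
{
    return (uint16_t)(tagged >> TAG_SHIFT);
}

/* Strip the tag and rebuild the canonical form by copying bit 47 upward. */
static uint64_t untag_pointer(uint64_t tagged)
{
    uint64_t address = tagged & VA_MASK;
    if (address & (1ull << (TAG_SHIFT - 1)))
        address |= ~VA_MASK;
    return address;
}
```

The value must always go through untag_pointer before being dereferenced; the CPU will fault on the tagged form because it is no longer canonical.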

Examples within Microsoft’s Windows are prevalent: pushlocks, fast references, Patchguard DPC contexts, and singly-linked lists are only some of the common Windows mechanisms which utilize bits within a pointer for non-addressing purposes. It is the last of these which is of interest to us, because of the memory addressing limit it imposed on Windows x64 due to the lack of a CPU instruction (in the initial x64 processors) that the implementation required. First, let’s have a look at the data structure and functionality on 32-bits. If you’re unsure what exactly a singly-linked list is, I suggest a quick read in an algorithm book or Google.

Here is the SLIST_HEADER, the data structure Windows uses to represent the head of the list:

typedef union _SLIST_HEADER {
    ULONGLONG Alignment;
    struct {
        SLIST_ENTRY Next;
        USHORT Depth;
        USHORT Sequence;
    } DUMMYSTRUCTNAME;
} SLIST_HEADER, *PSLIST_HEADER;

Here we have an 8-byte structure, guaranteed to be aligned as such, composed of three elements: the pointer to the next entry (32-bits, or 4 bytes), and depth and sequence numbers, each 16-bits (or 2 bytes). Striving to create lock-free push and pop operations, the developers realized that they could make use of an instruction present on Pentium processors or higher — CMPXCHG8B (Compare and Exchange 8 bytes). This instruction allows the atomic modification of 8 bytes of data, which typically, on a 486, would’ve required two operations (and thus subjected these operations to race conditions requiring a spinlock). By using this native CPU instruction, which also supports the LOCK prefix (guaranteeing atomicity on a multi-processor system), the need for a spinlock is eliminated, and all operations on the list become lock-free (increasing speed).
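
The idea can be sketched in portable C11, with atomic_compare_exchange_weak on a 64-bit value standing in for LOCK CMPXCHG8B. This is not the Windows implementation: 32-bit indices into a node pool stand in for 32-bit pointers, the packing order and all names are my own assumptions, and the sequence number is simply bumped on every push so that a stale snapshot of the header fails the compare.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xFFFFFFFFu                 /* hypothetical "null" index */

typedef struct { uint32_t next; int value; } node;

typedef struct {
    _Atomic uint64_t packed;            /* next:32 | depth:16 | sequence:16 */
    node *pool;                         /* nodes are addressed by index */
} slist;

static uint64_t pack(uint32_t next, uint16_t depth, uint16_t seq)
{
    return (uint64_t)next | ((uint64_t)depth << 32) | ((uint64_t)seq << 48);
}

static void slist_init(slist *l, node *pool)
{
    l->pool = pool;
    atomic_init(&l->packed, pack(NIL, 0, 0));
}

/* Push: link the node to the current head, then swing all 8 bytes at once. */
static void slist_push(slist *l, uint32_t idx)
{
    assert(idx != NIL);
    uint64_t old = atomic_load(&l->packed), new_;
    do {
        l->pool[idx].next = (uint32_t)old;
        new_ = pack(idx, (uint16_t)((old >> 32) + 1),
                         (uint16_t)((old >> 48) + 1));
    } while (!atomic_compare_exchange_weak(&l->packed, &old, new_));
}

/* Pop: returns the index of the removed node, or NIL if the list is empty. */
static uint32_t slist_pop(slist *l)
{
    uint64_t old = atomic_load(&l->packed), new_;
    do {
        uint32_t first = (uint32_t)old;
        if (first == NIL)
            return NIL;
        new_ = pack(l->pool[first].next, (uint16_t)((old >> 32) - 1),
                                         (uint16_t)(old >> 48));
    } while (!atomic_compare_exchange_weak(&l->packed, &old, new_));
    return (uint32_t)old;
}

static uint16_t slist_depth(slist *l)
{
    return (uint16_t)(atomic_load(&l->packed) >> 32);
}

/* Demo: push nodes 0, 1, 2 and pop once; returns 1 if the order is LIFO. */
static int slist_demo(void)
{
    node pool[3] = { {NIL, 10}, {NIL, 20}, {NIL, 30} };
    slist list;
    slist_init(&list, pool);
    for (uint32_t i = 0; i < 3; i++)
        slist_push(&list, i);
    return slist_pop(&list) == 2 && slist_depth(&list) == 2;
}
```

Because the pointer, depth and sequence all live in one 8-byte word, a single compare-and-exchange either publishes a fully consistent header or fails and retries, which is why no spinlock is needed.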

On 64-bit computers, addresses are 64-bits, so the pointer to the next entry must be 64-bits. If we keep the depth and sequence numbers within the same parameters, we require a way to modify at minimum 64+32 bits of data — or better yet, 128. Unfortunately, the first processors did not implement the essential CMPXCHG16B instruction to allow this. The developers had to find a variety of clever ways to squeeze as much information as possible into only 64-bits, which was the most they could modify atomically at once. The 64-bit SLIST_HEADER was born:

struct {  // 8-byte header
        ULONGLONG Depth:16;
        ULONGLONG Sequence:9;
        ULONGLONG NextEntry:39;
} Header8;

The first sacrifice to make was to reduce the space for the sequence number to 9 bits instead of 16 bits, reducing the maximum sequence number the list could achieve. This still only left 39 bits for the pointer — a mediocre improvement over 32 bits. By forcing the structure to be 16-byte aligned when allocated, 4 more bits could be won, since the bottom bits could now always be assumed to be 0. This gives us 43-bits for addresses — we can still do better. Because the implementation of linked-lists is used *either* in kernel-mode or user-mode, but cannot be used across address spaces, the top bit can be ignored, just as on 32-bit machines: the code will assume the address to be kernel-mode if called in kernel-mode, and vice-versa. This allows us to address up to 44-bits of memory in the NextEntry pointer, and is the defining constraint of Windows’ addressing limit.
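
Put together, the squeeze looks roughly like the sketch below. It assumes that kernel-mode addresses all live in the top 8TB of the address space (0xFFFFF80000000000 and up) and user-mode addresses in the bottom 8TB; the function names and the treatment of zero as the empty-list marker are my own choices for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NEXTENTRY_BITS  39
#define ALIGN_SHIFT     4                      /* 16-byte alignment */
#define KERNEL_TOP_BITS 0xFFFFF80000000000ull  /* bits 43..63 set (assumed) */

/* Drop the four alignment bits and everything above bit 42. */
static uint64_t slist_encode_next(uint64_t address)
{
    assert((address & 0xF) == 0);              /* entries must be 16-byte aligned */
    return (address >> ALIGN_SHIFT) & ((1ull << NEXTENTRY_BITS) - 1);
}

/* Rebuild the address; the caller's mode supplies the missing top bits. */
static uint64_t slist_decode_next(uint64_t stored, bool kernel_mode)
{
    if (stored == 0)
        return 0;                              /* treat zero as "no next entry" */
    uint64_t address = stored << ALIGN_SHIFT;  /* 43 significant bits */
    return kernel_mode ? (address | KERNEL_TOP_BITS) : address;
}
```

Any entry that lives outside those two 8TB windows cannot survive the round trip, which is exactly the addressing limit described above.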

44 bits is nothing to laugh at: they allow 16TB of memory to be described, and thus split Windows into two roughly even chunks of 8TB for user-mode and kernel-mode memory. Nevertheless, this is still 16 times smaller than the CPU’s own limit (48 bits is 256TB), and even farther still from the maximum 64 bits. So, with scalability in mind, there do exist some other bits in the SLIST_HEADER which define the type of header that is being dealt with, because yes, there is an official 16-byte header, written for the day when x64 CPUs would support 128-bit Compare and Exchange (which they now do). First, a look at the full 8-byte header:

 struct {  // 8-byte header
        ULONGLONG Depth:16;
        ULONGLONG Sequence:9;
        ULONGLONG NextEntry:39;
        ULONGLONG HeaderType:1; // 0: 8-byte; 1: 16-byte
        ULONGLONG Init:1;       // 0: uninitialized; 1: initialized
        ULONGLONG Reserved:59;
        ULONGLONG Region:3;
 } Header8;

Notice how the “HeaderType” bit is overlaid with the Depth bits, and allows the implementation and developers to deal with 16-byte headers whenever they will be put into use. This is what they look like:

 struct {  // 16-byte header
        ULONGLONG Depth:16;
        ULONGLONG Sequence:48;
        ULONGLONG HeaderType:1; // 0: 8-byte; 1: 16-byte
        ULONGLONG Init:1;       // 0: uninitialized; 1: initialized
        ULONGLONG Reserved:2;
        ULONGLONG NextEntry:60; // last 4 bits are always 0’s
 } Header16;

Note how the NextEntry pointer has now become 60-bits, and because the structure is still 16-byte aligned, with the 4 free bits, leads to the full 64-bits being addressable. As for supporting both these headers and giving the full address-space to compatible CPUs, one could probably expect Windows to use a runtime “hack” similar to the “486 Compatibility Spinlock” used in the old days of NT, when CMPXCHG8B couldn’t always be assumed to be present (although Intel implemented it on the 586 Pentium, Cyrix did not). As of now, I’m not aware of this 16-byte header being used.

So there you have it: an important lesson not only on Windows 64-bit programming, but also on the importance of thinking ahead when making potentially non-scalable design decisions. Windows did a good job at maintaining compatibility and still allowing expansion, but the consequences of attempting to use parts of the non-implemented bits in current CPUs as secondary data may be hard to detect once your software evolves to those platforms, so tread carefully.

Some Vista Tips & Tricks

Here are a few useful tips I’ve discovered (as I’m sure others have) which make my life easier on Vista, and have saved me a lot of trouble.

Fix that debugger!

I had done everything right to get local kernel debugging to work: I added /DEBUG with bcdedit. I used WinDBG in Administrator mode, I even turned off UAC. I made sure that all the debugging permissions were correct for admin accounts (SeDebugPrivilege). I made sure the driver WinDBG uses was extracted. Nothing worked! For some reason, the system was declaring itself as not having a kernel debugger enabled. I searched Google for answers, and found other people were experiencing the same problem, including a certain “Unable to enable kernel debugger, NTSTATUS 0xC0000354 An attempt to do an operation on a debug port failed because the port is in the process of being deleted” error message.

Here’s what I did to fix it:

  1. I used bcdedit to remove and re-add the /debug command: bcdedit /debug off then bcdedit /debug on
  2. I used bcdedit to force the debugger to be active and to listen on the Firewire port: bcdedit /dbgsettings 1394 /start active
  3. I used bcdedit to enable boot debugging: bcdedit /bootdebug
  4. I rebooted, and voila! I was now able to use WinDBG in local kernel debugging mode.

Get the debugger state

In Vista, one must boot with /DEBUG in order to do any local kernel debugging work, and as we all know, this disables DVD and high-def support to “ensure a reliable and secure playback environment”. After a couple of days’ work, I find myself sometimes forgetting whether I had booted in debug mode or not, and I don’t want to start WinDBG all the time to check. Or maybe you’re in a non-administrative environment and don’t want to use WinDBG. It’s even possible a new class of malware may soon start adding /DEBUG to systems in order to better infiltrate kernel-mode. Here are two ways to check the state:

  1. Navigate to HKLM\System\CurrentControlSet\Control and look at the SystemStartOptions value. If you see /DEBUG, the system was booted in debugging mode. However, because it may have been booted in “auto-enable” mode, or in “block enable” mode (in which the user can then enable/disable the debugger at will), this may not tell you the whole story.
  2. Use kdbgctrl, which ships with the Debugging Tools for Windows. kdbgctrl -c will tell you whether or not the kernel debugger is active right now.

What would be really worthwhile, is to see if DVD playback works with /DEBUG on, but kernel debugging disabled with kdbgctrl, and if it stops playing (but debugging works) if kdbgctrl is used to enable it.

Access those 64-bit system binaries from your 32-bit app

I regularly have to use IDA 64 to look at binaries on my Vista system, since I’m using the 64-bit edition. Unfortunately, IDA 64 itself is still a 32-bit binary, which was great when I had a 32-bit system, but not so great anymore. The reason is WOW64’s file redirection, which changes every access to “System32” to “SysWOW64”, the directory containing 32-bit DLLs. But IDA 64 is perfectly able to handle 64-bit binaries (and has to!), so there wasn’t an easy way to access those DLLs. Unfortunately, it looks like the IDA developers haven’t heard of Wow64DisableWow64FsRedirection, so I resorted to copying the binaries I needed out of the directory manually.

However, after reverse engineering some parts of WOW64 to figure out where the WOW64 context is saved, I came across a strange string — “sysnative”. It turns out that after creating (with a 64-bit app, such as Explorer) a “sysnative” directory in my Windows path, any 32-bit application navigating to it will now get the “real” (native) contents of the System32 directory! Problem solved!

So there you have it! I hope at least one of these was useful (or will be), and feel free to share your own!

Introducing Haute Secure

For the last couple of months I’ve had the chance to meet and work with some of the brightest developers and people behind what I think is a pretty revolutionary way to secure the online experience of users: the team behind Haute Secure.

In short, Haute Secure is a Malware Filter, much like a Phishing or Spam Filter in existing applications. It provides a beautiful (you really have to see it!) interface and toolbar to IE (and soon Firefox) which protects users from incoming malware on a variety of levels, starting from the site level to the execution level. If cnn.com were to be hacked tomorrow with an unreleased exploit that would attempt to download a worm or other malware on visitors’ machines, Haute would be able to detect that, and block the exploit from happening. When this happens, Haute will communicate with its servers and post a notification, so a site becomes known “bad” as users stumble upon it. But Haute doesn’t only rely on its users; it also ships with a very large database of malicious sites out there. Haute is also smart enough to avoid tagging an entire domain as “bad”. Many sites such as MySpace, Yahoo and others can host individual user content, and don’t deserve to be blacklisted due to certain sub-sites. Haute can blacklist only certain parts of a domain, such as a user’s site, and will also tag the site with a warning, to let users know that -some- pages may be dangerous.

Sandi did a pretty good review of the product on her blog, but as someone who’s actually worked on the product and has intimate knowledge of its behavior (as well as having worked on similar products in the past), I’d like to give my own technical review and explain why I think Haute is way ahead of the pack when it comes to this market.

The first reason I love this product so much is because unlike almost all anti-virus products, firewalls and IPS software, it’s actually written to properly interface with the OS. It’s fully compatible with Vista, even 64-bits, and co-exists with PatchGuard and other integrity mechanisms. The driver behind Haute Secure (and yes, it’s a driver, not a collection of user-mode hooking DLLs!) makes use of all the filtering technology available in NT without sacrificing functionality.

The second thing that I think is exciting about Haute is the fact that it strongly relies on a community of users, and not on hard-coded rules or filters (although, like I said, it does come with a large database already). I used to work on a product called SPAMfighter ages ago, and I saw how filtering spam became much more powerful when it was driven by people’s responses, and not by AI. Of course, Haute also must implement some smart algorithms if it thinks a site is legitimate, to perform correctly in the case where malware is being installed through an exploit. Finally, Haute also has the ability to allow users to report false positives. Because of this user input, which even includes an entire community site where users can compete against each other in terms of number of bad sites reported, Haute can respond much quicker to malicious websites, and de-blacklist fixed sites much quicker as well.

Last but not least, Haute is being worked on and designed by some very bright people with extensive experience in this area. As I said earlier, I’ve also had the chance to contribute some knowledge and code into the product, and I felt that the design was very solid and ready to be extended to other products if that path will ever open. It’s one of the reasons why Firefox support is something being worked on, and shows that Haute isn’t in any way hacked around IE.

While some of the ideas and concepts behind Haute may have been attempted by other companies and products before, I really feel that Haute has all the right stuff it needs to be user friendly, powerful, and pro-active in protecting its users. The community-centric approach will also surely pay off into making an even better product. In many ways, I see it as the iPhone of its kind (if you agree with me that the iPhone is a success story).

A New Direction

It is with great excitement (and a certain amount of nostalgia) that I would like to announce two important changes in my professional life and in the direction in which I will pursue my knowledge and work on Windows Internals. The first of these changes is my debut as an instructor for David Solomon’s Expert Seminars, and the second is my departure from ReactOS, effective immediately. These plans do not change in any way my internship at Apple which will take place during the summer.

Some time ago, I had the great privilege of being approached by David Solomon, a well-known and highly regarded computer expert, teacher, consultant and co-author of Windows Internals 4th Edition (and Inside Windows 2000, 3rd Edition). For the last couple of years, David had been working with Mark Russinovich, another respected figure in the world of Windows Internals, and co-founder of Winternals and Sysinternals as well as developer of some of the most useful Windows system tools available today. Apart from working on the two books (which Mark was a co-author of), they both provided trainings and seminars on Windows internals under the “David Solomon Expert Seminars” banner. As is widely known, Microsoft realized that Mark’s experience and amazing work on the NT platform through his articles and tools could provide a highly beneficial new addition to the company. The company bought Winternals last year, and hired Mark at the highest technical level in the company, Technical Fellow. 

All this is history of course. Back to the matter at hand: Mark’s new employment made him unavailable for teaching new classes, which led David Solomon to start the search for a new instructor who could take on that responsibility. I was highly honoured to have been chosen as this person, and accepted this unique opportunity to bring my knowledge out to many more people and to work with one of my most admired Windows experts.

With this new job as an added task on top of my already busy life, as well as with the imminent Apple internship, I was already planning to cut back on my involvement with ReactOS. However, since it became clear that my level of contact with Microsoft employees and resources would be in conflict with my work at ReactOS, I made the difficult choice of amicably severing my ties with the project. This decision took some time for me to finalize, but the various motivations behind it had started cropping up early this year.

When I first joined ReactOS 3 years ago, the kernel was, in my opinion, a highly disorganized hodgepodge of Linux, NT 4, Wine and Windows 9x code which was very far from its actual goal of NT driver compatibility. In fact, the development model seemed to focus on hacking NT drivers to work on ReactOS, and not vice-versa. Coincidentally, I joined the project just as the lead kernel developer, David Welch, had burnt out and moved on to other projects and goals. Over the last three years, I rewrote key subsystems such as the thread scheduler, dispatcher, locking and IRQL mechanisms, HAL, executive support, object manager, process manager, I/O manager, basic VDM and 8086 support, and much more, and switched the project goals from NT4 to NT 5.2.

My ability to do this came from my extensive reverse engineering of the kernel in the past, reading internals books, access to the DDK/IFS, as well as using WinDBG and .pdb type information. In exchange for all the code and guidance I provided, the project gave me a lot in return, including a unique perspective from working on such a project, the ability to work in large and distributed teams, and experience using open source tools for Windows NT kernel development. With millions of lines of code, ReactOS is the kind of project that an 18 year old could’ve only dreamt of working on. I became adept at source control repositories, regression testing, unit testing, team management and IRC administration, and became a much better coder in C. I also made friendships of all levels with various developers, testers and users, and had a chance to mentor two students during last year’s Google Summer of Code. I was able to attend and give talks on ReactOS, exhibit it, and make connections with other people in the industry, and in the open source world. Overall, it’s been an exhilarating adventure.

After three years, however, and with the many new responsibilities that kept growing, my free time grew short. Additionally, my work in the kernel had almost reached completion. The parts that still need major work, in my opinion, require developers who are extremely skilled in those areas if they are ever to be as close to NT as needed. They are also some of the most critical: the memory manager, the cache manager, the Power/PnP Manager, the configuration manager and the file system runtime library. With the current differences that exist, most modern WDM drivers as well as IFS drivers can only dream of running properly. Unfortunately, my knowledge in those areas was limited. I had never reverse engineered them as extensively as parts of the executive, and documentation on their guts is limited. In all honesty, they’re also not parts of the system that interest me much. I could, of course, have continued working on user-mode parts of the system where my help would still bring a lot of the system forward, such as ntdll, csrss, smss, winsock and kernel32, but my interest in teaching with David Solomon and getting in touch with the developers behind NT outweighed that desire.

Over these three years, I gained a tremendous amount of knowledge and skills while working on ReactOS. Now the time has come for me to learn even more by expanding my horizons. In many ways, I had already outgrown the project, focusing more on security research, utilities and tools, articles and non-ReactOS related talks and conferences. It was time for me to step outside and take on a new opportunity with a larger audience, one which would bring me many new experiences and teachings. I wish the ReactOS Project all the luck and I know that some significant new changes are on the horizon for them. I will keep watching from a distance, and I thank them for the most fun years of my life.

This blog will continue as usual, and I am currently working on the fourth part of the SDB series. Thank you for your continued readership and support!