IA-32lib

Document Revision 1.01.030609

This is intended to be a short description of the ia32lib library. You can use the links below to jump to the different sections. Use the HOME key on your keyboard to return to the top. If you want to get the big picture without reading too much, I would suggest skipping the Reference section. If you don't feel like reading this at all, scan through the Examples section.

The IA32Lib toolkit has been designed and developed by Kamen Yotov. Direct your questions, suggestions and concerns to kamen@yotov.org.

We will appreciate any possible feedback you might have at every stage (usage, design, source, documentation...). Enjoy!

Download!

[ Introduction | Overview | Requirements | Installation | Reference | Examples | Future Work ]

Introduction

Modern processors based on IA-32 have performance counter registers that allow programmers to collect various statistics about their running applications. Programming these counters to count exactly what a programmer wants, and reading their values, requires access to the so-called Model Specific Registers (MSRs) of the processor. There are several instructions in the IA-32 ISA that provide access to these special registers (e.g. RDMSR, WRMSR, RDPMC), but they are either privileged (restricted to kernel-mode, ring-0 execution) or subject to other restrictions which, combined with the security model of the operating system, prevent the application programmer from using them. The main purpose of the ia32lib library is to export a user-programmable interface to these crucial performance measurement facilities and to provide appropriate ways for detailed processor detection (family, model, cache configuration, ...).

Both the library and its full source code are free for personal use and can be freely downloaded. I have not yet figured out the policy for commercial uses, but who knows...

Overview

ia32lib consists of two main parts plus some examples:

  • ia32.sys - a Windows NT/2K/XP Kernel-Mode Driver that provides access to IA-32's MSRs;
  • ia32.lib + ia32.h - a static library which provides an easy interface to ia32.sys and some other nice features like CPU model and cache configuration detection;
  • ia32detect.cpp and ia32p6.cpp are two examples that use ia32lib. I will pay more attention to them later (See examples).

The sources to build ia32.sys are provided for completeness and educational purposes only. It is not advisable to try rebuilding ia32.sys unless you really know what you are doing. Furthermore, you will need the Microsoft Windows NT Driver Development Kit (freely available from Microsoft's site).

Requirements

  • A PC running Microsoft Windows NT / 2K / XP;
  • Microsoft Visual C++ 6.0 or Microsoft Visual Studio.NET (7.0)
  • Optional: Intel C++ Optimizing Compiler 5.0.

NOTES: The compilers ia32lib has been tested with so far are Microsoft C++ 6.0 and 7.0 and Intel C++ Optimizing Compiler 5.0. Every effort has been made to make the code portable to other compilers, but no tests have been performed so far. The next compilers to look at will probably be Watcom C++ 11.0c and Borland C++ 5.5. Compatibility with the first three mentioned above is guaranteed as long as the development effort continues.

Installation

  1. Download the distribution (If you have not already done so);
  2. Start ia32lib.exe - this will unpack all files to a directory of your choice;
  3. Install the ia32.sys Kernel-Mode Driver on your Windows system (step-by-step instructions);
  4. Open ia32lib.dsw workspace or ia32lib.sln solution with the appropriate version of Microsoft Visual C++ (6.0 or 7.0.NET respectively);
  5. You are ready to go! Try building the two sample programs (ia32detect and ia32p6).

NOTES: Installation of the ia32.sys driver does not require system restart on Windows XP. The installation instructions are also for Windows XP, but the steps should be isomorphic to the steps for Windows NT and Windows 2000. Note that at this point I have not tried the driver on Windows NT and Windows 2000, but it should work :). If you have any problems, mail me!

The directory structure of the distribution is as follows:

  • ia32 - root directory of the distribution
    • doc - documentation directory
      • steps - images for step-by-step driver installation
        • step0.png
        • ...
        • step15.png
      • generic.css - style sheet for the documentation
      • install.htm - step-by-step driver installation instructions
      • index.htm - main documentation file (similar to this one, if not the same :)
    • drv - NT Kernel-Mode Driver directory
      • source - driver sources
        • makefile - required part of the NT DDK environment build process
        • sources - required part of the NT DDK environment build process
        • ia32.c - main driver source, based on the portio example in the NT DDK
        • ring0.c - IA-32 assembly support routines
        • ring0.h - IA-32 assembly support routines header
        • ia32.rc - driver version info resource
      • ia32.inf - driver installation information file (needed by "Add Hardware Wizard" to install the driver)
      • ia32.sys - compiled binary of the driver itself
    • inc - host for all header files of the ia32lib library
      • ia32.h - main header file, includes all others. This is the only one you need to include
      • ia32cache.h - describes possible cache configurations for IA-32
      • ia32counter.h - defines abstract base class for performance counters
      • ia32def.h - defines basic types
      • ia32detect.h - defines the IA-32 CPU detection class
      • ia32driver.h - provides interface constants for use with the driver. Also included by the driver
      • ia32error.h - defines error exception class
      • ia32ring0.h - defines class to expose driver API to the application
      • ia32size.h - defines auxiliary class for managing memory sizes
      • p6counter.h - specializes ia32counter for the Intel P6 processor family (Pentium Pro, II and III)
    • lib - source files needed to build ia32.lib
      • ia32.cpp - used solely for pre-compiled header generation (Microsoft Visual C++ feature)
      • ia32cache.cpp - initializers for known cache configurations... needs additions, so keep an eye on it
      • ia32counter.cpp - initializes ia32counter's static variable "count"
    • out - here all output files of the build process are placed
      • ia32.lib - pre-built release version of the library
      • ia32detect.exe - pre-built release version of the ia32detect sample application
      • ia32p6.exe - pre-built release version of the ia32p6 sample application (requires the ia32.sys driver to be installed)
    • prj - support files for Microsoft Visual C++
      • vc.6 - support files for Microsoft Visual C++ 6.0
        • ia32detect - ia32detect.exe sample application project directory
          • ia32detect.dsp - ia32detect sample application project file
        • ia32lib - ia32.lib library project directory
          • ia32lib.dsp - ia32.lib library project file
        • ia32p6 - ia32p6.exe sample application project directory
          • ia32p6.dsp - ia32p6.exe sample application project file
        • ia32lib.dsw - Microsoft Visual C++ 6.0 Project Workspace (open this thing inside the environment)
      • vc.net - support files for Microsoft Visual C++ .NET
        • ia32detect - ia32detect.exe sample application project directory
          • ia32detect.vcproj - ia32detect sample application project file
        • ia32lib - ia32.lib library project directory
          • ia32lib.vcproj - ia32.lib library project file
        • ia32p6 - ia32p6.exe sample application project directory
          • ia32p6.vcproj - ia32p6.exe sample application project file
        • ia32lib.sln - Microsoft Visual C++ .NET Solution (open this thing inside the environment)
    • src - examples source directory
      • ia32detect - source directory for the ia32detect.exe example
        • ia32detect.cpp - source for the ia32detect.exe example
      • ia32p6 - source directory for the ia32p6.exe example
        • ia32p6.cpp - source for the ia32p6.exe example

Reference

[ ia32def.h | ia32size.h | ia32error.h | ia32driver.h | ia32ring0.h | ia32cache.h | ia32detect.h | ia32counter.h | p6counter.h ]

This part is a mostly top-down description of all features in the library. Each header file is discussed separately and in detail. If you don't feel like reading, this is the part to skip :).

ia32def.h

types
  name equivalent
  byte unsigned char
  word unsigned
  bit unsigned
  uint8 unsigned __int8
  uint16 unsigned __int16
  uint32 unsigned __int32
  uint64 unsigned __int64

Notes:

  • bit is used in structured bit-fields (see ia32detect.h for examples).

Back to Reference...

ia32size.h

Constants
  Name Value
  B (uint64)1
  KB (1024 * B)
  MB (1024 * KB)
  GB (1024 * MB)
  TB (1024 * GB)
Classes
  Name Definition
  ia32size
class ia32size
	{
	    uint64 size;
	public:
	    ia32size (uint64);
	    operator const string () const;
	    operator const uint64 () const;
	};

Notes:

  • ia32size's purpose is to convey memory sizes in easy to read textual format;
  • ia32size::ia32size(uint64) constructs an instance for a specific capacity value;
  • ia32size::operator string () const is used to convert the encapsulated value to a string (see example below);
  • ia32size::operator uint64 () const is used to return the encapsulated value in native integer format.

Example:

#include <stdio.h>
	#include "ia32size.h"

	int main ()
	{
	    printf("%8s\n", ((string)ia32size(16)).c_str());
	    printf("%8s\n", ((string)ia32size(1024)).c_str());
	    printf("%8s\n", ((string)ia32size(4096)).c_str());
	    printf("%8s\n", ((string)ia32size(3 * 1024 * 1024)).c_str());
	    printf("%8s\n", ((string)ia32size((uint64)13 * 1024 * 1024 * 1024 * 1024)).c_str());
	    printf("%8s\n", ((string)ia32size((uint64)13 * 1024 * 1024 * 1024 * 1024 + (uint64)7 * 1024 * 1024 * 1024)).c_str());

	    printf("%8I64u\n", (uint64)ia32size(12345678));
	}
	

Output:

    16 B
	     1KB
	     4KB
	     3MB
	    13TB
	 13319GB
	12345678
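
The output above suggests that the text conversion picks the largest unit that divides the value exactly (13 TB + 7 GB is printed as 13319GB, not as a rounded TB figure). Below is a minimal sketch of that rule in portable C++. It is not the code from ia32size.h: the name format_size and the unit table are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, NOT part of ia32lib: formats a byte count using the
// largest binary unit that divides it without a remainder.
std::string format_size(std::uint64_t bytes)
{
    // Two-character unit suffixes, matching the spacing seen in the output.
    static const char *const units[] = { " B", "KB", "MB", "GB", "TB" };
    const int unit_count = sizeof(units) / sizeof(units[0]);

    int u = 0;
    // Step up one unit at a time while no precision would be lost.
    while (u + 1 < unit_count && bytes != 0 && bytes % 1024 == 0)
    {
        bytes /= 1024;
        ++u;
    }
    assert(u < unit_count);
    return std::to_string(bytes) + units[u];
}
```

With this rule, a value such as 12345678 (not a multiple of 1024) stays in bytes, which is why the last line of the example reads the value back as an integer instead.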
	

Back to Reference...

ia32error.h

Classes
  Name Definition
  ia32error
class ia32error
	{
	public:
	    enum err_
	    {
	        err_generic,
	        err_ring0_cpu,
	        err_ring0_create,
	        err_ring0_ioctl,
	        err_ring0_size,
	        err_ring0_close,
	        err_counter_overflow,
	        err_counter_family,
	        err_counter_MMX,
	        err_counter_SSE,
	        err_counter_counter,
	        err_invalid
	    };

	    ia32error (err_);
	    operator const char * () const;
	protected:
	    err_ v;
	};

Notes:

  • ia32error is a class whose instances are thrown as exceptions;
  • enum ia32error::err_ enumerates the different error values;
  • ia32error::ia32error (err_) initializes an instance to a particular error value;
  • ia32error::operator const char * () const converts the encapsulated error value to a string (suitable for error display). For the list of specific string values, look inside ia32error.h;
  • most ring-0 routines throw ia32errors as exceptions. For specific examples see the ia32p6 sample.

Back to Reference...

ia32driver.h

Constants
  Name Value
  IA32CPU_TYPE 40000
  IOCTL_IA32CPU_READ_MSR CTL_CODE(IA32CPU_TYPE, 0x900, METHOD_BUFFERED, FILE_READ_ACCESS)
  IOCTL_IA32CPU_WRITE_MSR CTL_CODE(IA32CPU_TYPE, 0x901, METHOD_BUFFERED, FILE_WRITE_ACCESS)

Notes:

  • Constants defined in this header are used both by the ia32.sys kernel-mode driver and by the ia32ring0.h driver interface header;
  • IOCTL_IA32CPU_XXX_MSR are needed to complete DeviceIoControl system calls to the driver.
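
To see what the two IOCTL constants actually evaluate to, here is a small portable sketch of how the Windows CTL_CODE macro packs its four fields into one 32-bit code. The function ctl_code and the lowercase constant names are stand-ins written for this illustration (so that the sketch builds without the Windows headers); they are not part of ia32driver.h.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for CTL_CODE: device type in bits 31..16, required
// access in bits 15..14, function number in bits 13..2, method in bits 1..0.
inline std::uint32_t ctl_code(std::uint32_t device_type, std::uint32_t function,
                              std::uint32_t method, std::uint32_t access)
{
    assert(device_type < 0x10000 && function < 0x1000 && method < 4 && access < 4);
    return (device_type << 16) | (access << 14) | (function << 2) | method;
}

// Numeric values of the Windows constants used in ia32driver.h.
const std::uint32_t method_buffered   = 0; // METHOD_BUFFERED
const std::uint32_t file_read_access  = 1; // FILE_READ_ACCESS
const std::uint32_t file_write_access = 2; // FILE_WRITE_ACCESS

const std::uint32_t ia32cpu_type = 40000;  // IA32CPU_TYPE (0x9C40)

// The two codes from the table above, recomputed with the stand-in.
const std::uint32_t ioctl_read_msr =
    ctl_code(ia32cpu_type, 0x900, method_buffered, file_read_access);
const std::uint32_t ioctl_write_msr =
    ctl_code(ia32cpu_type, 0x901, method_buffered, file_write_access);
```

Under these definitions the read code comes out as 0x9C406400 and the write code as 0x9C40A404.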

Back to Reference...

ia32ring0.h

Classes
  Name Definition
  ia32ring0
class ia32ring0
	{
	    HANDLE h;
	public:
	    ia32ring0 ();
	    uint64 rdmsr (uint32 i) const;
	    void wrmsr (uint32 i, uint64 d) const;
	    ~ia32ring0 ();
	};
	

Notes:

  • ia32ring0 is the exported user-level API to the ia32.sys driver, used to read and write IA-32 Model Specific Registers (MSRs);
  • ia32ring0::ia32ring0 () initializes a connection to the driver;
  • uint64 ia32ring0::rdmsr (uint32 i) const uses the driver to read the i-th MSR and returns its value;
  • void ia32ring0::wrmsr (uint32 i, uint64 d) const uses the driver to write the i-th MSR with the value contained in d;
  • ia32ring0::~ia32ring0 () closes the connection to the driver.

Back to Reference...

ia32cache.h

Classes
  Name Definition
  ia32cache
class ia32cache
	{
	public:
	    enum type_
	    {
	        type_reserved,
	        type_unified,
	        type_instruction,
	        type_trace,
	        type_data,
	        type_invalid
	    };

	    enum _
	    {
	        level_TLB = -1,
	        associativity_Full = -1,
	        block_AnySize = 0
	    };

	    const byte descriptor;
	    const type_ type;
	    const int level;
	    const ia32size capacity;
	    const ia32size block;
	    const int associativity;

	    ia32cache (byte, type_, int, ia32size, ia32size, int);
	    operator const string () const;
	protected:
	    const char * type_text () const;
	    const string associativity_text () const;
	};
Variables
  Name Declaration
  ia32caches extern const ia32cache ia32caches[];
Functions
  Name Prototype
  _ia32cache const ia32cache &_ia32cache (byte);

Notes:

  • ia32cache is a class describing cache memory parameters. For now a number of predefined instances exist (see ia32cache.cpp for the complete listing), but in the future it will also be used to describe caches detected empirically by software;
  • enum ia32cache::type_ enumerates the different types of caches supported;
  • enum ia32cache::_ enumerates some special values for otherwise integer fields like block size and associativity;
  • const byte ia32cache::descriptor contains the IA-32 defined byte descriptor of the cache;
  • const type_ ia32cache::type contains the type of the cache;
  • const int ia32cache::level contains the cache level (-1 means TLB cache);
  • const ia32size ia32cache::capacity contains the size of the cache;
  • const ia32size ia32cache::block contains the block size of the cache (0 means "Any Size" for page sizes in TLB caches);
  • const int ia32cache::associativity contains the associativity of the cache (-1 means "Fully-Associative");
  • ia32cache::ia32cache (byte, type_, int, ia32size, ia32size, int) initializes a cache instance;
  • ia32cache::operator const string () converts the cache instance to a nice looking string representation (see the ia32detect sample for detailed examples);
  • const char * ia32cache::type_text () const returns a text representation of the current value of the type field;
  • const string associativity_text () const returns a text representation of the current value of the associativity field;
  • ia32cache ia32caches[] contains pre-initialized cache instances for all descriptors known so far;
  • const ia32cache &_ia32cache (byte) searches a cache instance by descriptor in the above array.
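
The lookup by descriptor can be pictured as a search through a small table. The sketch below is a simplified stand-in, not the library code: cache_info, known_caches, find_cache and total_capacity are hypothetical names, the table holds only three entries taken from the ia32detect sample output in this document, and returning a null pointer for an unknown descriptor is my assumption (the real _ia32cache returns a reference).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified, hypothetical stand-in for ia32cache (the real class has more fields).
struct cache_info
{
    std::uint8_t  descriptor;    // IA-32 cache descriptor byte
    int           level;         // 1 = L1, 2 = L2
    std::uint64_t capacity;      // total size in bytes
    std::uint64_t block;         // line size in bytes
    int           associativity; // number of ways
};

// A few entries copied from the ia32detect sample output.
static const cache_info known_caches[] =
{
    { 0x08, 1,  16 * 1024, 32, 4 }, // L1 instruction cache
    { 0x0c, 1,  16 * 1024, 32, 4 }, // L1 data cache
    { 0x83, 2, 512 * 1024, 32, 8 }, // L2 unified cache
};

// Linear search by descriptor; returns a null pointer for unknown descriptors.
const cache_info *find_cache(std::uint8_t descriptor)
{
    const std::size_t n = sizeof(known_caches) / sizeof(known_caches[0]);
    for (std::size_t i = 0; i < n; ++i)
        if (known_caches[i].descriptor == descriptor)
            return &known_caches[i];
    return nullptr;
}

// Walks a null-terminated descriptor stream (the same shape as
// ia32detect::cache) and sums the capacity of every descriptor in the table.
std::uint64_t total_capacity(const std::uint8_t *stream)
{
    assert(stream != nullptr);
    std::uint64_t total = 0;
    for (; *stream != 0; ++stream)
    {
        const cache_info *c = find_cache(*stream);
        if (c != nullptr)
            total += c->capacity;
    }
    return total;
}
```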

Back to Reference...

ia32detect.h

Classes
  Name Definition
  ia32detect
class ia32detect
	{
	public:
	    enum type_
	    {
	        type_OEM,
	        type_OverDrive,
	        type_Dual,
	        type_reserved
	    };

	    enum brand_
	    {
	        brand_na,
	        brand_Celeron,
	        brand_PentiumIII,
	        brand_PentiumIIIXeon,
	        brand_reserved1,
	        brand_reserved2,
	        brand_PentiumIIIMobile,
	        brand_reserved3,
	        brand_Pentium4,
	        brand_invalid
	    };

	    struct version_
	    {
	        bit Stepping  : 4;
	        bit Model     : 4;
	        bit Family    : 4;
	        bit Type      : 2;
	        bit Reserved1 : 2;
	        bit XModel    : 4;
	        bit XFamily   : 8;
	        bit Reserved2 : 4;
	    };

	    struct misc_
	    {
	        byte Brand;
	        byte CLFLUSH;
	        byte Reserved;
	        byte APICId;
	    };

	    struct feature_
	    {
	        bit FPU       : 1; // Floating Point Unit On-Chip
	        bit VME       : 1; // Virtual 8086 Mode Enhancements
	        bit DE        : 1; // Debugging Extensions
	        bit PSE       : 1; // Page Size Extensions
	        bit TSC       : 1; // Time Stamp Counter
	        bit MSR       : 1; // Model Specific Registers
	        bit PAE       : 1; // Physical Address Extension
	        bit MCE       : 1; // Machine Check Exception
	        bit CX8       : 1; // CMPXCHG8 Instruction
	        bit APIC      : 1; // APIC On-Chip
	        bit Reserved1 : 1; 
	        bit SEP       : 1; // SYSENTER and SYSEXIT instructions
	        bit MTRR      : 1; // Memory Type Range Registers
	        bit PGE       : 1; // PTE Global Bit
	        bit MCA       : 1; // Machine Check Architecture
	        bit CMOV      : 1; // Conditional Move Instructions
	        bit PAT       : 1; // Page Attribute Table
	        bit PSE36     : 1; // 32-bit Page Size Extension
	        bit PSN       : 1; // Processor Serial Number
	        bit CLFSH     : 1; // CLFLUSH Instruction
	        bit Reserved2 : 1;
	        bit DS        : 1; // Debug Store
	        bit ACPI      : 1; // Thermal Monitor and Software Controlled Clock Facilities
	        bit MMX       : 1; // Intel MMX Technology
	        bit FXSR      : 1; // FXSAVE and FXRSTOR Instructions
	        bit SSE       : 1; // Intel SSE Technology
	        bit SSE2      : 1; // Intel SSE2 Technology
	        bit SS        : 1; // Self Snoop
	        bit Reserved3 : 1;
	        bit TM        : 1; // Thermal Monitor
	        bit Reserved4 : 2;
	    };

	    string vendor;
	    string brand;
	    version_ version;
	    misc_ misc;
	    feature_ feature;
	    byte *cache;

	    ia32detect ();
	    const string version_text () const;
	protected:
	    const char * type_text () const;
	    const string brand_text () const;
	private:
	    uint32 init0 ();
	    void init1 (uint32 *d);
	    void process2 (uint32 d, bool c[]);
	    void init2 (byte count);
	    void init0x80000000 ();
	};
	

Notes:

  • enum ia32detect::type_ enumerates CPU types for the version.Type field;
  • enum ia32detect::brand_ enumerates CPU brands for the misc.Brand field;
  • struct ia32detect::version_ (version field) describes CPU version information as returned by the CPUID instruction;
  • struct ia32detect::misc_ (misc field) describes CPU miscellaneous information as returned by the CPUID instruction;
  • struct ia32detect::feature_ (feature field) describes CPU feature information as returned by the CPUID instruction;
  • string ia32detect::vendor specifies the CPU vendor ("GenuineIntel" for Intel CPUs);
  • string ia32detect::brand specifies the CPU brand string, when supported;
  • byte *ia32detect::cache specifies a null-terminated stream of cache descriptors;
  • ia32detect::ia32detect () initializes an instance of the class by (multiple) use of CPUID instruction;
  • const string ia32detect::version_text () returns a string representation of the version field;
  • const char *ia32detect::type_text () returns a string representation of the type field;
  • const string ia32detect::brand_text () returns a string representation of the misc.Brand field;
  • all the private members are auxiliary routines to simplify the work of the constructor.
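
The version_ and feature_ structures mirror the bit layout of the registers returned by CPUID. Since the in-memory layout of C++ bit-fields is compiler-specific, the sketch below decodes the same fields with explicit shifts and masks. The names cpu_version, decode_version and has_feature are hypothetical, and the bit positions are simply read off the struct definitions above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain-integer counterpart of ia32detect::version_.
struct cpu_version
{
    unsigned stepping, model, family, type, xmodel, xfamily;
};

// Splits the 32-bit version word into its fields, lowest bits first,
// in the same order and widths as the version_ bit-field.
inline cpu_version decode_version(std::uint32_t eax)
{
    cpu_version v;
    v.stepping = eax & 0xF;          // bits  3..0
    v.model    = (eax >> 4) & 0xF;   // bits  7..4
    v.family   = (eax >> 8) & 0xF;   // bits 11..8
    v.type     = (eax >> 12) & 0x3;  // bits 13..12
    v.xmodel   = (eax >> 16) & 0xF;  // bits 19..16
    v.xfamily  = (eax >> 20) & 0xFF; // bits 27..20
    return v;
}

// Bit positions of a few feature_ members, counted from the struct above.
const unsigned feature_bit_MMX  = 23;
const unsigned feature_bit_SSE  = 25;
const unsigned feature_bit_SSE2 = 26;

// Tests one bit of the feature word.
inline bool has_feature(std::uint32_t edx, unsigned bit)
{
    assert(bit < 32);
    return ((edx >> bit) & 1u) != 0;
}
```

For instance, a version word of 0x000006B1 decodes to family 6, model 11, stepping 1, which corresponds to the "6.11.1" shown in the ia32detect sample output.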

Back to Reference...

ia32counter.h

Classes
  Name Definition
  ia32counter
class ia32counter
	{
	protected:
	    static uint32 count;
	    uint32 index;
	public:
	    ia32counter (uint32 counters);
	};
	

Notes:

  • ia32counter is an abstract base class for performance monitoring hardware counters;
  • static uint32 ia32counter::count accumulates the number of instances created;
  • uint32 ia32counter::index contains the hardware index of this instance;
  • ia32counter::ia32counter (uint32 counters) initializes the index and checks for structural hazards (enough hardware counters).
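
The bookkeeping described above can be sketched as follows. This is an illustration of the idea, not the contents of ia32counter.h: the class name counter_base, the slot() accessor and the use of std::runtime_error (instead of ia32error) are assumptions, and the sketch never gives a slot back on destruction because the class definition above declares no destructor.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for ia32counter: hands out hardware counter slots
// in creation order and refuses to create more instances than slots exist.
class counter_base
{
protected:
    static std::uint32_t count; // number of instances created so far
    std::uint32_t index;        // hardware slot claimed by this instance
public:
    explicit counter_base(std::uint32_t counters)
    {
        // Structural hazard check: are all hardware counters already taken?
        if (count >= counters)
            throw std::runtime_error("no free hardware counter");
        index = count++;
        assert(index < counters);
    }

    std::uint32_t slot() const { return index; }
};

std::uint32_t counter_base::count = 0;
```

A P6-style derived class would pass 2 as the limit, so the first two instances get slots 0 and 1 and a third one fails at construction.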

Back to Reference...

p6counter.h

Classes
  Name Definition
  p6counter
class p6counter: public ia32counter
	{
	public:
	    enum event_
	    {
	        // Data Cache Unit (DCU)
	        DCU_MEMORY_REFERENCE         = 0x43, // DATA_MEM_REFS
	        DCU_LINES_IN                 = 0x45,
	        DCU_M_LINES_IN               = 0x46,
	        DCU_M_LINES_OUT              = 0x47,
	        DCU_MISS_OUTSTANDING         = 0x48,

	        // Instruction Fetch Unit (IFU)
	        IFU_IFETCH                   = 0x80,
	        IFU_IFETCH_MISS              = 0x81,
	        IFU_TLB_MISS                 = 0x85, // ITLB_MISS
	        IFU_MEMORY_STALL             = 0x86,
	        IFU_ILD_STALL                = 0x87, // ILD_STALL

	        // L2 Cache
	        L2_IFETCH                    = 0x28,
	        L2_LOADS                     = 0x29, // L2_LD
	        L2_STORES                    = 0x2A, // L2_ST
	        L2_LINES_IN                  = 0x24,
	        L2_LINES_OUT                 = 0x26,
	        L2_M_LINES_IN                = 0x25,
	        L2_M_LINES_OUT               = 0x27,
	        L2_REQUEST                   = 0x2E, // L2_RQSTS
	        L2_ADDRESS_STROBE            = 0x21, // L2_ADS
	        L2_DATA_BUS_BUSY             = 0x22, // L2_DBUS_BUSY
	        L2_DATA_BUS_BUSY_READ        = 0x23, // L2_DBUS_BUSY_RD

	        // External Bus Logic (EBL)
	        EBL_DATA_READY               = 0x62, // BUS_DRDY_CLOCKS
	        EBL_LOCK                     = 0x63, // BUS_LOCK_CLOCKS
	        EBL_REQ_OUTSTANDING          = 0x60, // BUS_REQ_OUTSTANDING
	        EBL_TRANS_BURST_READ         = 0x65, // BUS_TRAN_BRD
	        EBL_TRANS_READ_OWNER         = 0x66, // BUS_TRAN_RFO
	        EBL_TRANS_WRITEBACK          = 0x67, // BUS_TRANS_WB
	        EBL_TRANS_IFETCH             = 0x68, // BUS_TRAN_IFETCH
	        EBL_TRANS_INVALIDATE         = 0x69, // BUS_TRAN_INVAL
	        EBL_TRANS_PARTIAL_WRITE      = 0x6A, // BUS_TRAN_PWR
	        EBL_TRANS_PARTIAL            = 0x6B, // BUS_TRANS_P
	        EBL_TRANS_IO                 = 0x6C, // BUS_TRANS_IO
	        EBL_TRANS_DEFERRED           = 0x6D, // BUS_TRAN_DEF
	        EBL_TRANS_BURST              = 0x6E, // BUS_TRAN_BURST
	        EBL_TRANS_ANY                = 0x70, // BUS_TRAN_ANY
	        EBL_TRANS_MEMORY             = 0x6F, // BUS_TRAN_MEM
	        EBL_DATA_RECEIVE             = 0x64, // BUS_DATA_RCV
	        EBL_DRIVE_BNR                = 0x61, // BUS_BNR_DRV
	        EBL_DRIVE_HIT                = 0x7A, // BUS_HIT_DRV
	        EBL_DRIVE_HITM               = 0x7B, // BUS_HITM_DRV
	        EBL_SNOOP_STALL              = 0x7E, // BUS_SNOOP_STALL

	        // Floating-Point Unit (FPU)
	        FPU_FLOPS_RETIRED            = 0xC1, // FLOPS,           Counter 0 only
	        FPU_FLOPS_EXECUTED           = 0x10, // FP_COMP_OPS_EXE, Counter 0 only
	        FPU_ASSIST                   = 0x11, // FP_ASSIST,       Counter 1 only
	        FPU_MUL                      = 0x12, // MUL,             Counter 1 only
	        FPU_DIV                      = 0x13, // DIV,             Counter 1 only
	        FPU_DIV_BUSY                 = 0x14, // CYCLES_DIV_BUSY, Counter 0 only

	        // Memory Ordering (MO)
	        MO_LOAD_BLOCKED              = 0x03, // LD_BLOCKS
	        MO_STORE_BUFFER_DRAIN        = 0x04, // SB_DRAINS
	        MO_MISALLIGNMENT             = 0x05, // MISALIGN_MEM_REF
	        SSE_PREFETCH_DISPATCHED      = 0x07, // EMON_KNI_PREF_DISPATCHED
	        SSE_PREFETCH_MISS            = 0x4B, // EMON_KNI_PREF_MISS

	        // Instruction Decoding and Retirement (IDR)
	        IDR_INSTRUCTION_RETIRED      = 0xC0, // INST_RETIRED
	        IDR_UOP_RETIRED              = 0xC2, // UOPS_RETIRED
	        IDR_INSTRUCTION_DECODED      = 0xD0, // INST_DECODED
	        SSE_INSTRUCTION_RETIRED      = 0xD8, // EMON_KNI_INST_RETIRED
	        SSE_COMPUTATION_RETIRED      = 0xD9, // EMON_KNI_COMP_INST_RET

	        // Interrupts (INT)
	        INT_HW_RECEIVED              = 0xC8, // HW_INT_RX
	        INT_MASKED                   = 0xC6, // CYCLES_INT_MASKED
	        INT_PENDING_AND_MASKED       = 0xC7, // CYCLES_INT_PENDING_AND_MASKED

	        // Branches (BR)
	        BR_INSTRUCTION_RETIRED       = 0xC4, // BR_INST_RETIRED
	        BR_MISSPREDICT_RETIRED       = 0xC5, // BR_MISS_PRED_RETIRED
	        BR_TAKEN_RETIRED             = 0xC9,
	        BR_MISSPREDICT_TAKEN_RETIRED = 0xCA, // BR_MISS_PRED_TAKEN_RET
	        BR_INSTRUCTION_DECODED       = 0xE0, // BR_INST_DECODED
	        BR_BTB_MISS                  = 0xE2, // BTB_MISSES
	        BR_BOGUS                     = 0xE4,
	        BR_BACLEAR                   = 0xE6, // BACLEARS

	        // Stalls (STALL)
	        STALL_RESOURCE               = 0xA2, // RESOURCE_STALLS
	        STALL_PARTIAL                = 0xD2, // PARTIAL_RAT_STALLS

	        // Multimedia Extensions (MMX)
	        MMX_INSTRUCTION_EXECUTE      = 0xB0, // MMX_INSTR_EXEC
	        MMX_SATURATING_EXECUTE       = 0xB1, // MMX_SAT_INSTR_EXEC
	        MMX_UOP_EXECUTE              = 0xB2, // MMX_UOPS_EXEC
	        MMX_TYPE_EXECUTE             = 0xB3, // MMX_INSTR_TYPE_EXEC
	        MMX_FPU_TRANSITION           = 0xCC, // FP_MMX_TRANS
	        MMX_ASSIST                   = 0xCD,
	        MMX_INSTRUCTION_RETIRED      = 0xCE, // MMX_INSTR_RET

	        // Segment Register Renaming (SRR)
	        SRR_STALL                    = 0xD4, // SEG_RENAME_STALLS
	        SRR_COUNT                    = 0xD5, // SEG_REG_RENAME
	        SRR_COUNT_RETIRED            = 0xD6, // RET_SEG_RENAMES

	        SEGMENT_REGISTER_LOADS       = 0x06, // SEGMENT_REG_LOADS
	        CPU_CLOCKS_UNHALTED          = 0x79  // CPU_CLK_UNHALTED
	    };

	    enum mask_
	    {
	        NONE                      = 0x0,

	        L2_M                      = 0x8,
	        L2_E                      = 0x4,
	        L2_S                      = 0x2,
	        L2_I                      = 0x1,
	        L2_MESI                   = 0xF,

	        EBL_SELF                  = 0x00,
	        EBL_ANY                   = 0x20,

	        SSE_PREFETCH_NTA          = 0x00,
	        SSE_PREFETCH_T1           = 0x01,
	        SSE_PREFETCH_T2           = 0x02,
	        SSE_WEAKLY_ORDERED_STORES = 0x03,

	        SSE_PACKED_AND_SCALAR     = 0x00,
	        SSE_SCALAR                = 0x01,

	        MMX_PACKED_MULTIPLY       = 0x01,
	        MMX_PACKED_SHIFT          = 0x02,
	        MMX_PACK                  = 0x04,
	        MMX_UNPACK                = 0x08,
	        MMX_PACKED_LOGICAL        = 0x10,
	        MMX_PACKED_ARITHMETIC     = 0x20,
	        MMX_ANY                   = 0x3F,

	        MMX_TO_FPU                = 0x0,
	        MMX_FROM_FPU              = 0x1,

	        SRR_ES                    = 0x1,
	        SRR_DS                    = 0x2,
	        SRR_FS                    = 0x4,
	        SRR_GS                    = 0x8,
	        SRR_ANY                   = 0xF
	    };

	    struct
	    {
	        bit event    : 8;
	        bit mask     : 8;
	        bit ring123  : 1;
	        bit ring0    : 1;
	        bit edge     : 1;
	        bit pin      : 1;
	        bit int_     : 1;
	        bit reserved : 1;
	        bit enable   : 1;
	        bit invert   : 1;
	        bit count    : 8;
	    } config;

	    p6counter (event_ event, mask_ mask = NONE, byte count = 0, bool invert = false);
	    operator const uint64 () const;
	protected:
	    ia32ring0 r0;
	};
	

Notes:

  • p6counter is a class derived from ia32counter for the performance monitoring counters on the Intel P6 Family of CPUs (Pentium Pro, II and III);
  • enum p6counter::event_ enumerates the different events this counter can be programmed to count;
  • enum p6counter::mask_ enumerates the different values for the mask field in the counter programming register;
  • struct p6counter::config represents the counter's programming register;
  • p6counter::p6counter (event_, mask_, byte, bool) initializes the hardware counter and starts it;
  • p6counter::operator uint64 () const reads the current value of the counter;
  • ia32ring0 p6counter::r0 is used for communication with the kernel-mode driver.
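
The config bit-field corresponds to a 32-bit value that is written to the counter's programming register. The sketch below packs the same fields with explicit shifts, following the field order and widths of the struct above. The names evtsel_fields and pack_evtsel are hypothetical, and which flags the real constructor sets (for example ring123, ring0 and enable) is an assumption on my part, not something this document states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain-value counterpart of p6counter::config
// (the single reserved bit is omitted and always written as 0).
struct evtsel_fields
{
    unsigned event;   // event code, e.g. 0x2E for L2_REQUEST
    unsigned mask;    // unit mask, e.g. 0x0F for L2_MESI
    bool     ring123; // count while running in rings 1..3
    bool     ring0;   // count while running in ring 0
    bool     edge;
    bool     pin;
    bool     int_;
    bool     enable;
    bool     invert;
    unsigned count;   // counter mask (the "count" field)
};

// Packs the fields into one 32-bit word, lowest bits first.
inline std::uint32_t pack_evtsel(const evtsel_fields &f)
{
    assert(f.event < 256 && f.mask < 256 && f.count < 256);
    return  static_cast<std::uint32_t>(f.event)
         | (static_cast<std::uint32_t>(f.mask)    <<  8)
         | (static_cast<std::uint32_t>(f.ring123) << 16)
         | (static_cast<std::uint32_t>(f.ring0)   << 17)
         | (static_cast<std::uint32_t>(f.edge)    << 18)
         | (static_cast<std::uint32_t>(f.pin)     << 19)
         | (static_cast<std::uint32_t>(f.int_)    << 20)
         // bit 21 is reserved and left at 0
         | (static_cast<std::uint32_t>(f.enable)  << 22)
         | (static_cast<std::uint32_t>(f.invert)  << 23)
         | (static_cast<std::uint32_t>(f.count)   << 24);
}
```

As a usage example, L2_REQUEST (0x2E) with L2_MESI (0xF), counted in all rings with the enable bit set, packs to 0x00430F2E.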

Back to Reference...

Examples

ia32detect

This example fully exercises the CPU detection features; all of the supported features are demonstrated here. The complete source code (not much) is provided below.

#include "ia32.h"

	int main ()
	{
	    ia32detect ia32;

	    printf("Vendor  = %s\n\n", ia32.vendor.c_str());
	    printf("Brand   = %s\n\n", ia32.brand.c_str());
	    printf("Version = %s\n\n", ia32.version_text().c_str());
	    printf("Cache: \n\n");

	    for (int i = 0; ia32.cache[i]; i++)
	        printf("%s\n", ((string)_ia32cache(ia32.cache[i])).c_str());

	    printf("\nFeatures:\n\n");

	    printf("%c %s\n", ia32.feature.FPU   ? '+' : '-', "Floating Point Unit On-Chip");
	    printf("%c %s\n", ia32.feature.VME   ? '+' : '-', "Virtual 8086 Mode Enhancements");
	    printf("%c %s\n", ia32.feature.DE    ? '+' : '-', "Debugging Extensions");
	    printf("%c %s\n", ia32.feature.PSE   ? '+' : '-', "Page Size Extensions");
	    printf("%c %s\n", ia32.feature.TSC   ? '+' : '-', "Time Stamp Counter");
	    printf("%c %s\n", ia32.feature.MSR   ? '+' : '-', "Model Specific Registers");
	    printf("%c %s\n", ia32.feature.PAE   ? '+' : '-', "Physical Address Extension");
	    printf("%c %s\n", ia32.feature.MCE   ? '+' : '-', "Machine Check Exception");
	    printf("%c %s\n", ia32.feature.CX8   ? '+' : '-', "CMPXCHG8 Instruction");
	    printf("%c %s\n", ia32.feature.APIC  ? '+' : '-', "APIC On-Chip");
	    printf("%c %s\n", ia32.feature.SEP   ? '+' : '-', "SYSENTER and SYSEXIT instructions");
	    printf("%c %s\n", ia32.feature.MTRR  ? '+' : '-', "Memory Type Range Registers");
	    printf("%c %s\n", ia32.feature.PGE   ? '+' : '-', "PTE Global Bit");
	    printf("%c %s\n", ia32.feature.MCA   ? '+' : '-', "Machine Check Architecture");
	    printf("%c %s\n", ia32.feature.CMOV  ? '+' : '-', "Conditional Move Instructions");
	    printf("%c %s\n", ia32.feature.PAT   ? '+' : '-', "Page Attribute Table");
	    printf("%c %s\n", ia32.feature.PSE36 ? '+' : '-', "32-bit Page Size Extension");
	    printf("%c %s\n", ia32.feature.PSN   ? '+' : '-', "Processor Serial Number");
	    printf("%c %s\n", ia32.feature.CLFSH ? '+' : '-', "CLFLUSH Instruction");
	    printf("%c %s\n", ia32.feature.DS    ? '+' : '-', "Debug Store");
	    printf("%c %s\n", ia32.feature.ACPI  ? '+' : '-', "Thermal Monitor and Software Controlled Clock Facilities");
	    printf("%c %s\n", ia32.feature.MMX   ? '+' : '-', "Intel MMX Technology");
	    printf("%c %s\n", ia32.feature.FXSR  ? '+' : '-', "FXSAVE and FXRSTOR Instructions");
	    printf("%c %s\n", ia32.feature.SSE   ? '+' : '-', "Intel SSE Technology");
	    printf("%c %s\n", ia32.feature.SSE2  ? '+' : '-', "Intel SSE2 Technology");
	    printf("%c %s\n", ia32.feature.SS    ? '+' : '-', "Self Snoop");
	    printf("%c %s\n", ia32.feature.TM    ? '+' : '-', "Thermal Monitor");
	}
	

Below is the output from my laptop. If you decide to install the package, please run this small program and e-mail me the results.

Vendor  = GenuineIntel

	Brand   = Intel(R) Pentium(R) III Mobile CPU      1000MHz

	Version = 6.11.1 Intel OEM Processor XVersion(0.0)

	Cache: 

	0x01: TLB instruction, Entries( 32), PageSize(4KB), Associativity(4-way)
	0x02: TLB instruction, Entries(  2), PageSize(4MB), Associativity( Full)
	0x03: TLB        data, Entries( 64), PageSize(4KB), Associativity(4-way)
	0x04: TLB        data, Entries(  8), PageSize(4MB), Associativity(4-way)
	0x08: L1 instruction$, Size(  16KB), Block(  32 B), Associativity(4-way)
	0x0c: L1        data$, Size(  16KB), Block(  32 B), Associativity(4-way)
	0x83: L2     unified$, Size( 512KB), Block(  32 B), Associativity(8-way)

	Features:

	+ Floating Point Unit On-Chip
	+ Virtual 8086 Mode Enhancements
	+ Debugging Extensions
	+ Page Size Extensions
	+ Time Stamp Counter
	+ Model Specific Registers
	+ Physical Address Extension
	+ Machine Check Exception
	+ CMPXCHG8 Instruction
	- APIC On-Chip
	+ SYSENTER and SYSEXIT instructions
	+ Memory Type Range Registers
	+ PTE Global Bit
	+ Machine Check Architecture
	+ Conditional Move Instructions
	+ Page Attribute Table
	+ 32-bit Page Size Extension
	- Processor Serial Number
	- CLFLUSH Instruction
	- Debug Store
	- Thermal Monitor and Software Controlled Clock Facilities
	+ Intel MMX Technology
	+ FXSAVE and FXRSTOR Instructions
	+ Intel SSE Technology
	- Intel SSE2 Technology
	- Self Snoop
	- Thermal Monitor
	

ia32p6

This example demonstrates the usage of the Intel P6 Hardware Performance Monitoring Counters. Processors from this family have two almost identical counters. In the source below, one of them is set up to count memory references and the other to count requests to the L2 cache (which are nothing else but L1 misses!).

#include "ia32.h"
	#include "p6counter.h"

	int main ()
	{
	    p6counter c1(p6counter::L2_REQUEST, p6counter::L2_MESI);
	    p6counter c2(p6counter::DCU_MEMORY_REFERENCE);

	    const int c = 10000000;
	    static int a[c];

	    for (int ai1 = 0; ai1 < c; ai1++)
	        a[ai1]++;

	    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

	    uint64 t1 = c1;
	    uint64 t2 = c2;

	    for (int ai2 = 0; ai2 < c; ai2++)
	        a[ai2] *= 13;

	    printf("L1 misses   = %I64d\nL1 accesses = %I64d\n", c1 - t1, c2 - t2);
	}
	

We walk an array of 10000000 integers, multiplying each element by 13 (a load access followed by a store access, i.e. 2 accesses per element). Because the L1 line size is 32 bytes and an integer takes 4 bytes, there are 8 elements per line, so the walk touches 1250000 cache lines (all misses). This totals 20000000 memory accesses and 1250000 L1 misses. The excess of 1495 misses and 10728 memory accesses in the results below is due to OS noise, the amount of which (<<1%) is quite acceptable.

L1 misses   = 1251495
	L1 accesses = 20010728
	

The code of the example employs many techniques to reduce the noise during measurements. Here are the most important things you need to keep in mind when monitoring performance in this setting:

  1. Microsoft Windows NT / 2K / XP does not allocate all the memory your process requested immediately after the request. Rather, pages are allocated when they are first accessed. This means that when you access a memory page for the first time, a page fault occurs and the OS takes over. The OS exception handler can execute millions of instructions, resulting in excessive noise in the measurements. For this reason the code above walks the array in advance to make sure all pages are present in memory when the counting starts.
  2. Because Microsoft Windows NT / 2K / XP is a preemptive multitasking operating system, our program is not the only thing running on the machine. Performance counters are in the CPU and they count for all processes simultaneously. In order to reduce foreign code noise, it is advisable to boost the priority of your process to the maximum level (real-time priority). This setting will reserve the machine almost exclusively for your application, and the overall responsiveness might seem jerky until the program terminates. The code above achieves the priority boost with the SetPriorityClass Windows system call: SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
  3. Last but not least, make sure you avoid obvious counting overlaps. An example would be to split the final printf statement in the example above into two separate function calls. Note that the current counter value is read when the '-' sign is evaluated. Thus if you print the delta of the first counter (cache misses in this case) in a separate call to printf, the second counter (memory references in this case) will also count the data accesses performed during that call.

Future Work

Although this document seems quite long, it is more of a draft than something completed.

There are many (orthogonal) directions in which this work can be extended.

The first priority is of course implementing ia32counter subclasses (like p6counter) for other processor families, like Intel Pentium 4, Intel Itanium and different models of AMD. I believe it is important to understand the specifics of Intel P4, as it is the first processor ever to provide precise event-based sampling performance monitoring. What this means is that one can get the processor state when an event (e.g. cache miss) occurs, so the exact instruction causing the miss is known. This can further improve the precision of research methods in this area.

Another direction is to extend the CPU detection procedure with empirical measurements that can detect the memory hierarchy in conventional software (a la HW1 cs612). As processors become more and more sophisticated from a hardware point of view, this task becomes harder and harder, but I believe it is still doable. This is a very important step if we want to build compilers that dynamically tune themselves to the current CPU (possibly a CPU that did not exist when the compiler was released!)

Last, I am not sure how important this is, but this document is way too long and needs better structure and probably some factoring. If the library grows bigger, better documentation will be needed, or it will become yet another one of those public domain things for which you need to read all the headers before you can start using it. I said this before, and I will repeat it again: If you ever plan to use this thing, please, please give feedback. Contributions are also more than welcome, but if you have an idea, I would suggest coordinating it with me, as there is a good chance it is already under way...

So far I am not worried if this piece of software is useful or not. For sure it is useful for me. I bet it would also be useful for cs612... I hope it is useful for you too. Good luck!

References

  1. Intel IA-32 Developer Manuals v.1 - 3, http://www.intel.com
  2. http://www.sandpile.org