Steam GCF File Format

Last update: 23 January 2007

Added info on the fragmentation map, which Ryan used to add defragmentation to the latest GCFScape. Proper defragmentation of .gcf files gives a significant and noticeable performance gain (for example, tests show that load times for .gcf files that are on average 8% fragmented speed up by at least 1.25 times after defragmentation).

Last update: 05 September 2004

Added link to Ryan's HLLib library. See the "sample code" section.

Last update: 01 July 2004

Ryan received some info from Timo Scripf that the last change in the GCF format has made the version number quite apparent in the GCF. Version 5 of the GCF format uses the GCF Block Map Header but version 6 doesn't. Ryan has provided updated source files to reflect these changes.

Last update: 26 June 2004

Ryan has updated the format and code to reflect the changes in Steam from the 21st June 2004. I've also reverted to Ryan's code notation format for the sake of backwards compatibility with original programs based on his code.

Last update: 06 June 2004

Ryan and I have received some additional information on the file format from "Addict". He's managed to figure out what some of the unknown sections are. The format specifications have been updated and the sample C++ files have had the relevant code added.

Introduction

Not so long ago Valve rolled out their new Steam content delivery and game network service (for want of a better description) which is destined to replace their old WON network. Apart from a GUI and all the other gubbins that comes with it, Valve have opted to place all of the files for each game into a game cache file which was unpopular with some who mod Half-Life based games. See, the problem was they locked all the files in the cache, but never provided a tool to get them out...

There is one GCF extractor currently floating around called Steam Dump It which interfaces to the filesystem API in steam.dll, but it's a bit slow and extracts everything, so you can't select individual files. Also, there's a bit of controversy surrounding it as it's rumoured it was made by OGC, who are a source of many of the cheats found in Half-Life games. There is also the issue of how they figured out the API calls without using the leaked Half-Life 2 source code; however, I'm not touching that argument with a ten foot barge pole.

Anyhew, I've been taking a somewhat old fashioned approach and have had a go at cracking the GCF format itself in its raw form so that it should be possible to write some sort of extractor or utility without any dependency on Steam. How I've been cracking the format is somewhat longwinded: I've been using a hex editor and a calculator and looking for patterns and values that match certain known aspects of the file and its contents. It sounds pretty nuts but it works for me.

Ryan Gregg of Nem's Tools has also been working on the GCF format in the same way and we've decided to combine our findings and publish them here for the good of the Half-Life modding community. Ryan cracked a significant amount of the file format and is currently developing a stand-alone GCF extractor called GCFScape which, although in beta, does allow you to extract individual files from the GCF.

Ryan is currently developing GCFScape and other code/utilities for the GCF format in .NET and I am handling documentation.

So, below is what I know about the GCF file format so far. You're welcome to use it, and if you manage to expand on it and make any further progress I'd appreciate it if you could let me know of your findings. Remember, this is a work in progress.

Basic overview of the GCF Format

Right, first things first, what is the GCF format? Well, it's a virtual file system, just like a disk file system except that the entire thing is stored in one big file. Based on this, it conforms to the basic needs of a file system and works in a similar way. At least, this is the assumption we're making.

Anyway, in short, the GCF file has a header followed by the actual data. Data is divided up into 8 KB blocks (8192 bytes) and a file can occupy more than one block. These blocks do not have to be sequential.

Sample Code

Ryan has recently released a C++ package library called HLLib which abstracts several Half-Life package formats and provides a simple interface for them.

HLLib is an open source library licensed under the LGPL license. It comes with the source code and binaries necessary to use it. An example application called HLExtract is also included. HLExtract is a command line utility that can load all HLLib supported packages and extract multiple items from them while maintaining their directory structure. The entire application is under 150 lines (with about 20 lines that do all the work) and shows just how easy HLLib is to use.

Thomas Kaiser has contributed the following code which can be used to check the GCF file checksums:

"I figured out the GCF checksum algorithm and wrote an implementation in C. Note that the adler32 and crc32 functions used are from the zlib source code. "

#include <stdio.h>
#include <zlib.h>	// adler32() and crc32()

// actual checksum algorithm
unsigned long gcf_checksum ( unsigned long checksum,
                             unsigned char * buffer,
                             unsigned long buffersize )
{
	return adler32(checksum, buffer, buffersize) ^ crc32(checksum, buffer, buffersize);
}

#define CHUNK_SIZE 32768
int main( int argc, char * argv[] )
{
    if( argc >= 2 )
    {
        int i;
        long file_size;
        unsigned char file_chunk[CHUNK_SIZE];
        FILE * file = NULL;
        file = fopen( argv[1], "rb" );
        if( file == NULL )
        {
            printf( "File could not be opened.\r\n" );
            goto END;
        }
        // get file size
        fseek( file, 0, SEEK_END );
        file_size = ftell(file);
        fseek( file, 0, SEEK_SET );

        // calculate checksums for every 32768 bytes
        for( i = 0; i < file_size / CHUNK_SIZE; i++ )
        {
            // stop if a full chunk could not be read
            if( fread( file_chunk, CHUNK_SIZE, 1, file ) != 1 )
                break;
            printf( "Checksum: 0x%08lX\r\n", gcf_checksum( 0, file_chunk, CHUNK_SIZE ) );
        }
        // checksum any straggler bytes at the end
        file_size = file_size % CHUNK_SIZE;
        if( file_size > 0 && fread( file_chunk, 1, file_size, file ) == (size_t)file_size )
        {
            printf( "Checksum: 0x%08lX\r\n", gcf_checksum( 0, file_chunk, file_size ) );
        }

        // finally, close the handle
        fclose( file );
    }

END:
#ifdef _DEBUG
    printf( "\r\nPress any key to continue..." );
    getchar();
    printf( "\r\n" );
#endif
    return 0;
} 
			

If you have written your own code based on this for other development languages and would like to contribute it, please feel free to send it to either Ryan or myself.

GCF structure and layout

Based on Ryan's and Addict's work so far, the GCF file seems to be laid out in the following manner:

  • GCF File Header
  • Blocks
  • Fragmentation Map
  • Block Entry Usage Map
  • Directory
  • Directory Map
  • Checksums
  • Data Blocks

The directory structure appears as follows:

  • GCFDirHeader - GCF directory header
  • GCFDirEntry - GCF directory entries
  • GCF directory names
  • GCFDirInfo1Entry - GCF directory info 1
  • GCFDirInfo2Entry - GCF directory info 2
  • GCFDirCopyEntry - GCF directory copy entries
  • GCFDirLocalEntry - GCF directory local entries

Most values inside the GCF are stored as DWORD 32-bit (4 byte) values.

GCF File Header

This is at the start of every GCF file and seems to be constant in its format throughout them:

//GCF Header
typedef struct tagGCFHEADER
{
	DWORD Dummy0;		// Always 0x00000001
	DWORD Dummy1;		// Always 0x00000001
	DWORD Dummy2;		// Always 0x00000005
	DWORD CacheID;
	DWORD GCFVersion;
	DWORD Dummy3;
	DWORD Dummy4;
	DWORD FileSize;		// Total size of GCF file in bytes.
	DWORD BlockSize;	// Size of each data block in bytes.
	DWORD BlockCount;	// Number of data blocks.
	DWORD Dummy5;
} GCFHEADER, *LPGCFHEADER;
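
As a rough illustration of reading this header from raw bytes, here is a minimal C sketch. It assumes the DWORDs are stored little-endian (not stated on this page, but what you would expect from a Windows x86 format) and that a DWORD is an unsigned 32-bit integer. The names gcf_read_dword and gcf_parse_header are hypothetical helpers made up for this example. Only Dummy0 and Dummy1 are checked against their documented constants; that is a choice of this sketch, not a known rule of the format.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFHEADER
{
    DWORD Dummy0, Dummy1, Dummy2, CacheID, GCFVersion;
    DWORD Dummy3, Dummy4, FileSize, BlockSize, BlockCount, Dummy5;
} GCFHEADER;

/* Read one DWORD from raw bytes, assuming little-endian storage. */
DWORD gcf_read_dword( const unsigned char * p )
{
    return (DWORD)p[0] | ((DWORD)p[1] << 8) |
           ((DWORD)p[2] << 16) | ((DWORD)p[3] << 24);
}

/* Decode the 11 header DWORDs from buf.
   Returns 0 on success, -1 if the buffer is too short or
   Dummy0/Dummy1 are not 0x00000001. */
int gcf_parse_header( const unsigned char * buf, size_t len, GCFHEADER * out )
{
    DWORD v[11];
    size_t i;

    if( len < sizeof v )
        return -1;
    for( i = 0; i < 11; i++ )
        v[i] = gcf_read_dword( buf + 4 * i );
    if( v[0] != 1 || v[1] != 1 )
        return -1;

    out->Dummy0 = v[0];     out->Dummy1 = v[1];    out->Dummy2 = v[2];
    out->CacheID = v[3];    out->GCFVersion = v[4];
    out->Dummy3 = v[5];     out->Dummy4 = v[6];
    out->FileSize = v[7];   out->BlockSize = v[8]; out->BlockCount = v[9];
    out->Dummy5 = v[10];
    return 0;
}
```

Decoding field by field rather than casting the buffer to the structure avoids any dependence on the host byte order or structure padding.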
			

Notes:

1. CacheID numbers seem to be confirmed by values in the registry. The ClientGameInfo.vcf file lists each game installed in Steam as having a number, for example Day of Defeat is number "30". Within this definition is a Primary Cache ID. In the case of DoD this is number 31 which matches this value in the header of Day of Defeat.gcf.

Also, in the Windows registry under Valve\Steam\Apps\30 are a number of folders with numbers relating to cache IDs. I assume these are dependencies for the game, i.e. in DoD's case it requires 0, 1, 2, 3 and 31. Matching these numbers against the cache IDs in the GCF files gives us:

0 = half-life engine.gcf
1 = half-life.gcf
2 = half-life localized.gcf
3 = platform.gcf
31 = day of defeat.gcf

It seems logical that DoD would depend on these base files.

2. Each GCF file also has a version number. Again, looking in the Windows registry, where cache IDs are listed there is also a key called "LastVersionPlayed". For each GCF the value in the registry matches this value. These version numbers also seem to match the version numbers displayed when updating the standalone HLDS.

3. The BlockSize value currently always seems to be set to 8192 (8k). File data is divided up and stored in 8k blocks. A file smaller than 8k will be padded at the end with zeros before the next file starts. Those that are larger than 8k are split over several 8k blocks with any last, incomplete block being padded out at the end.
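
To put numbers on note 3, here is a small sketch of the block arithmetic. The function names are made up for this example and DWORD is assumed to be an unsigned 32-bit integer.

```c
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

/* Number of data blocks needed to hold file_size bytes.
   block_size must not be zero. */
DWORD gcf_blocks_needed( DWORD file_size, DWORD block_size )
{
    return file_size / block_size + ( file_size % block_size != 0 );
}

/* Number of zero padding bytes at the end of the last block. */
DWORD gcf_padding_bytes( DWORD file_size, DWORD block_size )
{
    DWORD rem = file_size % block_size;
    return rem == 0 ? 0 : block_size - rem;
}
```

For example, with BlockSize = 8192 a 20000 byte file would need 3 blocks, and the last block would end with 4576 bytes of padding.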

Blocks

The Block Entry section stores information for piecing together chunks of files in a GCF file. It contains a map or table for piecing multiple chunks together back into a file along with information to find and defrag each block in the chunk. It is the core of all file reconstruction.

There isn't really any link between it and the Fragmentation Map other than that the FirstDataBlockIndex is used in the Fragmentation Map when you go to extract a file (or should I say chunk of a file). There is no reason to require more than one Block Entry to specify a file other than maybe a performance issue when updating existing files, which Steam was designed for.

The entries list is preceded by a short header:

//GCF Block Header
typedef struct tagGCFBLOCKENTRYHEADER
{
	DWORD BlockCount;	// Number of data blocks.
	DWORD BlocksUsed;	// Number of data blocks that point to data.
	DWORD Dummy0;
	DWORD Dummy1;
	DWORD Dummy2;
	DWORD Dummy3;
	DWORD Dummy4;
	DWORD Checksum;		// Header checksum.
} GCFBLOCKENTRYHEADER, *LPGCFBLOCKENTRYHEADER;
			

Notes:

1. The checksum is simply the sum total of all the preceding DWORDs in the header.
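
A minimal sketch of that header checksum follows. It treats a header as an array of DWORDs whose last element is the checksum, and it assumes the sum simply wraps around at 32 bits (this page does not say what happens on overflow). The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

/* Sum of count DWORDs, wrapping modulo 2^32. */
DWORD gcf_header_sum( const DWORD * fields, size_t count )
{
    DWORD sum = 0;
    size_t i;
    for( i = 0; i < count; i++ )
        sum += fields[i];
    return sum;
}

/* Returns 1 if the last DWORD of the header equals the sum of
   all the DWORDs before it, 0 otherwise. */
int gcf_header_checksum_ok( const DWORD * header, size_t dword_count )
{
    if( dword_count < 2 )
        return 0;
    return gcf_header_sum( header, dword_count - 1 ) == header[dword_count - 1];
}
```

For the GCFBLOCKENTRYHEADER above, dword_count would be 8 (seven fields plus the checksum).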

The block entries then follow. There are as many block entries as there are blocks in the file, regardless of whether they contain any usable data.

//GCF Block Entry
typedef struct tagGCFBLOCKENTRY
{
	DWORD EntryType;		// Flags for the block entry.  0x200F0000 == Not used.
	DWORD FileDataOffset;		// The offset for the data contained in this block entry in the file.
	DWORD FileDataSize;		// The length of the data in this block entry.
	DWORD FirstDataBlockIndex;	// The index to the first data block of this block entry's data.
	DWORD NextBlockEntryIndex;	// The next block entry in the series.  (N/A if == BlockCount.)
	DWORD PreviousBlockEntryIndex;	// The previous block entry in the series.  (N/A if == BlockCount.)
	DWORD DirectoryIndex;		// The index of the block entry in the directory.
} GCFBLOCKENTRY, *LPGCFBLOCKENTRY;
			

Notes:

1. EntryType can be one of three values depending on if the block is used or not. Values are:

  • 0x200F8000 - Block contains data.
  • 0x200F0000 - Block contains no data or is unused
  • 0x200FC000 - Block contains data (read only?)

As a general rule, the value 0x200F0000 means it contains no data, anything else does (I don't know what the difference between 0x200F8000 and 0x200FC000 is; you should note that these are the only other observed values so far though).

2. FileDataOffset defines at which offset in the extracted file this block of data is located.

3. FileDataSize defines how many bytes of data this block contains.

4. NextBlockEntryIndex and PreviousBlockEntryIndex are used to piece together block entries as a file can be made of multiple blocks. The values for either will equal BlockCount if there is no previous or next value.

Ryan gives an example of using this below:

"So let's say we want to extract directory item 1:

We would scan the Block Entries looking for a block entry with a corresponding Directory Index. The Block Entry we found could be in the middle of the file for all we know (as shown in the above example); to find the first one we would use the Previous Entry index.

The above code would give us the first block entry in the file."

5. Ryan explains the DataOffset and DataLength (FileDataOffset and FileDataSize in the structure above) as follows:

"The DataOffset and DataLength are for the file you are extracting to, not values within the GCF files. GCF files are split into 8 KB blocks; you need the defragmentation map to defrag them. DataLength is the amount of data you can extract using the defragmentation map and starting from the FirstBlock (Ceiling(DataLength / BlockSize) blocks I believe.)"

6. DirectoryIndex refers to the index in the directory that the BlockEntry contains data for.
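
The code Ryan refers to in note 4 is not reproduced on this page. As a stand-in, here is a minimal sketch of the walk he describes: find any used block entry with the wanted DirectoryIndex, then follow PreviousBlockEntryIndex until it says there is no previous entry. The function name, the step limit that guards against a corrupt looping chain, and the choice to treat any index >= BlockCount as "none" are my own assumptions.

```c
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFBLOCKENTRY
{
    DWORD EntryType, FileDataOffset, FileDataSize, FirstDataBlockIndex;
    DWORD NextBlockEntryIndex, PreviousBlockEntryIndex, DirectoryIndex;
} GCFBLOCKENTRY;

#define GCF_ENTRY_UNUSED 0x200F0000u

/* Returns the index of the first block entry in the series for
   directory_index, or block_count if there is none (or the chain
   of previous entries never terminates). */
DWORD gcf_first_block_entry( const GCFBLOCKENTRY * entries,
                             DWORD block_count,
                             DWORD directory_index )
{
    DWORD i, steps;

    /* find any used entry that belongs to the directory item */
    for( i = 0; i < block_count; i++ )
        if( entries[i].EntryType != GCF_ENTRY_UNUSED &&
            entries[i].DirectoryIndex == directory_index )
            break;
    if( i == block_count )
        return block_count;

    /* walk backwards to the start of the series */
    for( steps = 0; steps < block_count; steps++ )
    {
        DWORD prev = entries[i].PreviousBlockEntryIndex;
        if( prev >= block_count )
            return i;
        i = prev;
    }
    return block_count;
}
```

From the returned entry you would then follow NextBlockEntryIndex forwards, writing each entry's data at its FileDataOffset in the extracted file.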

Fragmentation Map

As far as I'm aware, the fragmentation map dictates how the blocks that make up the file are located inside the GCF file, whereas the Block Entries dictate where data lies in the extracted file. The header contains a simple block count and a checksum calculated in the same way as with the block entry header.

//GCF Fragmentation Map Header
typedef struct tagGCFFRAGMAPHEADER
{
	DWORD BlockCount;	// Number of data blocks.
	DWORD Dummy0;	// index of 1st unused GCFFRAGMAP entry?
	DWORD Dummy1;
	DWORD Checksum;		// Header checksum.
} GCFFRAGMAPHEADER, *LPGCFFRAGMAPHEADER;
			

What follows the header is a simple list of DWORD's for the fragmentation map.

//GCF Fragmentation Map Entry
typedef struct tagGCFFRAGMAP
{
	DWORD NextDataBlockIndex;	// The index of the next data block.
} GCFFRAGMAP, *LPGCFFRAGMAP;
			

Notes:

1. NextDataBlockIndex is the index of the next data block indexed into by the index of the first data block (from the block entries.) This defrags the files.

2. To extract a block entry you would write the First Block then use the index of the First Block in the fragmentation map to get the index of the NextDataBlockIndex. NextDataBlockIndex is BlockCount when you are done.

Ryan explains a little more: "The defragmentation map is quite simple to use. Let's say you had a file that was composed of three blocks. The index of the first data block would be the value of the FirstBlock field in the Block Entries. The index of the second data block would be the value contained at the index of the first data block in the Defragmentation Map. The index of the third data block would be the value contained at the index of the second data block in the Defragmentation Map. And lastly you should find that the value at the index of the third data block in the defragmentation map is BlockCount."

3. Dummy0 appears to be the index of the first unused GCFFRAGMAP entry, if all the entries are unused then it appears to be 0 (though sometimes it is any other valid index, not sure why). Not 100% about this one.

4. Dummy1 takes the values 0 or 1. In the case of 0, the value 0x0000ffff for NextDataBlockIndex in GCFFRAGMAP means there is no next index. In the case of 1, the value 0xffffffff for NextDataBlockIndex in GCFFRAGMAP means there is no next index. These values are also consistent with the size of the .gcf (if the .gcf has more than 65534 data blocks it needs the latter). These special values are needed because a value of BlockCount for NextDataBlockIndex means the entry is unused.
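
Putting notes 1, 2 and 4 together, the following sketch collects the data block indices for one block entry by following the fragmentation map. Since the notes mention more than one end marker (BlockCount, 0x0000ffff, 0xffffffff), the sketch sidesteps the question: it works out how many blocks are needed from the data size and treats any index >= BlockCount as the end of the chain. That is my own simplification and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

/* Follows the fragmentation map from first_block and stores the
   data block indices holding data_size bytes into out[].
   Returns the number of blocks stored, or -1 if the chain ends
   early, out[] is too small, or block_size is zero. */
long gcf_collect_blocks( const DWORD * frag_map, DWORD block_count,
                         DWORD block_size, DWORD first_block,
                         DWORD data_size, DWORD * out, size_t out_cap )
{
    DWORD needed, index, n;

    if( block_size == 0 )
        return -1;
    needed = data_size / block_size + ( data_size % block_size != 0 );
    if( needed > out_cap )
        return -1;

    index = first_block;
    for( n = 0; n < needed; n++ )
    {
        if( index >= block_count )  /* end marker or unused entry */
            return -1;
        out[n] = index;
        index = frag_map[index];    /* next data block of this entry */
    }
    return (long)n;
}
```

Here first_block would be the FirstDataBlockIndex and data_size the FileDataSize of the block entry being extracted.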

Block Entry Usage Map

This section only exists in version 5 and lower format GCF files and was removed in the June 21, 2004 Steam update.

This section allows you to navigate the Block Entries in a more efficient manner as it only takes you through Block Entries that are used.

//GCF Block Map Header
typedef struct tagGCFBLOCKENTRYMAPHEADER
{
	DWORD BlockCount;		// Number of data blocks.	
	DWORD FirstBlockEntryIndex;	// Index of the first block entry.
	DWORD LastBlockEntryIndex;	// Index of the last block entry.
	DWORD Dummy0;
	DWORD Checksum;			// Header checksum.
} GCFBLOCKENTRYMAPHEADER, *LPGCFBLOCKENTRYMAPHEADER;
			
//GCF Block Map Entry
typedef struct tagGCFBLOCKENTRYMAP
{
	DWORD PreviousBlockEntryIndex;	// The previous block entry.  (N/A if == BlockCount.)
	DWORD NextBlockEntryIndex;	// The next block entry.  (N/A if == BlockCount.)
} GCFBLOCKENTRYMAP, *LPGCFBLOCKENTRYMAP;
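
As an illustration of how this map might be walked in a version 5 file, the sketch below starts at FirstBlockEntryIndex and follows NextBlockEntryIndex until it reaches a value that is not a valid index. The function name is made up, and the assumption that the map entries are indexed the same way as the block entries is mine.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFBLOCKENTRYMAP
{
    DWORD PreviousBlockEntryIndex;
    DWORD NextBlockEntryIndex;
} GCFBLOCKENTRYMAP;

/* Stores the indices of the used block entries, in map order,
   into out[]. Returns the number of indices stored. The walk
   stops at the first index >= block_count, when out[] is full,
   or after block_count steps (guard against a looping chain). */
DWORD gcf_list_used_entries( const GCFBLOCKENTRYMAP * map,
                             DWORD block_count, DWORD first_index,
                             DWORD * out, size_t out_cap )
{
    DWORD n = 0;
    DWORD i = first_index;

    while( i < block_count && n < out_cap && n < block_count )
    {
        out[n++] = i;
        i = map[i].NextBlockEntryIndex;
    }
    return n;
}
```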
			

Directory

The directory defines the actual file hierarchy and layout of the files once extracted from the GCF.

//GCF Directory Header
typedef struct tagGCFDIRECTORYHEADER
{
	DWORD Dummy0;		// Always 0x00000004
	DWORD CacheID;		// Cache ID.
	DWORD GCFVersion;	// GCF file version.
	DWORD ItemCount;	// Number of items in the directory.	
	DWORD FileCount;	// Number of files in the directory.
	DWORD Dummy1;		// Always 0x00008000
	DWORD DirectorySize;	// Size of lpGCFDirectoryEntries & lpGCFDirectoryNames & lpGCFDirectoryInfo1Entries & lpGCFDirectoryInfo2Entries & lpGCFDirectoryCopyEntries & lpGCFDirectoryLocalEntries in bytes.
	DWORD NameSize;		// Size of the directory names in bytes.
	DWORD Info1Count;	// Number of Info1 entries.
	DWORD CopyCount;	// Number of files to copy.
	DWORD LocalCount;	// Number of files to keep local.
	DWORD Dummy2;
	DWORD Dummy3;
	DWORD Checksum;		// Header checksum.
} GCFDIRECTORYHEADER, *LPGCFDIRECTORYHEADER;

//GCF Directory Entry
typedef struct tagGCFDIRECTORYENTRY
{
	DWORD NameOffset;	// Offset to the directory item name from the end of the directory items.
	DWORD ItemSize;		// Size of the item.  (If file, file size.  If folder, num items.)
	DWORD ChecksumIndex;	// Checksum index. (0xFFFFFFFF == None).
	DWORD DirectoryType;	// Flags for the directory item.  (0x00000000 == Folder).
	DWORD ParentIndex;	// Index of the parent directory item.  (0xFFFFFFFF == None).
	DWORD NextIndex;	// Index of the next directory item.  (0x00000000 == None).
	DWORD FirstIndex;	// Index of the first directory item.  (0x00000000 == None).
} GCFDIRECTORYENTRY, *LPGCFDIRECTORYENTRY;

//GCF Directory Info 1 Entry
typedef struct tagGCFDIRECTORYINFO1ENTRY
{
	DWORD Dummy0;
} GCFDIRECTORYINFO1ENTRY, *LPGCFDIRECTORYINFO1ENTRY;

//GCF Directory Info 2 Entry
typedef struct tagGCFDIRECTORYINFO2ENTRY
{
	DWORD Dummy0;
} GCFDIRECTORYINFO2ENTRY, *LPGCFDIRECTORYINFO2ENTRY;

//GCF Directory Copy Entry
typedef struct tagGCFDIRECTORYCOPYENTRY
{
	DWORD DirectoryIndex;	// Index of the directory item.
} GCFDIRECTORYCOPYENTRY, *LPGCFDIRECTORYCOPYENTRY;

//GCF Directory Local Entry
typedef struct tagGCFDIRECTORYLOCALENTRY
{
	DWORD DirectoryIndex;	// Index of the directory item.
} GCFDIRECTORYLOCALENTRY, *LPGCFDIRECTORYLOCALENTRY;
			

DirectoryHeader Notes:

1. DirSize (DirectorySize in the structure above) contains the total directory size in bytes from the start of GCFDirHeader. Thus GCFDirHeader + GCFDirHeader.DirSize = GCFDirMapHeader.

2. NameSize contains the total length of the buffer containing the directory item names.

3. The checksum does not compute for the header (maybe due to overflow)

DirectoryEntry Notes:

1. CheckIndex (ChecksumIndex in the structure above) contains an index in the CheckMapEntries. The value is 0xFFFFFFFF in the case of a folder, simply because a folder doesn't have a checksum.

2. DirectoryType. Addict: "I've observed several values here. It looks to me as if it is some sort of bit mask which is composed of the following components, however I have seen other values."

  • 0x00004000 - Item is a file
  • 0x0000000A - A local copy of the file is to be made in the SteamApps\<steamaccount>\<game> directory
  • 0x00000040 or 0x00000001 - local copies have priority over cache copies and are not to be overwritten

As a general rule the value 0x00000000 is a folder, anything else is a file (I don't know what the difference between 0x00004000 and 0x0000400A is; again you should note that these are the only other observed values so far though).

3. ParentIndex is 0xFFFFFFFF when there is none.

4. NextIndex is 0x00000000 when there is none.

5. FirstIndex is 0x00000000 when there is none.

6. Directory/File Names - The directory and file name list is just a list of names in plaintext format, each terminated by a NULL (0x00) character. The first character in the list is always NULL, which indicates "root".

7. GCFDirInfo1Entry - Contains Info1Count entries with unknown data.

8. GCFDirInfo2Entry - Contains ItemCount entries with unknown data.

9. GCFDirCopyEntry - Contains a list of CopyCount entries which denotes that a local copy of the file is to be made in the SteamApps\<steamaccount>\<game> directory.

10. GCFDirLocalEntry - Contains a list of LocalCount entries which denote local file copies that have priority over cache copies and are not to be overwritten.
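
To show how NameOffset, ParentIndex and the name list fit together, here is a sketch that builds the full path of a directory item by walking up its parents to the root. The function name, the '/' separator and the depth limit are my own choices, not something taken from the format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFDIRECTORYENTRY
{
    DWORD NameOffset, ItemSize, ChecksumIndex, DirectoryType;
    DWORD ParentIndex, NextIndex, FirstIndex;
} GCFDIRECTORYENTRY;

#define GCF_MAX_DEPTH 64 /* arbitrary limit for this sketch */

/* Writes the path of item index (e.g. "valve/maps/c1a0.bsp")
   into out. Returns 0 on success, -1 on bad input. */
int gcf_item_path( const GCFDIRECTORYENTRY * entries, DWORD item_count,
                   const char * names, DWORD name_size,
                   DWORD index, char * out, size_t out_cap )
{
    DWORD chain[GCF_MAX_DEPTH];
    size_t depth = 0, used = 0;

    /* collect the item and its parents, up to the root */
    while( index != 0xFFFFFFFFu )
    {
        if( index >= item_count || depth == GCF_MAX_DEPTH )
            return -1;
        chain[depth++] = index;
        index = entries[index].ParentIndex;
    }
    if( out_cap == 0 )
        return -1;
    out[0] = '\0';

    /* append the names from the root downwards */
    while( depth > 0 )
    {
        DWORD off = entries[chain[--depth]].NameOffset;
        size_t len;
        if( off >= name_size ||
            memchr( names + off, '\0', name_size - off ) == NULL )
            return -1;              /* name runs off the buffer */
        len = strlen( names + off );
        if( len == 0 )
            continue;               /* the root has an empty name */
        if( used + len + 2 > out_cap )
            return -1;
        if( used > 0 )
            out[used++] = '/';
        memcpy( out + used, names + off, len );
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

Listing the contents of a folder works the other way round: start at the folder's FirstIndex and follow NextIndex until it is 0x00000000.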

Directory Map

The Directory Map contains a mapping between the items in the Directory section and the data in the Blocks section. Basically it maps the item to the first block used by its data.

//GCF Directory Map Header
typedef struct tagGCFDIRECTORYMAPHEADER
{
	DWORD Dummy0;			// Always 0x00000001
	DWORD Dummy1;			// Always 0x00000000
} GCFDIRECTORYMAPHEADER, *LPGCFDIRECTORYMAPHEADER;

//GCF Directory Map Entry
typedef struct tagGCFDIRECTORYMAPENTRY
{
	DWORD FirstBlockIndex;	// Index of the first data block. (N/A if == BlockCount.)
} GCFDIRECTORYMAPENTRY, *LPGCFDIRECTORYMAPENTRY;
			

The Directory Map structure looks like:

  • GCFDirMapHeader: GCF directory map header
  • GCFDirMapEntry: GCF directory map entries

Directory Map Notes:

1. The GCFDirMapEntry uses the same index as the GCFDirEntry, so when looking up a file in the entry list one can easily get the First Block index of that entry. If the First Block index is BlockCount, then there is no data for that Directory entry.
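
In code the lookup described in note 1 is little more than an array index. A small sketch with a hypothetical function name follows; treating any value >= BlockCount as "no data" rather than only BlockCount itself is my own defensive choice.

```c
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFDIRECTORYMAPENTRY
{
    DWORD FirstBlockIndex;
} GCFDIRECTORYMAPENTRY;

/* Looks up the first block for a directory item.
   Returns 1 and sets *first_block if the item has data,
   0 if it has none, -1 if directory_index is out of range. */
int gcf_first_block_for_item( const GCFDIRECTORYMAPENTRY * map,
                              DWORD item_count, DWORD block_count,
                              DWORD directory_index, DWORD * first_block )
{
    if( directory_index >= item_count )
        return -1;
    if( map[directory_index].FirstBlockIndex >= block_count )
        return 0;
    *first_block = map[directory_index].FirstBlockIndex;
    return 1;
}
```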

Checksums

The checksums section contains some form of checksums for the data in the files. When this section was first written I had no idea how these checksums were calculated or what they stood for; the algorithm Thomas Kaiser later contributed (see the Sample Code section) appears to cover this. They do appear to be related to a defined length of data of the corresponding Directory entry.

//GCF Checksum Header
typedef struct tagGCFCHECKSUMHEADER
{
	DWORD Dummy0;			// Always 0x00000001
	DWORD ChecksumSize;		// Size of LPGCFCHECKSUMHEADER & LPGCFCHECKSUMMAPHEADER & in bytes.
} GCFCHECKSUMHEADER, *LPGCFCHECKSUMHEADER;

//GCF Checksum Map Header
typedef struct tagGCFCHECKSUMMAPHEADER
{
	DWORD Dummy0;			// Always 0x14893721
	DWORD Dummy1;			// Always 0x00000001
	DWORD ItemCount;		// Number of items.
	DWORD ChecksumCount;		// Number of checksums.
} GCFCHECKSUMMAPHEADER, *LPGCFCHECKSUMMAPHEADER;

//GCF Checksum Map Entry
typedef struct tagGCFCHECKSUMMAPENTRY
{
	DWORD ChecksumCount;		// Number of checksums.
	DWORD FirstChecksumIndex;	// Index of first checksum.
} GCFCHECKSUMMAPENTRY, *LPGCFCHECKSUMMAPENTRY;

//GCF Checksum Entry
typedef struct tagGCFCHECKSUMENTRY
{
	DWORD Checksum;				// Checksum.
} GCFCHECKSUMENTRY, *LPGCFCHECKSUMENTRY;
			

The Checksums structure looks like:

  • GCFCheckHeader: GCF checksum header
  • GCFCheckMapHeader: GCF checksum map header
  • GCFCheckMapEntry: GCF checksum map entries
  • GCFCheckEntry: GCF checksum entries

GCFCheckHeader Notes:

1. The CheckSize contains the total checksums size in number of bytes from the start of GCFCheckMapHeader. Thus GCFCheckMapHeader + GCFCheckHeader.CheckSize = GCFDataHeader.

GCFCheckMapEntry Notes:

1. It seems as if each checksum is generated for a maximum of 0x8000 bytes (4 * BlockSize, with BlockSize = 0x2000). This means that files larger than 0x8000 bytes have several checksums, one for each piece of data. CheckCount contains this number of checksums.

GCFCheckEntry Notes:

1. There are a total of (GCFCheckMapHeader.CheckCount + 0x20) entries. I have no idea what these other 0x20 checksums are for, nor do I have any idea how the other checksums are calculated.
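
As a sketch of how the map entries tie a directory item to its checksums: the item's ChecksumIndex selects a GCFCHECKSUMMAPENTRY, which gives the first checksum and how many follow. The expected count for a file is its size divided into 0x8000 byte pieces, rounded up. Both function names are hypothetical.

```c
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFCHECKSUMMAPENTRY
{
    DWORD ChecksumCount;
    DWORD FirstChecksumIndex;
} GCFCHECKSUMMAPENTRY;

#define GCF_CHECKSUM_CHUNK 0x8000u

/* Number of checksums expected for a file of file_size bytes. */
DWORD gcf_expected_checksum_count( DWORD file_size )
{
    return file_size / GCF_CHECKSUM_CHUNK +
           ( file_size % GCF_CHECKSUM_CHUNK != 0 );
}

/* Finds the checksum range for a directory item's ChecksumIndex.
   Returns 0 on success, -1 if the index is out of range (which
   includes 0xFFFFFFFF, used for folders). */
int gcf_checksum_range( const GCFCHECKSUMMAPENTRY * map, DWORD map_item_count,
                        DWORD checksum_index, DWORD * first, DWORD * count )
{
    if( checksum_index >= map_item_count )
        return -1;
    *first = map[checksum_index].FirstChecksumIndex;
    *count = map[checksum_index].ChecksumCount;
    return 0;
}
```

Comparing the count from the map entry with gcf_expected_checksum_count for the item's size would be one way to sanity-check this reading of the format.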

Data Blocks

The data that makes up the files is divided into 8 KB (8192 byte) blocks preceded by a small header. If a block is not completely used, the remaining unused bytes are padded out with NULL characters (0x00).

//GCF Data Header
typedef struct tagGCFDATABLOCKHEADER
{
	DWORD GCFVersion;	// GCF file version.
	DWORD BlockCount;	// Number of data blocks.
	DWORD BlockSize;	// Size of each data block in bytes.
	DWORD FirstBlockOffset; // Offset to first data block.
	DWORD BlocksUsed;	// Number of data blocks that contain data.
	DWORD Checksum;		// Header checksum.
} GCFDATABLOCKHEADER, *LPGCFDATABLOCKHEADER;
			

Notes:

1. FirstBlockOffset is the offset from the start of the file to the data blocks.

2. There can be a region of garbage between the data block header and the data blocks.

3. BlocksUsed indicates total blocks with data in them.

4. The checksum of the header does NOT take the GCFVersion into account.
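
Notes 1 and 4 can be sketched as follows: the header checksum is summed over every field except GCFVersion, and a data block is located by adding a multiple of BlockSize to FirstBlockOffset. The function names are hypothetical, and the 32-bit wraparound of the sum is an assumption.

```c
#include <stdint.h>

typedef uint32_t DWORD; /* assumption: unsigned 32-bit */

typedef struct tagGCFDATABLOCKHEADER
{
    DWORD GCFVersion, BlockCount, BlockSize;
    DWORD FirstBlockOffset, BlocksUsed, Checksum;
} GCFDATABLOCKHEADER;

/* Returns 1 if Checksum equals the sum of the header fields
   other than GCFVersion, 0 otherwise. */
int gcf_data_header_ok( const GCFDATABLOCKHEADER * h )
{
    DWORD sum = h->BlockCount + h->BlockSize +
                h->FirstBlockOffset + h->BlocksUsed;
    return sum == h->Checksum;
}

/* Offset of data block block_index from the start of the file. */
uint64_t gcf_data_block_offset( const GCFDATABLOCKHEADER * h, DWORD block_index )
{
    return (uint64_t)h->FirstBlockOffset +
           (uint64_t)block_index * h->BlockSize;
}
```

Using FirstBlockOffset rather than the end of the header as the base also skips over any garbage region mentioned in note 2.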

Legal/Disclaimer

The information on this page was derived from studying several Steam GCF files with a Hex Editor and human logic. No software was reverse engineered or source code used.

The information on this page comes with no warranty or liability of any kind. None of the contributors to this documentation take any responsibility for any result of using this information or sample code.