In this section, where the description of a file says that an item is an offset into another file, that file may be located in the same CHM, or it may be located in an accompanying CHI file.
The different types of ITSF files contain different internal files. The list below indicates which file types contain which internal files:
The files I have seen so far have been empty or filled with zero BYTEs so who knows. My guess is that it has something to do with information types. The file where it had a non-zero size (12 zero BYTEs in VOICESDK.CHI from the MSDN) also had a non-zero /#SYSTEM code 15 (Information type checksum) entry of 0xFFFFFFFF.
Offset | Type | Comment/Value |
0 | DWORD | 2/3 (Version number) |
4 | /#SYSTEM entries to the EOF |
/#SYSTEM entries have the following format:
Offset | Type | Comment/Value |
0 | WORD | code - see below for values & meanings |
2 | WORD | length of data |
4 | BYTEs | data |
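As a rough illustration of the two tables above, the following Python sketch walks a raw /#SYSTEM buffer and splits it into (code, data) pairs. It assumes the integers are little-endian (not stated above) and that the internal file has already been extracted from the CHM; `parse_system` is a hypothetical helper name, not part of any existing library.

```python
import struct

def parse_system(data: bytes):
    """Return (version, [(code, payload), ...]) from raw /#SYSTEM bytes."""
    (version,) = struct.unpack_from("<I", data, 0)   # DWORD at offset 0
    entries = []
    pos = 4
    while pos + 4 <= len(data):
        # Each entry: WORD code, WORD length, then `length` bytes of data.
        code, length = struct.unpack_from("<HH", data, pos)
        pos += 4
        payload = data[pos:pos + length]
        if len(payload) < length:
            break                                    # truncated entry, stop
        entries.append((code, payload))
        pos += length
    return version, entries
```

The payloads are returned as raw bytes because their meaning depends on the code; string entries would still need their NT stripped.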
The list below gives the codes in numeric order. Within the /#SYSTEM file itself the entries appear in this order: 10, 9, 4, 2, 3, 16, 6, (5, 0, 1 or 0, 1, 5 - I haven't been able to make files with all three), 7, 11, 12, 13, 14, 8 and lastly 15.
Code | Explanation |
0 | Value of Contents file in [OPTIONS] section of the HHP file. NT |
1 | Value of Index file in [OPTIONS] section of the HHP file. NT |
2 | Value of Default topic in [OPTIONS] section of the HHP file. NT |
3 | Value of Title in [OPTIONS] section of the HHP file. NT |
4 | 28 (HHA Version 4.72.7294 and earlier) or 36 (HHA Version 4.72.8086 and later) byte structure: |
5 | Value of Default Window in [OPTIONS] section of the HHP file. NT |
6 | Value of Compiled file in [OPTIONS] section of the HHP file. This is the lowercase of the stem of the CHM file name. If the name of the CHM is "..\bar\foo\ FOO-Bar . chm jimmy is a poo-bum" then this will be " foo-bar ". NT |
7 | *DWORD present in files with "Binary Index=Yes". The entry in the /#URLTBL file that points to the sitemap index had the same first DWORD. |
8 | Rare. VOICESDK.CHM & CHI and WOSA.CHI from the MSDN have one. The abbreviations and explanations seem to be the same in WOSA.CHI & VOICESDK.CHM, except for 2 mistakes (one in VOICESDK.CHM & one in WOSA.CHI) that seem to be created by bugs in the compiler. Both were compiled by the same version of HHA (4.72.8086), so perhaps this version has some weird bug. Each entry is 16 BYTEs: |
9 | The version/program that the CHM was compiled by - shown in the version dialog as "Compiled with %s", where %s is the contents of this entry of the /#SYSTEM file. If compiled with the MS HTML Help Author DLL then it will be something like "HHA Version 4.74.8702". It comes directly from the resource strings of HHA.dll (I saw it there in Unicode and successfully altered it). Beware that the text control in the version dialog that displays it is only so big: in some cases the string won't be displayed at all, and in other cases only part of it, depending upon the effect of wrapping. If you write a compiler, be sure to test it and use a short name and version. Usually NT, but HH won't crash if it isn't. |
10 | time_t timestamp (DWORD). Not sure of the start year yet. |
11 | *DWORD present in files with "Binary TOC=Yes". The entry in the /#URLTBL file that points to the sitemap contents has the same first DWORD. |
12 | *Number of information types (DWORD). |
13 | *The /#IDXHDR file contains exactly the same bytes. See below for more info |
14 | Rare. The ones I saw were from MS Word 2000. My guess is that it is an MSOffice extension (or maybe not) that overrides the names & window types of the navigation tabs. DWORD number of windows to override, 2 ANSI NT strings for each window. The first is the text for the tab & the second is probably the name of the window type to use. (eg 2, "&Answer Wizard\0MsoHelpAWDlg\0&Index\0MsoHelpKeyDlg\0") These are from the Custom tab variables of the [OPTIONS] section of the HHP file. The resources from MSOHELP.EXE have a weird .reg file that gives the CLSIDs involved in the provision of these dialogs. |
15 | *Information type checksum (DWORD). Unknown algorithm & data source. |
16 | Value of Default Font in [OPTIONS] section of the HHP file. NT |
17-65535 | Not yet seen. Please let us know if you see these. |
*Not present in files with "Compatibility=1.0"
This has exactly the same bytes as the code 13 entry in the /#SYSTEM file and is 4096 bytes long.
Offset | Type | Comment/Value |
0 | char[4] | T#SM |
4 | DWORD | Unknown timestamp/checksum |
8 | DWORD | 1 (unknown) |
0xC | DWORD | Number of topic nodes including the contents & index files |
0x10 | DWORD | 0 (unknown) |
0x14 | DWORD | Offset in the /#STRINGS file of the ImageList param of the "text/site properties" object of the sitemap contents (0/-1 = none) |
0x18 | DWORD | 0 (unknown) |
0x1C | DWORD | 1 if the value of the ImageType param of the "text/site properties" object of the sitemap contents is "Folder". 0 otherwise. |
0x20 | DWORD | The value of the Background param of the "text/site properties" object of the sitemap contents |
0x24 | DWORD | The value of the Foreground param of the "text/site properties" object of the sitemap contents |
0x28 | DWORD | Offset in the /#STRINGS file of the Font param of the "text/site properties" object of the sitemap contents (0/-1 = none) |
0x2C | DWORD | The value of the Window Styles param of the "text/site properties" object of the sitemap contents |
0x30 | DWORD | The value of the ExWindow Styles param of the "text/site properties" object of the sitemap contents |
0x34 | DWORD | Unknown. Often -1. Sometimes 0. |
0x38 | DWORD | Offset in the /#STRINGS file of the FrameName param of the "text/site properties" object of the sitemap contents (0/-1 = none) |
0x3C | DWORD | Offset in the /#STRINGS file of the WindowName param of the "text/site properties" object of the sitemap contents (0/-1 = none) |
0x40 | DWORD | Number of information types. |
0x44 | DWORD | Unknown. Often 1. Also 0, 3. |
0x48 | DWORD | Number of files in the [MERGE FILES] list |
0x4C | DWORD | Unknown. Often 0. Non-zero mostly in files with some files in the merge files list. |
0x50 | DWORD[1004] | List of offsets in the /#STRINGS file that are the [MERGE FILES] list. Zero terminated, but don't count on it. |
This file contains information on the window types in the CHM. It has the following format:
Offset | Type | Comment/Value |
0 | DWORD | Number of entries in the file |
4 | DWORD | Size of each of the entries in the file (188 or 196) |
8 | /#WINDOWS entries to the EOF |
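The header above is enough to cut the file into individual window entries without knowing their contents. A minimal sketch, assuming little-endian DWORDs and an already-extracted /#WINDOWS buffer (`split_windows_entries` is a hypothetical name):

```python
import struct

def split_windows_entries(data: bytes):
    """Return (entry_size, [raw_entry, ...]) from raw /#WINDOWS bytes."""
    count, size = struct.unpack_from("<II", data, 0)
    entries = []
    for i in range(count):
        start = 8 + i * size
        entry = data[start:start + size]
        if size == 0 or len(entry) < size:
            break                      # header disagrees with the file length
        entries.append(entry)
    return size, entries
```

Using the size from the header rather than a hard-coded 188 or 196 means entries of some other, not yet seen, size would still be split correctly.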
/#WINDOWS entries are basically HH_WINTYPE structures as specified in htmlhelp.h. Note the first DWORD can be used to specify different versions of this structure. Also note that the HHW docs show a different structure to htmlhelp.h. Therefore many CHM files need to be surveyed to find structures with sizes other than 188 or 196. In the description of /#WINDOWS entries below, Arg n means that that item is argument n of the window definition in the HHP file, either converted to a DWORD or to an offset in the indicated file:
Offset | Type | Comment/Value |
0 | DWORD | Size of the entry (188 in CHMs compiled with "Compatibility=1.0", 196 in CHMs compiled with "Compatibility=1.1 or later") |
4 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "BOOL fUniCodeStrings; // IN/OUT: TRUE if all strings are in UNICODE" |
8 | DWORD | Arg 0. Offset in /#STRINGS file. |
0xC | DWORD | Which window properties are valid & are to be used for this window. See the table below. |
0x10 | DWORD | Arg 10. |
0x14 | DWORD | Arg 1. Offset in /#STRINGS file. |
0x18 | DWORD | Arg 14. |
0x1C | DWORD | Arg 15. |
0x20 | RECT | Arg 13. Order left, top, right & bottom. |
0x30 | DWORD | Arg 16. |
0x34 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHelp; // OUT: window handle" |
0x38 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndCaller; // OUT: who called this window" |
0x3C | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "HH_INFOTYPE* paInfoTypes; // IN: Pointer to an array of Information Types" |
0x40 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndToolBar; // OUT: toolbar window in tri-pane window" |
0x44 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndNavigation; // OUT: navigation window in tri-pane window" |
0x48 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHTML; // OUT: window displaying HTML in tri-pane window" |
0x4C | DWORD | Arg 11. |
0x50 | BYTE[16] | 0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcHTML; // OUT: HTML window coordinates" & the HHW docs say "Specifies the coordinates of the Topic pane." |
0x60 | DWORD | Arg 2. Offset in /#STRINGS file. |
0x64 | DWORD | Arg 3. Offset in /#STRINGS file. |
0x68 | DWORD | Arg 4. Offset in /#STRINGS file. |
0x6C | DWORD | Arg 5. Offset in /#STRINGS file. |
0x70 | DWORD | Arg 12. |
0x74 | DWORD | Arg 17. |
0x78 | DWORD | Arg 18. |
0x7C | DWORD | Arg 19. |
0x80 | DWORD | Arg 20. |
0x84 | BYTE[20] | 0 (unknown) - but htmlhelp.h indicates that this is "BYTE tabOrder[HH_MAX_TABS + 1]; // IN/OUT: tab order: Contents, Index, Search, History, Favorites, Reserved 1-5, Custom tabs" |
0x98 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "int cHistory; // IN/OUT: number of history items to keep (default is 30)" |
0x9C | DWORD | Arg 7. Offset in /#STRINGS file. |
0xA0 | DWORD | Arg 9. Offset in /#STRINGS file. |
0xA4 | DWORD | Arg 6. Offset in /#STRINGS file. |
0xA8 | DWORD | Arg 8. Offset in /#STRINGS file. |
0xAC | BYTE[16] | 0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcMinSize; // Minimum size for window (ignored in version 1)" |
Everything after here is only present in CHMs compiled with "Compatibility=1.1 or later". |
0xBC | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "int cbInfoTypes; // size of paInfoTypes;" |
0xC0 | DWORD | 0 (unknown) - but htmlhelp.h indicates that this is "LPCTSTR pszCustomTabs; // multiple zero-terminated strings" |
Value | Valid property |
0x00000002 | Navigation Pane Style. |
0x00000004 | Style Flags. |
0x00000008 | Extended Style Flags. |
0x00000010 | Initial Position. |
0x00000020 | Navigation Pane Width. |
0x00000040 | Show state. |
0x00000080 | Info types. |
0x00000100 | Buttons. |
0x00000200 | Navigation Pane initially closed state. |
0x00000400 | Tab pos. |
0x00000800 | Tab order. |
0x00001000 | History count. |
0x00002000 | Default Pane. |
0x?????00? | The rest of the values either do nothing or are unknown. Please let us know if you find out what the rest are. |
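The DWORD at offset 0xC of a /#WINDOWS entry can be decoded against the table above with a simple bit test. The sketch below is only an illustration; the names `VALID_PROPERTIES` and `decode_valid_properties` are made up for this example.

```python
# Bit values and labels copied from the table above.
VALID_PROPERTIES = {
    0x00000002: "Navigation Pane Style",
    0x00000004: "Style Flags",
    0x00000008: "Extended Style Flags",
    0x00000010: "Initial Position",
    0x00000020: "Navigation Pane Width",
    0x00000040: "Show state",
    0x00000080: "Info types",
    0x00000100: "Buttons",
    0x00000200: "Navigation Pane initially closed state",
    0x00000400: "Tab pos",
    0x00000800: "Tab order",
    0x00001000: "History count",
    0x00002000: "Default Pane",
}

def decode_valid_properties(flags: int):
    """Return (list of known property names, leftover unknown bits)."""
    known = [name for bit, name in VALID_PROPERTIES.items() if flags & bit]
    mask = 0
    for bit in VALID_PROPERTIES:
        mask |= bit
    return known, flags & ~mask
```

Returning the leftover bits separately makes it easy to spot files that set values not covered by the table.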
This file is a list of ANSI NT strings. The first is just a NUL character so that offsets into this file can specify zero & get a valid (empty) string. The strings are in this order: "\0", [WINDOWS] (Arg 0, Arg 1, Arg 7, Arg 9, Arg 2, Arg 3, Arg 4, Arg 5, Arg 6, Arg 8) #n..., Contents_0_Entry_title, Index_0_Keyword, Contents_Image_file, Contents_Font, Contents_Default_frame, Contents_Default_window, [MERGE FILES] #n...
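Since many other files refer to /#STRINGS by byte offset, a small lookup helper is handy. This is a sketch only: `string_at` is a hypothetical name, and decoding with code page 1252 is an assumption (the real code page depends on the CHM).

```python
def string_at(strings: bytes, offset: int, encoding: str = "cp1252"):
    """Return the NT string starting at `offset`, or None if out of range."""
    if offset < 0 or offset >= len(strings):
        return None                      # covers -1 / 0xFFFFFFFF "none" values
    end = strings.find(b"\0", offset)
    if end == -1:
        end = len(strings)               # tolerate a missing final NT
    return strings[offset:end].decode(encoding, errors="replace")
```

Offset zero returns an empty string, which matches the purpose of the leading NUL described above.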
Present in files with a non-empty contents file, "Binary TOC=Yes" and "Compatibility=1.1 or later".
Offset | Type | Comment/Value |
0 | DWORD | 4096/header length/offset of #1 below |
4 | DWORD | offset of #3 below |
8 | DWORD | number of #3 below |
0xC | DWORD | offset of #2 below |
0x10 | BYTE[4080] | 0 (unknown) |
The header is followed by the following different types of structs in the specified order:
First all the top level books/pages, then the next level, then the next & so on
Offset | Type | Comment/Value |
0 | WORD | 0 (unknown) |
2 | WORD | Unknown |
4 | DWORD | Seems to be a bit field: 0x2 is whether or not the New value is set to 1, 0x4 is set when the entry is a book/has children and 0x8 is set when the entry has a Local value. The other bits are unknown (0x1, 0x40, 0x100 are sometimes set on books). |
8 | DWORD | Unknown. In some cases it is an index into the /#TOPICS file of the entry containing offsets to the title & filename. |
0xC | DWORD | Offset to the parent book. |
0x10 | DWORD | Offset to the next book/page in the current book/page. |
The next two DWORDs are only present in books (28 byte structs) |
0x14 | DWORD | Offset to the first child of the book. |
0x18 | DWORD | 0 (unknown) |
Offset | Type | Comment/Value |
0 | DWORD | Offset into #1 above. |
4 | DWORD | Some kind of sequence number that is incremented by one and starts at 666. I swear :) |
8 | DWORD | Offset into #2 above. Can contain RAM litter. |
0xC | DWORD | Index in /#TOPICS file of the entry containing offsets to the title & filename. Can contain RAM litter. |
This file contains information on the topics present. It is sorted by URL.
Each entry has the following format.
Offset | Type | Comment/Value |
0 | DWORD | Offset into the tree in the /#TOCIDX file. |
4 | DWORD | Offset in /#STRINGS file of title. -1 = no title. |
8 | DWORD | Offset in /#URLTBL of entry containing offset to /#URLSTR entry containing the URL. |
0xC | DWORD | 2 indicates not in contents, 6 indicates that it is in the contents, 0/4 something else (unknown) |
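Because every /#TOPICS entry is a fixed 16 bytes, the whole file can be read as an array. A minimal sketch, assuming little-endian DWORDs; `TopicEntry` and `parse_topics` are names invented for this example, and the title offset is read as signed so that "no title" shows up as -1.

```python
import struct
from typing import NamedTuple

class TopicEntry(NamedTuple):
    tocidx_offset: int   # offset into the tree in /#TOCIDX
    title_offset: int    # offset in /#STRINGS, -1 = no title
    urltbl_offset: int   # offset of the matching /#URLTBL entry
    flags: int           # 2 = not in contents, 6 = in contents, 0/4 unknown

def parse_topics(data: bytes):
    """Return a list of TopicEntry from raw /#TOPICS bytes."""
    usable = len(data) - len(data) % 16          # ignore any trailing partial entry
    return [TopicEntry(*fields)
            for fields in struct.iter_unpack("<IiII", data[:usable])]
```

The list index of an entry is the "index in /#TOPICS file" that other files refer to.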
Before all the entries is an unknown BYTE. So far this has been 0, 0x42 and in spechsdk.chi it was 0x49. Does not indicate presence/absence of URL/FrameName strings.
This is followed by all the URL strings from the HHC (NT). Then come all the FrameName strings from the HHC (NT).
Then come the other entries. Each entry has the following format.
Offset | Type | Comment/Value |
0 | DWORD | Offset of the URL for this topic. |
4 | DWORD | Offset of the FrameName for this topic. |
8 | ANSI/UTF-8 NT string that is the Local for this topic. |
Each block has the following format.
Offset | Type | Comment/Value |
0 | DWORD[3][341] | 341 entries. 12 bytes each. |
0xFFC | DWORD | 4096 (unknown) possibly the length of the block? That MS would pull this kind of shit is really annoying; they should have just put all the entries one after another, not stuffed in an arbitrary DWORD after every 4092 bytes. |
Each entry has the following format.
Offset | Type | Comment/Value |
0 | DWORD | Unknown. I suspect that this is either some kind of unique ID or two WORDs. |
4 | DWORD | Index of entry in /#TOPICS file. |
8 | DWORD | Offset in /#URLSTR file of entry containing filename. |
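Reading the entries means stepping over the extra DWORD at the end of every 4096-byte block. The sketch below does that; it assumes little-endian DWORDs and that a short final block simply holds fewer entries (an assumption, not something stated above). `parse_urltbl` is a hypothetical name.

```python
import struct

BLOCK_SIZE = 4096
ENTRY_BYTES = 341 * 12          # 4092 bytes of entries per block

def parse_urltbl(data: bytes):
    """Return a list of (unknown, topics_index, urlstr_offset) tuples."""
    entries = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + ENTRY_BYTES]   # drops the trailing DWORD
        usable = len(block) - len(block) % 12
        entries.extend(struct.iter_unpack("<III", block[:usable]))
    return entries
```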
This is basically the [ALIAS] section of the HHP file.
Offset | Type | Comment/Value |
0 | DWORD | Size of the file minus 4 (num entries = (filelen-4)/8) |
4 | /#IVB entries to the EOF |
/#IVB entries have the following format.
Offset | Type | Comment/Value |
0 | DWORD | The value of the alias |
4 | DWORD | Offset in /#STRINGS file of the file to show |
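The entry count follows directly from the leading DWORD, as noted in the table. A short sketch under the same little-endian assumption (`parse_ivb` is a hypothetical name):

```python
import struct

def parse_ivb(data: bytes):
    """Return [(alias_value, strings_offset), ...] from raw /#IVB bytes."""
    (size,) = struct.unpack_from("<I", data, 0)
    # Trust the smaller of the stored size and the bytes actually present.
    count = min(size, len(data) - 4) // 8
    return [struct.unpack_from("<II", data, 4 + 8 * i) for i in range(count)]
```

Each returned offset can then be looked up in /#STRINGS to get the file to show.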
This file is present when the [SUBSETS] section is present in the HHP file.
Offset | Type | Comment/Value |
0 | WORD | 0 (unknown) |
2 | WORD | Number of bytes taken up by the subset entries. |
4 | Subset entries. |
The subset entries currently seem to be garbage left over from previous usage of the same memory locations. Based on the number of bytes per non-whitespace line in the [SUBSETS] section each subset entry is 12 BYTEs in length.
Empty when "Full-text search=No" or when no HTML files have been indexed. Holds the full-text search information. Absolutely no line numbers/offsets to files are stored! If an HTML file contains a word longer than 99 characters then it seems the indexing routines will die during indexing of that file and then skip on to the next one. All word sorting, processing and storage is done case-insensitively and is not case-preserving. Note that files without ".h" in their names will not contribute keywords to this fast-search index. The function of this file seems to be to store the words found in any of the HTML files, so the search code can quickly eliminate those words that are not present.
All the below stuff has only been tested with one input HTML file.
The file begins with a header that is 0x400 bytes in length.
Offset | Type | Comment/Value |
0 | BYTE[4] | 0x00 0x00 0x28 0x00 (unknown) |
4 | DWORD | Number of HTML files indexed after any automatic splitting. |
8 | DWORD | Offset to the last word tree block (4096 less than the file length) |
0xC | DWORD | 0 (unknown) |
0x10 | DWORD | Unknown. |
0x14 | DWORD | Offset to the last word tree block (4096 less than the file length) |
0x18 | WORD | 1/2 (unknown) |
0x1A | DWORD | 7 (unknown) |
0x1E | BYTE | 2 (unknown) |
0x1F | BYTE | Unknown. |
0x20 | BYTE | 2 (unknown) |
0x21 | BYTE | Unknown. |
0x22 | BYTE | 2 (unknown) |
0x23 | BYTE | Unknown. |
0x24 | BYTE[10] | 0 (unknown) |
0x2E | DWORD | Length of the word tree blocks (4096). |
0x32 | DWORD | 0/1 (unknown) |
0x36 | DWORD | Word index of the last duplicate. |
0x3A | DWORD | Character index of the last duplicate, counted from the first character of the first word. The whitespace after tags is not included. HTML character entities (&amp; type things) are counted as one character. Line endings are not counted in this. |
0x3E | DWORD | Length of the longest word in the list not including NT (maximum of 99). |
0x42 | DWORD | Number of words including duplicates. |
0x46 | DWORD | Number of words not including duplicates. |
0x4A | DWORD | The total length of all the words including duplicates is this DWORD plus the next one. It is unknown how the split is performed. |
0x4E | DWORD | This one is usually smaller than the previous one. |
0x52 | DWORD | Total length of all the words not including duplicates. |
0x56 | DWORD | Length of unused/null bytes at the end of the word block (if only 1 block; more than the total if > 1 block - possibly some free space in tables). |
0x5A | DWORD | 0 (unknown) |
0x5E | DWORD | One less than the number of HTML files indexed (not entirely sure) |
0x62 | BYTE[24] | 0 (unknown) |
0x7A | DWORD | Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI)) |
0x7E | DWORD | LCID from the HHP file. |
0x82 | BYTE[894] | 0 (unknown) |
The header is followed by pairs of unknown variable size blocks (presumably a table of urls) and word tree blocks.
The blocks containing the word trees are 4096 bytes in length.
If there are 2 or more word tree blocks then the second last one will have a zero next offset and the last one will only have a WORD header that indicates the length of free space at the end of the current word tree block. Also the last one will have different word entries and there is no table before the last word block.
Offset | Type | Comment/Value |
0 | DWORD | Offset to the next word tree block. 0 if this is the second last word tree block or there is only one word tree block. |
4 | WORD | 0 (unknown) |
6 | WORD | Length of free space at the end of the current word tree block. |
This is followed by word entries:
Offset | Type | Comment/Value |
0 | BYTE | Length of the word/partial word in this entry including the NT (Don't count on the NT though). Maximum of 100. |
1 | BYTE | Position in the word where characters are placed. |
2 | BYTEs | Length bytes make up the word or part of the word. NT (Don't count on the NT though) |
+0 | BYTE | Unknown. Some kind of block number? |
+1 | BYTE | Index number |
+2 | BYTE | Unknown. Some kind of block number? |
+3 | DWORD | Unknown. Some kind of block number? |
+7 | BYTE | How much to increase the index number by for the next entry. |
I found a bug in the normal entries of several CHMs, where the length is 1, the position points to the NT and the data is an 0x02 BYTE. This means that the NT will be overwritten & an invalid address might be accessed, unless the reader is robust. Besides this, no such 0x02 byte occurred in the HTML & if it did it would not be considered part of a word, so perhaps this has another meaning, like the BYTEs being ENCINTs.
Offset | Type | Comment/Value |
0 | BYTE | One more than the length of the word/partial word in this entry. |
1 | BYTE | Position in the word where characters are placed (0). |
2 | BYTEs | Length bytes make up the word or part of the word. Not NT |
+0 | BYTE | Unknown. |
+1 | BYTE | Unknown. |
+2 | DWORD | 0/1 (unknown) |
Words are made up of the following characters stored as is: 0x01 (buggy), 0-9, a-z, _, 0xDE, 0xFE. The following are converted and stored: A-Z are converted to lower case; 0x8A, 0x9A are converted to s; 0x8C, 0x9C are converted to oe; 0x9F, 0xDD, 0xFD, 0xFF are converted to y; 0xC0-0xC5, 0xE0-0xE5 are converted to a; 0xC6, 0xE6 are converted to ae; 0xC7, 0xE7 are converted to c; 0xC8-0xCB, 0xE8-0xEB are converted to e; 0xCC-0xCF, 0xEC-0xEF are converted to i; 0xD0 is converted to d; 0xD1, 0xF1 are converted to n; 0xD2-0xD8, 0xF0, 0xF2-0xF8 are converted to o; 0xD9-0xDC, 0xF9-0xFC are converted to u; 0xDF is converted to ss. These conversions may depend on the codepage, character set, font and language set in the HHP file (I'm just guessing here). There are a few bugs: An 0x1 in a word causes a space to be placed at the end of the word and then the word is joined to the next word. This bug affects the fields in the header. Weird bug where if the word is 16 characters in length then the word is doubled plus the first 7 chars in length. And probably many more hidden ones.
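The character conversions listed above can be written down as a lookup table. The sketch below works on raw bytes and deliberately does not try to reproduce the bugs mentioned (the 0x01 joining and the 16-character doubling); `build_word_table` and `normalize_word` are names invented for this illustration.

```python
def build_word_table():
    """Map each accepted input byte to the byte(s) stored in the index."""
    table = {}
    for b in b"\x01_\xde\xfe0123456789abcdefghijklmnopqrstuvwxyz":
        table[b] = bytes([b])                       # stored as is
    for b in range(ord("A"), ord("Z") + 1):
        table[b] = bytes([b + 32])                  # A-Z to lower case
    rules = [
        (b"s", [0x8A, 0x9A]),
        (b"oe", [0x8C, 0x9C]),
        (b"y", [0x9F, 0xDD, 0xFD, 0xFF]),
        (b"a", [*range(0xC0, 0xC6), *range(0xE0, 0xE6)]),
        (b"ae", [0xC6, 0xE6]),
        (b"c", [0xC7, 0xE7]),
        (b"e", [*range(0xC8, 0xCC), *range(0xE8, 0xEC)]),
        (b"i", [*range(0xCC, 0xD0), *range(0xEC, 0xF0)]),
        (b"d", [0xD0]),
        (b"n", [0xD1, 0xF1]),
        (b"o", [*range(0xD2, 0xD9), 0xF0, *range(0xF2, 0xF9)]),
        (b"u", [*range(0xD9, 0xDD), *range(0xF9, 0xFD)]),
        (b"ss", [0xDF]),
    ]
    for stored, codes in rules:
        for b in codes:
            table[b] = stored
    return table

WORD_TABLE = build_word_table()

def normalize_word(raw: bytes):
    """Return the stored form of `raw`, or None if it has a non-word byte."""
    out = bytearray()
    for b in raw:
        mapped = WORD_TABLE.get(b)
        if mapped is None:
            return None
        out += mapped
    return bytes(out)
```

A tokenizer built on this would treat any byte missing from the table as a word separator.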
Different blocks of bytes. Several types. Methinks it works like a tree.
At each letter of a word you can either terminate (if no duplicates at that letter)
and specify the rest of the word or you can branch out to each variant of that letter.
The advantage of this is that you don't need to search a huge list every time you
do a search.
nope
first you have the whole word at level zero
then you take the rest of the words that begin with the same character
& sort them
I think there are different blocks for terminal (leaves) & branching nodes (branches)
The words are set to lowercase.
There would also be some sort of table to point to the urls that contain the words.
I think that the word blocks form a tree-like structure
1. slurp all the words out of the HTML
2. sort them
3. weed out duplicates
From the name and the number of GUIDs present I guess it has something to do with ActiveX objects. Seems like it can be deleted without major consequence.
Offset | Type | Comment/Value |
0 | DWORD | 0x04000000 (unknown) |
4 | DWORD | Number of entries |
This is followed by a listing, and each listing entry is as follows
0 | DWORD | Offset of the entry in this file |
4 | DWORD | Length of the entry |
The listing is followed by the entries one after another at offsets specified in the listing.
There are 2 known types of entries. The first seems to be made up of up to 3 different sub entries. The second is a 36 BYTE structure.
Offset | Type | Comment/Value |
0 | GUID | {4662DAAF-D393-11D0-9A56-00C04FB68BF7} |
0x10 | DWORD | 0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to. |
0x14 | DWORD | Unknown. Methinks bitflags that somehow affect the size of entries that have the 0x04000000 DWORD, like each bit specifies the presence/absence of a specific subentry. |
0x18 | DWORD | Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI)) |
0x1C | DWORD | LCID from the HHP file. |
0x20 | BYTEs | Unknown |
+0 | Entries |
I haven't been able to find any files without the data for bits 0 & 1 so I can't really say exactly how big the header is and which bytes are part of the bit 0 block and which are part of the bit 1 block. Together, though, bits 0 & 1 account for a large bulk of repeatedly increasing byte blocks of 10 bytes each, plus something else at the end. I suspect that the repeats are for bit 0 and the stuff at the end is bit 1. As to the function of these two bits blocks, well there are no GUIDs and no other clues, so who knows.
Offset | Type | Comment/Value |
0 | char[4] | ""(\0 |
4 | DWORD | Length in bytes of the entries not including the last zero word. |
8 | BYTE[32] | 0 (unknown) |
0x28 | Entries. The last entry has a zero length word. |
Offset | Type | Comment/Value |
0 | WORD | Length of the word |
2 | char[length] | ANSI string from the stop list file, may be uniquified & sorted reverse alphabetically. Not NT. |
Offset | Type | Comment/Value |
0 | GUID | {8FA0D5A8-DEDF-11D0-9A61-00C04FB68BF7} |
0x10 | DWORD | 0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to. |
0x14 | DWORD | 1 (unknown) |
0x18 | DWORD | Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI)) |
0x1C | DWORD | LCID from the HHP file. |
0x20 | DWORD | 0 (unknown) |
Offset | Type | Comment/Value |
0 | GUID | {4662DAB0-D393-11D0-9A56-00C04FB68B66} |
0x10 | DWORD | 666 (May represent the version of the class that the GUID refers to) |
0x14 | DWORD | Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI)) |
0x18 | DWORD | LCID from the HHP file. |
0x1C | DWORD | Unknown. Almost always 10031. Also 66631 (accessib.chm from the MSDN). |
0x20 | DWORD | 0 (unknown) |
The files in the /$WWAssociativeLinks and /$WWKeywordLinks directories have the same formats. The maximum total length (including parents) of an entry in one of these files is 488 characters (including NT). HHW complains about and refuses to output any that are greater than this length.
The /$WWKeywordLinks dir specifies the contents of the Index navigation pane & the /$WWAssociativeLinks dir specifies the Alinks.
This file has a 76 byte header, then 2048 byte blocks. First come all the listing blocks, then all the index blocks. This file is similar to the directory entries in the ITSF format, except that the index blocks are at the end instead of interspersed with the listing blocks. All block indices below are zero based. This file forms a tree, with the last (index mostly) block being the root of the tree. If there is more than one level of index blocks then the root block will have two children; the first in the block header and the second in the entry. WARNING: just as in the ITSF directory there can be garbage in the free space, so respect that first WORD and use it. I'm not yet sure how the listing blocks are split up, though it is probably the same as the ITSF directory (space filling).
Offset | Type | Comment/Value |
0 | BYTE[4] | Unknown. |
4 | WORD | Size of the blocks. |
6 | BYTE[3] | Unknown. |
9 | BYTE[17] | 0 (unknown) |
0x1A | DWORD | Index of the last listing block in the file. |
0x1E | DWORD | Index of the last block in the file |
0x22 | DWORD | -1 (unknown) |
0x26 | DWORD | Number of blocks |
0x2A | WORD | The depth of the tree of blocks (1 if no index blocks, 2 one level of index blocks, ...) |
0x2C | DWORD | Number of keywords in the file. |
0x30 | DWORD | Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI)) |
0x34 | DWORD | LCID from the HHP file. |
0x38 | DWORD | 0 if this BTree is part of a CHW file, 1 if it is part of a CHI or CHM file |
0x3C | DWORD | Unknown. Almost always 10031. Also 66631 (accessib.chm, ieeula.chm, iesupp.chm, iexplore.chm, msoe.chm, mstask.chm, ratings.chm, wab.chm). |
0x40 | DWORD | 0 (unknown) |
0x44 | DWORD | 0 (unknown) |
0x48 | DWORD | 0 (unknown) |
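Several fields in this 76 byte header sit at unaligned offsets (0x1A, 0x1E and so on), so it is easiest to read with a packed layout. A sketch, assuming little-endian values; `BTREE_HEADER` and `parse_btree_header` are hypothetical names and only the fields with known meanings are returned.

```python
import struct

# "<" disables alignment padding, so the fields land on the offsets above.
BTREE_HEADER = struct.Struct("<4sH3s17sIIiIHIIIIIIII")

def parse_btree_header(data: bytes):
    """Return the known header fields of a BTree file as a dict."""
    f = BTREE_HEADER.unpack_from(data, 0)
    return {
        "block_size": f[1],           # offset 4
        "last_listing_block": f[4],   # offset 0x1A
        "last_block": f[5],           # offset 0x1E
        "block_count": f[7],          # offset 0x26
        "depth": f[8],                # offset 0x2A
        "keyword_count": f[9],        # offset 0x2C
        "code_page": f[10],           # offset 0x30
        "lcid": f[11],                # offset 0x34
        "in_chm_or_chi": f[12] == 1,  # offset 0x38
    }
```

Block n then starts at byte 76 + n * block_size, given that the blocks follow the header directly.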
Offset | Type | Comment/Value |
0 | WORD | Length of free space at the end of the block. |
2 | WORD | Number of entries in the block. |
4 | DWORD | Index of the previous block. -1 if this is the first listing block. |
8 | DWORD | Index of the next block. -1 if this is the last listing block. |
Offset | Type | Comment/Value |
0 | WCHARs | Value of the first Name entry from the HHK. If this is a sub-keyword, then this will be all the parent keywords, including this one, separated by ", ". UTF-16 NT. |
+0 | WORD | 2 if this keyword is a See Also keyword, 0 if it is not. |
+2 | WORD | Depth of this entry into the tree. |
+4 | DWORD | Character index of the last keyword in the ", " separated list. |
+8 | DWORD | 0 (unknown) |
+0xC | DWORD | Number of Name, Local pairs |
+0x10 | DWORDs or WCHARs | DWORDs:Index into the /#TOPICS file. UTF-16 NT string: The value of the See Also string. |
+0 | DWORD | Unknown |
+4 | DWORD | Zero based index of this entry in the file (not block). Increments by 13 (each entry is 13 more than the last). |
Offset | Type | Comment/Value |
0 | WORD | Length of free space at the end of the block. |
2 | WORD | Number of entries in the block. |
4 | DWORD | Index of a child block. |
Offset | Type | Comment/Value |
0 | WCHARs | Value of the first Name entry from the HHK. If this is a sub-keyword, then this will be all the parent keywords, including this one, separated by ", ". UTF-16 NT. |
+0 | WORD | 2 if this keyword is a See Also keyword, 0 if it is not. |
+2 | WORD | Depth of this entry into the tree. |
+4 | DWORD | Character index of the last keyword in the ", " separated list. |
+8 | DWORD | 0 (unknown) |
+0xC | DWORD | Number of Name, Local pairs |
+0x10 | DWORDs or WCHARs | DWORDs:Index into the /#TOPICS file. UTF-16 NT string: The value of the See Also string. |
+0 | DWORD | Index of a child block. If it is a listing block then it is the one that starts with the keyword at the start of this entry |
This file contains entries that are 13 bytes in length. All known entries have thus far contained the following bytes: 00000000 05000000 80000000 00. AFAICS this file is useless.
Begins with a WORD indicating the number of entries in the file (also the number of listing blocks in the BTree file). Each entry is 2 DWORDs. The first is a cumulative sum of the number of keywords in the BTree listing blocks & the second is a consecutively increasing index number. Both start at zero.
If there are no links of this type in the CHM then this will be a zero DWORD. Otherwise it contains the following DWORDs: 0, 0, 0, 0xC, 1, 1, 0, 0. AFAICS this file is pretty much useless.
The file begins with a WORD indicating the number of entries.
Each entry has the following format:
Offset | Type | Comment/Value |
0 | WORD | Length of the file stem. |
2 | BYTEs | File stem. ANSI string. Not NT. |
+0 | DWORD | Unknown. |
+4 | DWORD | Unknown. Same value as previous DWORD. |
+8 | DWORD | LCID of the specified file. |
The file begins with a WORD indicating the number of entries.
Each entry is 68 BYTEs in length and has the following format:
Offset | Type | Comment/Value |
0 | BYTE[25] | File stem. ANSI NT fixed length string. |
0x19 | BYTE[25] | Unknown. Seems to be RAM litter, but contains paths, file names, zero bytes, DWORDs and mixtures. |
0x32 | WORD | An index number that begins at 1 and is incremented by 1 for each entry. |
0x34 | DWORD | Unknown. |
0x38 | DWORD | Unknown. Same value as previous DWORD. |
0x3C | DWORD | LCID of the specified file. |
0x40 | DWORD | Number of topic nodes including the contents & index files in the specified file. |
It is a cache of user customized bits of the windowtype entry from the /#WINDOWS file of the \Path\file.chm CHM file.
Offset | Type | Comment/Value |
0 | DWORD | Size of the file in bytes (44) |
4 | Signed DWORD | Position of the left edge of the window. |
8 | Signed DWORD | Position of the top edge of the window. |
0xC | Signed DWORD | Position of the right edge of the window. |
0x10 | Signed DWORD | Position of the bottom edge of the window. |
0x14 | DWORD | Width of the navigation pane in pixels. |
0x18 | DWORD | Non-zero if search highlight is on. |
0x1C | DWORD | Unknown. Not font size, printing options or show state. |
0x20 | DWORD | Non-zero if there is no text on the toolbar buttons. |
0x24 | DWORD | Non-zero if the navigation pane is initially closed. |
0x28 | DWORD | Which navigation tab is currently open. |
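The 44 byte layout above maps onto a single unpack call. A sketch, assuming little-endian values; `parse_window_cache` and the dictionary keys are names chosen for this example only.

```python
import struct

def parse_window_cache(data: bytes):
    """Decode the 44 byte cached window settings described above."""
    f = struct.unpack_from("<I4i6I", data, 0)   # 1 DWORD, 4 signed, 6 DWORDs
    return {
        "size": f[0],
        "rect": f[1:5],                 # left, top, right, bottom (signed)
        "nav_width": f[5],
        "highlight": bool(f[6]),
        "unknown": f[7],
        "no_button_text": bool(f[8]),
        "nav_closed": bool(f[9]),
        "current_tab": f[10],
    }
```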
UTF-16 NT string. Each search item is separated by a UTF-16 Line Feed character. The string is followed by an unknown WORD.
DWORD. Only the lowest 3 bits are used. "Match similar words" is controlled by bit 0. "Search titles only" is controlled by bit 1. "Search previous results" is controlled by bit 2. Note that since previous search results are not stored anywhere as yet HH will uncheck the "Search previous results" checkbox even if its bit is on. IMHO this is a bug: HH should automatically search the whole file if there are no previous results and the checkbox is checked.
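The three bits described above can be tested as follows; `decode_search_flags` is a hypothetical helper name.

```python
def decode_search_flags(value: int):
    """Split the search options DWORD into its three known checkboxes."""
    return {
        "match_similar_words": bool(value & 0x1),      # bit 0
        "search_titles_only": bool(value & 0x2),       # bit 1
        "search_previous_results": bool(value & 0x4),  # bit 2
    }
```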
A DWORD indicating the number of favourites stored for the \Path\file.chm CHM file.
An NT UTF-16 string showing the topic name of bookmark number n (n is zero based).
An NT UTF-16 string showing the URL of bookmark number n (n is zero based). It is a fully qualified path into the \Path\file.chm CHM file.
A set of ANSI/UTF-8 NT strings indicating which internal files have been deleted in the new file. Names use backslash (\) instead of forward slash (/) & don't have an initial slash.
Offset | Type | Comment/Value |
0 | DWORD[8] | Unknown. |
0x20 | DWORD | Length of the name of the old chm. |
0x24 | BYTEs | Name of the old chm. ANSI/UTF-8 NT. |
Please let us know if you find any other internal files, figure out formats of any internal files or find out what unknown parts of the above files do. Any and all contributions will be fully attributed and, if necessary, co-copyright given.