This document is the record of an ongoing struggle to decode the Visio files people keep sending me. I do not own a copy of Visio - or indeed any MS products - so this is a ground-up reverse engineering job. Please be patient, as this is going to take a long time - any assistance is very welcome.
Update January 2004 - I now have a helpful bunch of Visio files with their corresponding XML versions. This is helping a lot - I can now identify the font and colour tables at the start of the file, for example. It's still going to take several months to get this working.
Update May 2006 - I've been concentrating on the VDX (Visio XML export) format. It has all the data of the VSD format, but much less obfuscation.
Update November 2006 - Major progress in cracking the pointer structure and decompression (not by me, I must add). More details soon, with working code.
Update March 2007 - see Valek's vsdump project. I'm working on incorporating this into the Dia VDX plug-in, but first I'm just writing a simple dumping script, which I'll publish here when it works.
Update September 2008 - see HDGF for a Java implementation of VSDump as part of POI.
Visio uses the stream-based MS OLE file format. The full details are largely irrelevant, since so many others have decoded these (thanks to Martin Schwartz's excellent LAOLA documentation). For reference, I provide a Perl script olestreams.pl that will split a file into its OLE streams. Of the examples I possess, I can see the following:
The Visio stream itself
An EMF version of the first page of the Visio drawing, and a bunch of other metadata
The EMF stream is pretty easily decoded - see my EMF decoder.
The DocumentInfo stream does not seem very helpful at the moment, although it does have properties 'Pages' (number of pages), 'Masters', and the names of all the pages. It does not appear to contain anything else that varies between documents.
The decoding task thus focuses on the main stream.
The stream begins with the text "Visio (TM) Drawing\r\n", followed by six NULs and the short 0x6 - possibly a format version number.
There are a number of identifiable internal pointers within the main stream. The most important is the stream length.
Identifiable pointers:
+001c: document length
+0024: trailer section
There must be others, but only these have come to light so far. In particular, I would expect either pointers to each page or their lengths to be somewhere.
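As a rough illustration of the layout described so far, the sketch below checks the signature and reads the two known pointers. It assumes that the version short and both pointers are little-endian and that the pointers are four bytes wide, none of which is confirmed above. The name parse_header and the synthetic demo stream are invented for the example.

```python
import struct

MAGIC = b"Visio (TM) Drawing\r\n"  # 20-byte signature described above


def parse_header(stream):
    """Read the few header fields identified so far (layout partly assumed)."""
    if not stream.startswith(MAGIC):
        raise ValueError("not a Visio main stream")
    # Six NULs follow the signature, which puts the short at +001a.
    if stream[20:26] != b"\0" * 6:
        raise ValueError("expected six NUL bytes after the signature")
    (version,) = struct.unpack_from("<H", stream, 0x1A)  # assumed little-endian
    # Assumption: both pointers are 4-byte little-endian values.
    (doc_length,) = struct.unpack_from("<I", stream, 0x1C)
    (trailer,) = struct.unpack_from("<I", stream, 0x24)
    return {"version": version, "doc_length": doc_length, "trailer": trailer}


# Synthetic 0x28-byte header, built only to exercise the function.
demo = bytearray(0x28)
demo[0:20] = MAGIC
struct.pack_into("<H", demo, 0x1A, 6)
struct.pack_into("<I", demo, 0x1C, 0x28)
struct.pack_into("<I", demo, 0x24, 0x20)
header = parse_header(bytes(demo))
print(header)
```

On a real file, comparing doc_length with the actual stream size would be a quick sanity check on the assumed field width.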
Update: An internal pointer has the structure
Subtype (4 bytes)
Address? (4 bytes)
Offset (4 bytes)
Length (4 bytes)
Type (2 bytes).
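The 18-byte pointer record above could be unpacked along these lines. This is only a sketch: little-endian byte order is an assumption, the field names simply mirror the list above, and the sample values are made up.

```python
import struct
from typing import NamedTuple


class Pointer(NamedTuple):
    subtype: int
    address: int   # meaning uncertain ("Address?" in the list above)
    offset: int
    length: int
    ptr_type: int


# 4 + 4 + 4 + 4 + 2 = 18 bytes; "<" (little-endian, no padding) is assumed.
POINTER = struct.Struct("<IIIIH")


def read_pointer(data, pos=0):
    """Unpack one pointer record starting at byte position pos."""
    return Pointer(*POINTER.unpack_from(data, pos))


# Invented values, purely to show the round trip.
raw = struct.pack("<IIIIH", 0x15, 0, 0x400, 0x120, 0)
ptr = read_pointer(raw)
print(ptr)
```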
The first thing I am trying to decode is strings, which are usually the easiest things to spot in a file. Visio is unfortunately rather fiddly - it seems to be using some form of Lempel-Ziv compression (the back-reference tokens described below look more like LZ77 than LZW). This is going to be painful.
Strings are split into eight-byte segments. Each segment is preceded by a flag byte, the eight bits of which refer to the eight bytes of the segment.
If we are not at the start or the end of the string, and there are no tokens present in the next eight bytes, the flag byte is 0xff.
The giveaway here (ironically) is the reiteration of the string "Copyright 1999 Visio Corporation. All rights reserved.\0" throughout the file. Each time, the second 'right' gets replaced by a two-byte token pointing back to the first 'right'. This token must contain the offset to the original and its length, probably as specific bits. The existence of a token is shown by the appropriate flag bit being zero. The token also replaces the single byte that would have been there, so increasing the segment size.
An example:
00004ad0: 3743 6fff 7079 7269 6768 7420 ff31 3939 7Co.pyright .199
00004ae0: 3920 5669 73ff 696f 2043 6f72 706f ff72 9 Vis.io Corpo.r
00004af0: 6174 696f 6e2e 20df 2041 6c6c 20d8 3273 ation. . All .2s
00004b00: 20ff 7265 7365 7276 6564 f72e 00fe 6801  .reserved....h.
Here each ff denotes an ordinary group of eight literal bytes, but the df indicates that the sixth byte has been replaced by d832 - presumably a pointer back to the 'right' earlier.
0 = fe %1111.1110
1 = fd %1111.1101
2 = fb %1111.1011
3 = f7 %1111.0111
4 = ef %1110.1111
5 = df %1101.1111
6 = bf %1011.1111
7 = 7f %0111.1111
Thus 0xdf indicates the sixth byte has been replaced.
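The flag-byte reading can be tried out on the df segment from the dump above. The sketch below only separates literal bytes from two-byte tokens and makes no attempt to interpret the tokens. The function names are invented for the example, and the bit order (lowest bit first) follows the table above.

```python
def token_positions(flag):
    """0-based positions within a segment whose flag bit is zero (tokens)."""
    return [bit for bit in range(8) if not (flag >> bit) & 1]


def split_segment(data, pos=0):
    """Split one flagged segment into literal bytes and raw two-byte tokens.

    Returns (items, next_pos); literals are 1-byte bytes objects and tokens
    are ("token", two_bytes) tuples, left undecoded.
    """
    flag = data[pos]
    pos += 1
    items = []
    for bit in range(8):
        if pos >= len(data):
            break  # ran out of input mid-segment
        if (flag >> bit) & 1:
            items.append(data[pos:pos + 1])
            pos += 1
        else:
            items.append(("token", data[pos:pos + 2]))
            pos += 2
    return items, pos


# The 'df' segment from the dump above: " All " + token d832 + "s ".
segment = bytes.fromhex("df20416c6c20d8327320")
items, next_pos = split_segment(segment)
literals = b"".join(i for i in items if isinstance(i, bytes))
print(token_positions(0xDF), literals, items[5])
```

Because a token takes two bytes where a literal takes one, this segment occupies ten bytes including its flag byte, which is why next_pos comes back as 10.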
Other values for the token for 'right' include 3c42, 3e42, 4d22, 7642, da32, 2242, bd32, d832. I deduce that the bottom nibble of the second byte holds the length of the replacement text minus 3 (here 2, for the five letters of 'right'), as no other bits are constant amongst these. This also seems to hold for a three-letter replacement, which gets the token 0xeb70.
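The length deduction is easy to check against the token values listed above. In this sketch each token is written as a 16-bit number with its first byte in the high half, so the low nibble of the number is the low nibble of the second byte. The remaining twelve bits presumably encode the offset, but that is not decoded here.

```python
def token_length(token):
    """Length of the replaced text: low nibble of the second byte, plus 3."""
    return (token & 0x0F) + 3


# Tokens observed for 'right' (five letters), as listed above.
right_tokens = [0x3C42, 0x3E42, 0x4D22, 0x7642, 0xDA32, 0x2242, 0xBD32, 0xD832]
lengths = {token_length(t) for t in right_tokens}
print(lengths, token_length(0xEB70))
```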
The flag byte also seems to denote the end of the text, although I don't yet know for sure. It looks as though multiple zeroes indicate the end of the text, and therefore text cannot end with a token.
After looking at some samples, it seems clear that all the text is there, and that it is embedded in or separated by objects.
This seems to be similar in all documents. It may hold a page table, though its size is not related to the number of pages, nor does it appear to contain pointers.
I am currently concentrating on writing a text extractor, so I can confirm my understanding of the text - and hopefully find all the strings. I know each string must be labelled with formatting, coordinates etc. and it might thus be possible to determine a DrawString operator or somesuch.
Here I compare two files that are identical, both containing the same text, except that their save times are different and one has the text in italics.
This leads to the following differences in this stream:
< 000160 b4 00 00 00 03 00 00 00 04 00 00 00 10 00 00 00
< 000170 5f 56 50 49 44 5f 50 52 45 56 49 45 57 53 00 40
< 000180 03 00 00 00 18 00 00 00 5f 56 50 49 44 5f 41 4c
< 000190 54 45 52 4e 41 54 45 4e 41 4d 45 53 00 51 f4 3f
---
> 000160 b4 00 00 00 03 00 00 00 03 00 00 00 18 00 00 00
> 000170 5f 56 50 49 44 5f 41 4c 54 45 52 4e 41 54 45 4e
> 000180 41 4d 45 53 00 00 00 00 04 00 00 00 10 00 00 00
> 000190 5f 56 50 49 44 5f 50 52 45 56 49 45 57 53 00 3f
Apparently a simple reordering of the name variables:
0lxb4, 0lx3,
0lx4, 0lx10
_VPID_PREVIEWS 0x0 0x40
0lx3, 0lx18
_VPID_ALTERNATENAMES 0x0 0x51 0xf4
becomes
0lxb4, 0lx3,
0lx3, 0lx18
_VPID_ALTERNATENAMES 0lx0
0lx4, 0lx10
_VPID_PREVIEWS 0x0
This difference is absent in other files – it appears that these can be either way round.
46c46
< 0002e0 5a 02 00 00 e0 01 00 00 43 03 00 00 ff 01 00 00
---
> 0002e0 5a 02 00 00 e1 01 00 00 44 03 00 00 ff 01 00 00
63c63
< 0003f0 ab aa aa 3e 00 00 00 00 00 00 00 00 00 00 00 00
---
> 0003f0 ab aa aa 3e 00 00 00 00 02 00 00 00 00 00 00 00
84,85c84,85
< 000540 00 00 00 00 90 01 00 00 00 00 00 00 07 00 04 00
< 000550 41 00 72 00 69 00 61 00 6c 00 00 00 ff ff 4d 00
---
> 000540 00 00 00 00 90 01 00 00 01 00 00 00 07 00 04 00
> 000550 41 00 72 00 69 00 61 00 6c 00 00 00 00 00 4d 00
88,92c88,92
< 000580 67 00 75 00 6c 00 61 00 72 00 3a 00 56 00 65 00
< 000590 00 00 fe fb 40 82 e4 03 6f 67 35 00 78 85 1c 34
< 0005a0 14 00 00 00 00 00 13 00 e6 03 00 00 20 39 1d 00
< 0005b0 80 f0 1a 00 ec 05 00 00 00 04 00 00 0a 08 00 00
< 0005c0 04 0c 00 00 00 00 00 00 00 8e 03 00 00 00 13 00
---
> 000580 67 00 75 00 6c 00 61 00 72 00 20 00 49 00 74 00
> 000590 00 00 e1 fb a0 4e 2c 03 6f 67 35 00 d0 24 8f 3b
> 0005a0 1a 00 00 00 00 00 13 00 e6 03 00 00 88 f0 1a 00
> 0005b0 80 f0 1a 00 64 07 00 00 00 06 00 00 0a 08 00 00
> 0005c0 26 0c 00 00 00 00 00 00 70 32 03 00 00 00 13 00
94c94
< 0005e0 01 40 00 00 06 00 00 00 7c db 12 00 7c 4f 1f 34
---
> 0005e0 01 40 00 00 06 00 00 00 7c db 12 00 8c ee 89 31
96,97c96,97
< 000600 3c dc 12 00 a5 db f4 77 e4 03 16 95 01 de f4 77
< 000610 00 00 16 95 e0 dc 12 00 b8 31 1c 34 00 00 00 00
---
> 000600 3c dc 12 00 a5 db f4 77 08 0c 16 74 01 de f4 77
> 000610 00 00 16 74 e0 dc 12 00 00 df 89 31 00 00 00 00
99,102c99,102
< 000630 e4 dc 12 00 3b 68 f5 77 bd 0a 21 7e a0 dc 12 00
< 000640 e4 03 16 95 a8 84 28 00 7c 4f 1f 34 03 00 00 00
< 000650 00 00 00 00 8a 8b 28 00 bd 0a 21 7e 84 dc 12 00
< 000660 a8 84 28 00 7c 4f 1f 34 00 00 00 00 11 00 00 00
---
> 000630 e4 dc 12 00 3b 68 f5 77 78 0c 21 ce a0 dc 12 00
> 000640 08 0c 16 74 a8 84 28 00 8c ee 89 31 03 00 00 00
> 000650 00 00 00 00 8a 8b 28 00 78 0c 21 ce 84 dc 12 00
> 000660 a8 84 28 00 8c ee 89 31 00 00 00 00 11 00 00 00
107c107
< 0006b0 e0 01 00 00 43 03 00 00 ff 01 00 00 01 00 00 00
---
> 0006b0 e1 01 00 00 44 03 00 00 ff 01 00 00 01 00 00 00
113c113
< 000710 65 00 00 00 12 00 00 00 0e 00 00 00 0f 00 00 00
---
> 000710 65 00 20 00 12 00 00 00 0e 00 00 00 0f 00 00 00
A chunk of data seems to be of the form:
Type (4 bytes)
Index (4 bytes)
Unknown (4 bytes)
Length (4 bytes)
?? (3 bytes)
This is followed by the structure for that Type, so a LineTo always starts with coordinates. The chunk may then end with a 4-12 byte trailer.
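A reader for the 19-byte chunk header might look like the sketch below. Little-endian byte order is assumed, the three unexplained bytes are kept as raw bytes, and the type value in the demo is invented rather than taken from a real file. The body and the optional trailer are left alone, since their layout depends on the Type.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 19  # 4 + 4 + 4 + 4 + 3 bytes, as listed above


class ChunkHeader(NamedTuple):
    chunk_type: int
    index: int
    unknown: int
    length: int
    extra: bytes  # the three unexplained bytes, kept raw


def read_chunk_header(data, pos=0):
    """Unpack the fixed part of a chunk; the body starts at pos + HEADER_SIZE."""
    if len(data) - pos < HEADER_SIZE:
        raise ValueError("truncated chunk header")
    chunk_type, index, unknown, length = struct.unpack_from("<IIII", data, pos)
    extra = bytes(data[pos + 16:pos + HEADER_SIZE])
    return ChunkHeader(chunk_type, index, unknown, length, extra)


# Invented header values followed by a dummy body, to show the round trip.
raw = struct.pack("<IIII", 0x01, 7, 0, 0x20) + b"\x00\x00\x00" + b"\xAA" * 0x20
hdr = read_chunk_header(raw)
print(hdr)
```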
Floating point numbers are 8-byte little-endian IEEE-754.
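For reference, an 8-byte little-endian IEEE-754 value can be read with Python's struct module using the "<d" format. The helper name read_double is invented for the example, and the sample bytes are simply the encoding of 1.0 rather than data from a Visio file.

```python
import struct


def read_double(data, pos=0):
    """Read one 8-byte little-endian IEEE-754 double at byte position pos."""
    return struct.unpack_from("<d", data, pos)[0]


raw = bytes.fromhex("000000000000f03f")  # 1.0 stored least significant byte first
value = read_double(raw)
round_trip = read_double(struct.pack("<d", 8.5))
print(value, round_trip)
```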