June 2010


I must admit that, in my career so far, character encodings have been a pretty insignificant concern. Most of the software I write is focused on small, domestic audiences. So character sets mean 1-byte vs. 2-byte or a couple of garbage-ish characters at the top of some files. But, after reading Joel’s guidance on character sets, I’ve been more alert to them. I have a better understanding of how character sets work, and I’m paranoid about them causing trouble for me, though how exactly they work is still a bit of smoke and mirrors.

All that said, I helped Dave solve a character set problem yesterday.

The footer of a site we work on shows a copyright notice with a © symbol, except on some pages, where a stray Â appears just before the ©.

Interesting.

Dave knew that normal aspx pages showed the symbol correctly, while CGI pages showed Â before the ©. The CGI pages are handled by an ASP.NET handler that I wrote, which is why he came to ask me.

My spidey-sense whispered “character encoding,” so I started trying to figure out what the charsets were. I popped open Chrome’s developer tools and checked the headers on a plain ASP.NET page and an ASP.NET/CGI page.

ASP.NET: Content-type: text/html; charset=utf-8

CGI: Content-type: text/html; charset=ISO-8859-1

Aha! It is a charset thing!
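For anyone who would rather check this from a script than from the browser's developer tools, here is a small Python sketch that pulls the charset out of a Content-type value. The function name charset_of is made up for illustration, and the header strings are the two quoted above.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() parses the header parameters and lowercases the result.
    return msg.get_content_charset()


print(charset_of("text/html; charset=utf-8"))
print(charset_of("text/html; charset=ISO-8859-1"))
```

If the two calls disagree for pages that share markup, as they do here, the shared markup is being decoded two different ways.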

“But it says © in the source files,” Dave said. So why does charset matter? Does the browser really interpret ‘©’ differently based on which charset it’s using? Or is ASP.NET being “helpful” again?

I poked through the skin files, finding this:

<asp:Label
   SkinID="FooterCopyrightText"
   Text="Terms and Conditions &copy; 2009 SEP"
   runat="server" />

It looks OK, but because ASP.NET is the consumer of the skin file, ASP.NET interprets the &copy; entity and stores it in a string as the character with code point U+00A9. When it writes out the page, it doesn’t bother figuring out whether to make it an entity again (I wouldn’t either), so it outputs the UTF-8 encoding of U+00A9, which is the two bytes C2 A9. To complete our comedy, the CGI handler changes the Content-type header to match what the CGI program says it is (ISO-8859-1). It does this to avoid garbling the CGI output, which is, in fact, more important to get right than the copyright symbol in the footer of the page. In ISO-8859-1, the bytes C2 A9 are the two characters Â©.
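The whole chain can be reproduced in a few lines of Python. This is only an illustration of the byte-level mechanics, not what ASP.NET does internally: html.unescape stands in for the skin parser, and the function names are invented for the example.

```python
import html


def render_utf8(markup: str) -> bytes:
    """Resolve entities to characters, then write the text out as UTF-8."""
    text = html.unescape(markup)  # "&copy;" becomes the character U+00A9
    return text.encode("utf-8")   # U+00A9 becomes the bytes C2 A9


def read_as_latin1(body: bytes) -> str:
    """Decode bytes the way a browser would if told charset=ISO-8859-1."""
    return body.decode("iso-8859-1")


body = render_utf8("&copy; 2009 SEP")
print(body[:2].hex())        # c2a9
print(read_as_latin1(body))  # Â© 2009 SEP
```

Every byte is a valid ISO-8859-1 character, so the browser has no way to notice that anything went wrong. It just shows two characters where one was intended.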

The quick fix was to change the &copy; to &amp;copy; in the skin file so that ASP.NET actually renders &copy;. The complete fix will be either to align the encoding used by ASP.NET and CGI, or to modify the CGI handler to translate the CGI output from ISO-8859-1 (or whatever encoding it’s using) to UTF-8.
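Here is a rough Python sketch of both fixes, under the assumption that the CGI output really is ISO-8859-1. The function name transcode_to_utf8 is hypothetical, and the real handler would do the equivalent in .NET and would also need to send charset=utf-8 in the header.

```python
def transcode_to_utf8(body: bytes, source_charset: str = "iso-8859-1") -> bytes:
    """Re-encode CGI output so it can be served alongside UTF-8 page chrome."""
    return body.decode(source_charset).encode("utf-8")


# Quick fix: the entity is pure ASCII, so it reads the same under either charset.
entity = b"&copy;"
print(entity.decode("utf-8") == entity.decode("iso-8859-1"))

# Complete fix: an ISO-8859-1 byte such as E9 (e with acute) becomes C3 A9.
cgi_output = b"caf\xe9"
print(transcode_to_utf8(cgi_output).hex())
```

The quick fix works because it sidesteps the encoding question entirely. The transcoding fix is the one that also protects any other non-ASCII text in the page.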

I built version 0.9.1 of git-tfs today. The notable changes are

  1. It should work seamlessly with the TFS client libs that come with VS2008 and VS2010.
  2. It has a new “quick-clone” command.

The quick-clone command is used exactly like clone. The difference is that, while clone will chug for hours trying to get an exact replica of all of the changesets in the TFS repository, quick-clone will just grab a snapshot from TFS.

Look for it on the downloads page at GitHub.

I’m working on a prototyping project that has been going on for a few years. This year, its source got migrated to git. For the last month or two, all of the interesting action has been in one subdirectory of the repository. We wanted to split the work off into another repository that didn’t have all the old cruft. It wasn’t too hard.

To do it, I took advantage of git’s internal structures. Conceptually, I did the opposite of a subtree merge… so, it was a subtree extract. Our subdirectory has always been in the same place, so the combination of git log HEAD -- [subtree] and git ls-tree [commit] [subtree] got me a list of commits and the tree IDs for the subtree I was extracting. From there, I used commit-tree to build up the new history for the tree.
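To make the idea concrete, here is a simplified Python sketch of that loop. It is not extract_subtree.rb: the function names are invented, it builds a strictly linear history (merges are flattened), and it keeps only the commit messages, not the original authors or dates. The git runner is passed in as a parameter so the logic can be exercised without a repository.

```python
import subprocess
from typing import Callable, List, Optional

Git = Callable[[List[str]], str]


def run_git(args: List[str]) -> str:
    """Run git in the current directory and return its stdout."""
    result = subprocess.run(["git", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def subtree_id(git: Git, commit: str, subtree: str) -> Optional[str]:
    """Return the tree ID of `subtree` at `commit`, or None if it is absent."""
    # Without a trailing slash, ls-tree prints the entry for the directory
    # itself, formatted as "<mode> <type> <object>\t<path>".
    out = git(["ls-tree", commit, subtree.rstrip("/")]).strip()
    if not out:
        return None
    meta, _path = out.split("\t", 1)
    _mode, obj_type, obj_id = meta.split(" ")
    return obj_id if obj_type == "tree" else None


def extract_subtree(git: Git, subtree: str) -> Optional[str]:
    """Replay the subtree's history as new commits; return the new tip."""
    # Oldest first, limited to commits that touched the subtree.
    commits = git(["log", "--reverse", "--format=%H", "HEAD", "--", subtree]).split()
    parent: Optional[str] = None
    for commit in commits:
        tree = subtree_id(git, commit, subtree)
        if tree is None:
            continue  # the subtree did not exist at this commit
        message = git(["log", "-1", "--format=%B", commit]).strip()
        args = ["commit-tree", tree, "-m", message]
        if parent is not None:
            args += ["-p", parent]
        parent = git(args).strip()  # commit-tree prints the new commit ID
    return parent


# Possible usage inside a scratch clone ("some/dir" is a placeholder path):
#   tip = extract_subtree(run_git, "some/dir")
#   then point a new branch at `tip`, for example with `git branch extracted <tip>`
```

Because the new commits reuse the existing tree objects, nothing is rewritten or copied. The new history just points at trees git already has.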

That description makes it sound like I should have had about a 5-line shell script. But there are obviously some details left out. If you want everything, check out extract_subtree.rb.

If you decide to use this script, please be careful with it. It shouldn’t destroy anything, but it might mess up your repo if something isn’t set up right. Also, this won’t deal with the .gitmodules file, so if you use submodules, you’ll need to manually build your .gitmodules file again.

If you want to know more about git’s internals, check out Scott Chacon’s Pro Git book.