Optimizing SHA-1 Performance on OS X

At work we need to do some SHA-1 verification on application startup, naturally we want it to run as fast as possible — no user likes their browser bounce too many times on the dock before showing up.

The initial implementation is extremely naïve one, yet quite portable. It’s base on Chromium’s (which our code base is built upon) portable SHA-1 implementation: we first read the entire file into a std::string, then pass this string to SHA1HashString(), job done. Simple, right?

Unfortunately, for the files we have it will take at least 280 milliseconds (61 MB/s) on a rather quick (2.6 GHz Core i7) desktop machine, among which about 30 ms was spent on file reading and the rest is on SHA-1. Needless to say, the memory footprint is quite big given the file sizes we have.

First step of using any other SHA-1 function would be decouple the ReadFileToString() function into normal stdio file reading routine:

FILE* file = fopen(path, "rb");
char buf[1 << 16];
size_t len;
// SHA-1 context initialization.
while ((len = fread(buf, 1, sizeof(buf), file)) > 0) {
// SHA-1 context update.
}
fclose(file);
// SHA-1 context done and get the value.

By doing this we can already save quite a lot of time in std::string concatenation, now the time spent on file reading is down to 5 ms. There is not much room for further improvement, let’s see how we can improve the SHA-1 performance.

A well known fast SHA-1 implementation is the one written by Linus Torvalds for GIT, it’s called block-sha1. The code is derived from the Mozilla NSS library but Linus claimed that he has rewritten it entirely. This performed indeed quite well, the actual calculation time is now down to 60 ~ 65 ms (246 MB/s).

However block-sha1 license is not entirely clear for our closed source use, Linus said that he wouldn’t mind to license it as MPL but so far we still consider it’s licensed as rest of GIT since it resides in its repository. We obviously can’t link to GPLv2 licensed code in closed source software.

Another alternative is Steve Reid’s SHA-1 implementation in C, which is completely public domain code. It performances quite fast as well, around 88 ms for us, which equals to 181 MB/s.

Now that we have a good backup plan, I tried the SHA-1 implementation from Mozilla as well, it takes almost as long as Steve Reid’s implementation, no real improvement here but there is not much burden in license for us either.

I would like to try OpenSSL’s implementation as well, since according to Improving the Performance of the Secure Hash Algorithm (SHA-1), it has more optimized implementation for Intel SIMD instructions (SSE3). However I didn’t managed to get the project build with OpenSSL due to some other complications. Also because I managed to find a better alternative than fiddling with OpenSSL’s fragile API: the CommonDigest library from Apple. It performed much better than block-sha1: only takes 40ms (400 MB/s). From the source released, Apple seemed to be using the cross-platform OpenSSL implementation here as well, but it still would be nice to see how does it compare with OpenSSL library shipped with the system. I will try to post some results in upcoming days.

Preserving Extended Attributes on OS X

When codesigning a Mach-O file (OS X executables or libraries), the signature information will be stored in the file itself through some Mach-O extension. When codesiging a bundle (.app or .framework), _CodeSignature directory will be created. But what happens when you codesigning a plain text file? Signature information will be stored in extended attributes. Because of that, when packaging or copying files like those, you would expect the tools to preserve extended attributes. Not all of them do that by default.

tar on OS X preserves extended attributes by default, both archive and unarchive. But zip doesn’t, a better replacement is ditto -k, ditto can be used as a replacement for cp as well, though cp in OS X preserves extended attributes by default.

When using rsync, -E or --extended-attributes will make sure it copies extended attributes.

When creating a dmg with hdiutil, keep in mind the makehybrid command will lost extended attributes, so you will have to use alternative ways.

codesigning

Codesigning is one of the worst issues we had been having since we started working on the new Opera for Mac. How Apple managed to screw this up never ceased to amaze us.

Since yesterday morning our build servers started to get CSSMERR_TP_NOT_TRUSTED error while code signing the Mac builds. Well, we didn’t notice until trying to release the new Opera Next build in the afternoon, which is obviously a bad timing. When it happened, immediate reaction was search for it in Google, unfortunately, when it happened words haven’t been spread yet so all results we got were from early 2009 ~ 2011, about some intermediate certificates missing, which completely mislead us. We spent a couple of hours inspecting certificates on all 3 of our Mac buildbot servers, none of them seemed wrong. One of my colleagues tried to resign a package locally with certificates/keys installed, got the same error as well.

Fortunately our build server didn’t get the same error every time so we managed to get a build for release.

When I later did a search for the same keyword but limit the results in last 24 hours, we finally found the real answer to the problem this time. According to this discussion:

Apple timestamp server(s) after all that is the problem here. If I add the --timestamp=none option, codesign always succeeds.

I have exactly the same problem. Probably Apple got two timeservers, with one broken, and a 50% chance for us to reach the working one.

And it worked for us perfectly as well. The only thing I didn’t know was whether it’s safe to release a build without requesting a timestamp (or where can we find other trusted timestamp servers).

This morning I woke up and saw this summary about yesterday’s incident.

According to Allan Odgaard (the author of TextMate):

As long as the key hasn’t expired, there should be no issue with shipping an app without a date stamp, and quite sure I have shipped a few builds without the signed date stamp.

That at least give us some confidence that if such incident happen again, it shouldn’t be a big issue to turn timestamp off.

Update: More explanations from Apple:

The point of cryptographic timestamps is to assist with situations where your key is compromised. You recover from key compromise by asking Apple to revoke your certificate, which will invalidate (as far as code signing and Gatekeeper are concerned) every signature ever made with it unless it has a cryptographic timestamp that proves it was made before you lost control of your key. Every signature that does not have such a timestamp will become invalid upon revocation.

opf-cc: epub 和 mobi 的自动简繁转换工具

写一个自动转换 epub 或者 mobi 格式文件的中文简繁体的工具是我一直想做的事情,因为有不少格式内容精美的书籍资源都只有繁体中文版本,而我又不习惯长篇阅读繁体,所以常常要手工转换再用 calibre 封装,不胜其烦,相信其他人也有类似需求。

上周末有空,就用 Python 写了 opf-cc 这个项目,是 Open Packaging Format Chinese Conversion 的缩写,因为 epub 和 Amazon 采用的 mobi 都只是封装方式,具体的文件布局都是按照 OPF 规范的。这里说说实现的思路。

简繁转换这个问题相对好解决,有现成的 OpenCC 在处理多繁一简或者多简一繁的问题上已经很完善了,所以就稍微修改了一下 OpenCC 的代码直接拿来用了,修改都作为 pull request 已经提交到上游了。

解包 epub 比较简单,因为 epub 实际上就是 zip 压缩包,所以用 Python 的 zipfile 模块直接就可以解压。mobi 的解包稍微麻烦一些,如果不用 calibre 那一套庞大的库,mobiunpack 就是最好的选择。

解包后需要找到应该转换的文件,比较麻烦的地方是有的目录中 href 到的文件名本身就是繁体,如果直接整个目录文件一起转换,就得把文件也对应改名,比较麻烦,这里我尝试用 lxml 来解析目录文件,挑出文本来调用 OpenCC 的 Python 模块进行转换,对于 href 属性的内容则不转换。

重新打包 epub 也简单,用 OS X 的 zip 工具一压就可以了 (更新: fishy 提供了不依赖单独 zip 工具而是直接用 Python 的 zipfile 的实现),但 mobi 的打包比较麻烦,要么用 calibre 要么用 Amazon 提供的 KindleGen,好处是下载安装一个二进制程序就可以了,坏处是生成的文件大小要比原来的文件大一倍有余,calibre 就没这个问题 (更新: 经 byelims 推荐使用 kindlestrip 来处理 KindleGen 生成的文件,可以去掉冗余的数据)。

总的说来这个项目还有不少可以改进的地方,除了上述两点以外,还有可以加入简体转繁体的功能,也就是给 OpenCC 传一个不同的参数的事而已。不过我设想该有的功能都已经有了,具体应用的时候遇到什么问题再拿来改进。

另外有兴趣的朋友可以提供更方便的封装,比如用 Automator 或者 ThisService 做成 OS X 的 Service,就可以直接在 Finder 里选中文件右键点击转换了。

My new job

In my previous post I talked about leaving Nokia and the Qt community. So what am I joining? Turned out I’m staying in Oslo for Opera Software. Why? There are a few reasons.

  • When I applied for a job at Nokia, Qt Development Frameworks, I also sent my resume to Opera. But their response came too late (I got a “Your background looks very interesting…” letter after 4 months), by the time I received it, I have finished my interviews at Nokia and almost decided to join them. So I joined the trolls for 2 years. But I have always been thinking what would be like to work on Opera instead. Now I got the chance.
  • I joined the trolls expecting to be a Mac developer, but as it turned out I actually focused on the other interest: typography. It’s wonderful to be one of the few typography engineers in the world, but I still want to sharpen my Cocoa skills from time to time. So now I’m actually working full time as a Mac developer for Opera.
  • Working on typography is my dream job since I was a child. But I had the fear that I was too familiar with internals of Qt thus afraid of change and learning new things. Now I got the exposure of a whole new area and have to quickly learn a lot of new things — exactly I wanted.
  • Doing framework job is a great learning experience, the code has to be so solid and stable and I get to work with many great engineers. But from time to time I wanted to work on some products that are closer to the end user, like a browser. Something that you can go to the party and tell rest of the people what you are working on. (Explaining Qt to non-tech people is not exactly my strength.)

I have worked in the new Opera office for more than a month and so far it has been a really great experience. The work is fast pace, challenging and my colleagues are friendly. The best thing so far is we have free beers on every Friday 🙂 I will probably write again about my job after a few months and tell you more.