jjgod / blog Random notes & thoughts.

opf-cc: epub 和 mobi 的自动简繁转换工具

写一个自动转换 epub 或者 mobi 格式文件的中文简繁体的工具是我一直想做的事情,因为有不少格式内容精美的书籍资源都只有繁体中文版本,而我又不习惯长篇阅读繁体,所以常常要手工转换再用 calibre 封装,不胜其烦,相信其他人也有类似需求。

上周末有空,就用 Python 写了 opf-cc 这个项目,是 Open Packaging Format Chinese Conversion 的缩写,因为 epub 和 Amazon 采用的 mobi 都只是封装方式,具体的文件布局都是按照 OPF 规范的。这里说说实现的思路。

简繁转换这个问题相对好解决,有现成的 OpenCC 在处理多繁一简或者多简一繁的问题上已经很完善了,所以就稍微修改了一下 OpenCC 的代码直接拿来用了,修改都作为 pull request 已经提交到上游了。

解包 epub 比较简单,因为 epub 实际上就是 zip 压缩包,所以用 Python 的 zipfile 模块直接就可以解压。mobi 的解包稍微麻烦一些,如果不用 calibre 那一套庞大的库,mobiunpack 就是最好的选择。

解包后需要找到应该转换的文件,比较麻烦的地方是有的目录中 href 到的文件名本身就是繁体,如果直接整个目录文件一起转换,就得把文件也对应改名,比较麻烦,这里我尝试用 lxml 来解析目录文件,挑出文本来调用 OpenCC 的 Python 模块进行转换,对于 href 属性的内容则不转换。

重新打包 epub 也简单,用 OS X 的 zip 工具一压就可以了 (更新: fishy 提供了不依赖单独 zip 工具而是直接用 Python 的 zipfile 的实现),但 mobi 的打包比较麻烦,要么用 calibre 要么用 Amazon 提供的 KindleGen,好处是下载安装一个二进制程序就可以了,坏处是生成的文件大小要比原来的文件大一倍有余,calibre 就没这个问题 (更新: 经 byelims 推荐使用 kindlestrip 来处理 KindleGen 生成的文件,可以去掉冗余的数据)。

总的说来这个项目还有不少可以改进的地方,除了上述两点以外,还有可以加入简体转繁体的功能,也就是给 OpenCC 传一个不同的参数的事而已。不过我设想该有的功能都已经有了,具体应用的时候遇到什么问题再拿来改进。

另外有兴趣的朋友可以提供更方便的封装,比如用 Automator 或者 ThisService 做成 OS X 的 Service,就可以直接在 Finder 里选中文件右键点击转换了。

My new job

In my previous post I talked about leaving Nokia and the Qt community. So what am I joining? Turned out I’m staying in Oslo for Opera Software. Why? There are a few reasons.

I have worked in the new Opera office for more than a month and so far it has been a really great experience. The work is fast pace, challenging and my colleagues are friendly. The best thing so far is we have free beers on every Friday :) I will probably write again about my job after a few months and tell you more.

< /troll >

Two years ago, I started my first job at Nokia, Qt Development Frameworks. Originally planned to become a Mac developer, I ended up working on the text layout and font rendering part of Qt. Not exactly carried out what I wanted to do, it is still a fantastic job with many good memories. Some of my favorite parts of this job are:

Things that didn’t work that well includes:

Anyway, after these wonderful two years, I think it’s time to move on. I want to work on something else now. So I didn’t join most of my colleagues transferring to Digia, but I sincerely think that they have a very good chance of succeed and making “Qt Everywhere” more true than it is. It has been a great ride and if anyone is looking for a job and Digia is hiring, I can recommend it without any hesitation.

I will try to cover what’s next for me in the next post, stay tuned 😉

Thoughts on Learnable Programming

After reading Bret Victor’s new essay Learnable Programming and some of the responses, the idea keep lingering in my head.

While his famous lecture Inventing on Principle tackled the idea of making new principles, people are mostly interested in applying that visualization method on IDEs and such. Although most of these efforts are largely experimental, I still find it harsh to dismiss them and consider them as a failure. Live coding is not the point of Victor’s lecture or essay, but it is still immensely better than a dumb editor and cold console that I used to use to learn programming.

(My girlfriend, who majored in landscape design, learned most JavaScript concepts with the rudimentary web editor from Codecademy. She found it engaging and helpful. I doubt she will find it that way if she was given a Unix console and gcc, checking a.out output every time a change was made.)

In the area of learning a new way of thinking, things as simple as live coding can be useful.

What’s more important is how do we progress after grasping basic concepts like iteration and function? While most people think learning programming as a one time effort, I have programmed for more than 10 years and I’m still constantly learning.

Malcolm Gladwell’s Outliers brought us the concept of 10,000 hours of practice. He told us that Bill Joy became a master after practicing programming for 10,000 hours. But what exactly has Bill Joy done in his practice is unknown to us. We all knew that Bill has coded the vi editor and some of the BSD network stack. Did he start practicing from writing a smaller subset of vi or BSD?

The essential question here is how do we learn higher levels of abstraction and design systems more complicated.

First, I personally believe that we need to understand the details to do higher level designs. You won’t be able to design an efficient system without knowing the implementation costs and performance impact of critical code paths. That’s why I usually choose to dive to the bottom and solve/evaluate a few critical issues before looking at the big picture.

However, human mind is not a precise machine that can assemble low level details into high level constructs. After mastering the details sometimes we need to take it away, clear our minds to see a bigger picture, otherwise our heads will be occupied by low level details and have no time to abstract. Contradicted as it is, visualization becomes a really useful tool to hide the details for us.

With visualization as simple as a pie chart, we will suddenly able to free our minds, stand on our previous results and reach higher fruits. That, I think, is the real value of Bret Victor’s idea. What I disagree with him is that his idea doesn’t have to be a groundbreaking rule to revolutionize our ways of programming, the existing tools and IDEs can still equip that as a weapon, to help us tackle even more complex problems.

Imagine code visualization as an amplifier. Without that it is possible to work on certain issues, but with that you are suddenly able to work on harder issues in a more efficient manner.

How to make that really work? Let’s start from small. Take your favorite IDE for example, Xcode can have built-in visualizer for native Cocoa classes such as NSString, NSArray, NSColor and NSFont. The code to show them is already there, just need to find a way to present. Qt Creator can have visualizer for QString, QTransform, QColor and QFont, etc.

For visualizing the states of a program, it works just like our existing debugger with variable watching feature. Except that is presented in a structural way, specific presentation can be given by the programmer himself, just like writing a test case. We can have a drag-and-drop layout interface to do that.

Some of Bret Victor’s ideas (like parameter tagging) can already be achieved with a static analyzer such as clang. Others can only be done in execution. How to speed up execution to avoid setting up a lot of states is the most interesting problem. I think we can try to reduce the amount of input and take an independent block of code (functions without side effects) for examination in an isolated sandbox environment, supplying the required parameters in a simpler way just like preparing for test data sets.

In summary, I still expect that this visualization tool as a programming assistant. An assistant that can take a piece of code, walk through its execution and explain it to me how it works. Even if can only tell how does one variable transits in a loop, it will still be much better than me evaluating it by hand with paper and pencil. In short, hand the repeating tasks over to your machine and focus on something more creative, that’s what we humans are good at.


相比熟知的中西欧国家,北欧对于国内的朋友较为陌生,比如我有的朋友从来记不住我去的是挪威,分不清瑞典与瑞士的也大有人在。不独国人,据一位美国同事说,他的同乡一直以为他去的是南卡罗来纳周的 Norway。近期见诸媒体的有一些关于北欧的报道,比如南方周末的这篇《高物价高税负,丹麦为何走得比美国好》和纽约时报的《Investors Seek Out Safer Shores》也算是管中窥豹。来挪威已经两年,别的北欧国家不敢说,至少对于挪威可以略谈谈我自己了解到,而国内甚少报导的部分,权作参考。


← Before After →