
Sunday, 23 June 2013

Syntax highlighting with TeX4ht

When I was evaluating different options for developing this blog, I spent some time on TeX4ht. Although I didn't choose this technology in the end, I found it very interesting and I would like to share its strengths.

One of the important aspects of any programming blog is how it displays source code snippets. As always, there is no single answer. Some people just wrap their code in <pre> and <code> tags. Others care more about the appearance of their posts and highlight the syntax according to the programming language they use. I wanted the code I share to look good. That's why I turned my attention to TeX4ht.

Listings package

In LaTeX there is a listings package which can be used to format source code. It offers an environment similar to verbatim but with many parameters to customize the output.

This is an example of how one can add a code block to a LaTeX article.

\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}

\usepackage{listings}
\lstset{
    language=[Sharp]C,
    basicstyle=\ttfamily\small,
    identifierstyle=\sffamily,
    keywordstyle=\sffamily\bfseries,
    commentstyle=\rmfamily,
    stringstyle=\rmfamily\itshape,
    numberstyle=\scriptsize,
    showstringspaces=false,
    tabsize=2,
    numbers=left,
}

\begin{document}
\begin{lstlisting}[float, caption={Sample code}]
class Program
{
    static void Main(string[] args)
    {
        // This is comment
        var text = "This is text";
        var number = 12345;
        Console.WriteLine(text + number.ToString());
    }
}
\end{lstlisting}
\end{document}

Once compiled to PDF it looks very nice. Even though everything is black and white, every part of the code has its own distinct style.

TeX4ht

The LaTeX document from the previous listing can be compiled to HTML using TeX4ht with the following command:

>>htlatex Sample.tex

Unfortunately, the output produced by default is not as pretty as the PDF. The fonts keep their styles but the code is no longer aligned. There is no space between the line numbers and the text. Comments are not aligned with the rest of the code.

Listing 1: Sample code
1class Program 
2{ 
3    static void Main(string[] args) 
4    { 
5        // This is comment 
6        var text = This is text; 
7        var number = 12345; 
8        Console.WriteLine(text + number.ToString()); 
9    } 
10}

The listings package supports four column alignment modes (fixed, flexible, spaceflexible and fullflexible). By default it uses the fixed mode, where each character is a single unit of output and the units are aligned in columns. This mechanism does not carry over to HTML. To achieve a similar effect there, one should use a monospace font. However, this has its own problems, because in LaTeX it corresponds to the typewriter (\ttfamily) font, which leaves little room for styling; bold keywords, for example, are lost.

As I mentioned in the previous post, the best solution I found was on StackExchange.

Instead of trying to force TeX4ht to produce different styles for a listing generated with the listings package, it is easier to override the styles used in the output. For this to work, every style used in the listings (e.g. basicstyle, identifierstyle, ...) must use a different font. If you look at the lstset definition in the first listing, you will see that it satisfies this requirement.

The next step was to define the CSS configuration. To do that, I used Internet Explorer Developer Tools to select elements and capture their classes. Then I was able to create a private configuration file for TeX4ht.

\Preamble{html} 
\begin{document} 
  % basicstyle
  \Css{div.lstlisting .cmtt-10 {font-family:monospace; color:DimGray}} 
  % identifierstyle
  \Css{div.lstlisting .cmss-10 {font-family:monospace; color:Black}} 
  % keywordstyle
  \Css{div.lstlisting .cmssbx-10 {font-family:monospace; color:Blue}} 
  % commentstyle
  \Css{div.lstlisting .cmr-10 {font-family:monospace; color:Green}} 
  % stringstyle
  \Css{div.lstlisting .cmti-10 {font-family:monospace; color:DarkRed}} 
  % numberstyle
  \Css{div.lstlisting .cmr-8 {display:inline-block; width:20px}} 
\EndPreamble 

Please note the custom style for the div.lstlisting block. This wasn't mentioned on StackExchange, but it is required for the line numbering to work.

To include the configuration file, I used a slightly modified command line:

>>htlatex Sample.tex Sample.cfg

Finally, it all worked. The generated listing has line numbers, all the syntax elements are highlighted, and everything is aligned exactly as in the source code.

Listing 1: Sample code
1  class Program
2  {
3      static void Main(string[] args)
4      {
5          // This is comment
6          var text = This is text;
7          var number = 12345;
8          Console.WriteLine(text + number.ToString());
9      }
10 }

This post with all the resources is available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master

Sunday, 16 June 2013

Blog development plan

I'm quite new to blogging but I know a lot about software development. Are these two activities that different? They look alike to me:

  • You write down your ideas in a language of your choice.
  • You need to adhere to some rules like grammar.
  • When a post is done it is pushed to a public site.

Hey, that's exactly what software developers do all the time!

Let's have a look at how this idea can be put to use.

Syntax

One of the most important decisions is the syntax used to write posts. Ideally, the syntax should be very light so that the writer can focus on the content. Additionally, it should offer some way to organize the document.

I've taken into consideration the following options:

  • HTML
  • LaTeX + TeX4ht
  • Markdown

HTML

I know that my posts will need to be converted to HTML at some point, so why not just start with it? This language has great tooling, including WYSIWYG editors. Moreover, it is the most powerful option. With pure HTML I should be able to write anything I like.

The only problem is that it can sometimes be hard to read, with many different tags obscuring the picture. This is particularly visible when it comes to embedded source code. The problem is even worse because in HTML there are two characters that demand special treatment: < and &. Left angle brackets are used to start tags, whereas ampersands are used to denote HTML entities. To use them as literal characters, it is necessary to escape them as entities, e.g. &lt; and &amp;. Even if they are inserted by an editor, they will exist in the source, making it harder to read.
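To make the escaping concrete, here is a minimal sketch using Python's standard html module. The helper name escape_for_html and the sample snippet are made up for illustration; they are not part of the original workflow.

```python
import html


def escape_for_html(source: str) -> str:
    """Escape &, < and > so the text can be placed inside <pre><code>."""
    # quote=False leaves " and ' untouched; they only need escaping
    # inside attribute values, not in element content.
    return html.escape(source, quote=False)


# Hypothetical snippet containing the characters that need special treatment.
snippet = "if (a < b && b > 0) { return a; }"
escaped = escape_for_html(snippet)
print(escaped)
```

The escaped form is exactly what ends up in the HTML source, which is why hand-written HTML posts with a lot of code become hard to read.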

LaTeX + TeX4ht

An alternative solution, which should be attractive to all academics, is LaTeX. With tools like htlatex it is possible to compile documents to HTML.

I studied at a university where using LaTeX was not mandatory, but most of the instructors preferred this format. Therefore, I learned how to write LaTeX articles before I grasped HTML. I still had the right environment, a portable distribution of MiKTeX with some of my favourite packages. To evaluate this option, I created a sample post which contained some source code listings and images.

During that process I found a few resources which were particularly helpful.

Although it worked, it wasn't easy. The biggest problem I had was with the source code formatting. The listings package produced beautiful listings in a PDF, but when I ran tex4ht they all looked much worse. Fonts no longer had constant width and nothing was aligned as it should be. It turned out that listings is quite advanced and has its own algorithm for organizing everything into columns. That is how alignment works with any font you use, but this solution didn't work for HTML.

I fixed it by changing the base font to typewriter (\ttfamily). Everything was aligned again but I lost the bold style of the keywords. It made me think about using colors instead. After all, this document won't be printed!

I found a very interesting answer on StackExchange.

This is the key idea:

Imho a more simple approach is with fonts: if every style is connected to a different font then tex4ht surrounds the chars with classes which you can set through css.

-- Ulrike Fischer

It worked like a charm. I still had to compile the document a few times and inspect the HTML to capture all the class names I needed to use in the CSS, but it was fast. Eventually I came up with a nicely colored source code listing.
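The inspection step could also be scripted. The following is only a sketch under assumptions: it supposes that the generated page wraps each listing in a div with class lstlisting and marks styled runs with span elements, as the CSS selectors in the configuration suggest. The sample markup is hand-written for illustration and is not actual TeX4ht output.

```python
from collections import Counter
from html.parser import HTMLParser


class ListingClassCollector(HTMLParser):
    """Count class names of <span> elements inside <div class="lstlisting">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # greater than 0 while inside a listing div
        self.classes = Counter()  # class name -> number of occurrences

    def handle_starttag(self, tag, attrs):
        class_names = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Enter a listing div, or track a div nested inside one.
            if self.div_depth or "lstlisting" in class_names:
                self.div_depth += 1
        elif tag == "span" and self.div_depth:
            self.classes.update(class_names)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def collect_listing_classes(html_text):
    parser = ListingClassCollector()
    parser.feed(html_text)
    parser.close()
    return parser.classes


# Hand-written sample that only imitates the assumed structure.
sample = (
    '<div class="lstlisting" id="listing-1">'
    '<span class="label"><span class="cmr-8">1</span></span>'
    '<span class="cmssbx-10">class</span> <span class="cmss-10">Program</span>'
    '</div>'
    '<span class="cmr-10">outside the listing</span>'
)
classes = collect_listing_classes(sample)
print(sorted(classes))
```

Each class name reported this way is a candidate for a \Css rule in the configuration file; spans outside the listing are ignored.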

In summary, the experience wasn't bad but I had a feeling that I had to search for solutions and workarounds too often. I decided to look for something different.

Markdown

The third option that I took into consideration is Markdown. Initially I had been using it on StackExchange without knowing its name. I would just write a question or an answer and discover the syntax by accident, typing and looking at the 'Preview' section. That was possible because of the philosophy it was created with.

Philosophy

Markdown is intended to be as easy-to-read and easy-to-write as is feasible.

Readability, however, is emphasized above all else. A Markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions.

-- John Gruber

One of the biggest advantages of this syntax, which was probably also appreciated by the creators of StackOverflow, is the code block. You can just paste your code, indent it by 4 spaces or 1 tab and it will be properly formatted. No tags, no commands, just indentation - perfect!

Most of the code I write is in C#. Because classes are defined inside a namespace block, they are indented by default, so no extra action is needed. The code can be copied as is and it will be converted into HTML.
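For code that is not already indented, the same rule can be applied mechanically. This is a small sketch using Python's standard textwrap module; the function name as_markdown_code_block is hypothetical.

```python
import textwrap


def as_markdown_code_block(code: str) -> str:
    """Indent code by 4 spaces so Markdown treats it as a code block."""
    # Remove any common leading indentation first, so every block ends up
    # with exactly 4 extra spaces regardless of where it was copied from.
    dedented = textwrap.dedent(code).strip("\n")
    return textwrap.indent(dedented, "    ")


# Hypothetical top-level snippet with no namespace indentation.
code = "class Program\n{\n    static void Main() { }\n}\n"
block = as_markdown_code_block(code)
print(block)
```

Relative indentation inside the snippet is preserved; only the left margin changes.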

This was one of the reasons why I selected Markdown as the language for my blog.

Version Control

I believe that a version control system plays a very important role in every software project. Whenever I start something new and I know that it will stick around for longer than a day, I create a repository. A blog definitely falls into this category. The decision about which VC system to use was quite easy. For all my private work I use Git. I've got all the tools installed and an account on GitHub to back up my repository.

With a public repository there is theoretically a chance that somebody will send me a pull request to fix something in a post, but I don't think this will happen. Not because the code I write is flawless, but because a blog is a personal thing. Nevertheless, contributions are more than welcome.

Conclusions

In summary, I will write all my posts using Markdown syntax. They will be under version control and available in two different places. The sources will be stored on GitHub and the HTML version will be published on Blogger.