Sunday, 30 June 2013

Syntax highlighting with Markdown and Prettify

In my last post I described how TeX4ht can be used to do syntax highlighting of source code for web publishing. Today I will tell you something about the solution I chose for my blog. In order to keep the text of my posts and the related code in a single place, I am using Markdown.

Out of the box, Markdown supports a code block syntax which wraps source code in <pre> and <code> tags. It looks nice, but colors would definitely improve readability.

class Program
{
    static void Main(string[] args)
    {
        // This is comment
        var text = "This is text";
        var number = 12345;
        Console.WriteLine(text + number.ToString());
    }
}

Google Code Prettify

Google Code Prettify is a project which can help us achieve just that. It is a JavaScript library which, when loaded, rewrites existing code sections and improves their style. One of its best features is that there is no need to specify the language, as it is detected automatically. If a snippet is too short for the detection to work you may specify the language explicitly, but most of the time it just works.

Load Prettify

In order to turn on Prettify on a site it is necessary to reference the script. It will load the CSS and JavaScript modules and look for the marked code sections to fix.

<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>

Mark code sections

Prettify will only touch code sections which were marked. Two different markers are supported.

In plain HTML it is best to add a prettyprint class to the <pre>, <code> or <xmp> element:

<pre class="prettyprint">
source code here
</pre>

If you do not have access to the <pre> tag, which is the case with Markdown, there is another way. The code needs to be preceded by a special instruction:

<?prettify?>
<pre>
source code here
</pre>

Markdown and Prettify

According to the documentation, block-level HTML elements in Markdown have to be surrounded by blank lines and should not be indented with tabs or spaces. Knowing that, one could try to enable Prettify with the following markup:

<?prettify?>

    source code

Unfortunately, <?prettify?> is not recognized by the Markdown translator and the effect is pretty far from what was intended. The Prettify instruction gets translated into an HTML paragraph and its angle brackets get escaped. Because the marker is effectively gone, the source code stays plain.

<p>&lt;?prettify?&gt;</p>
<pre><code>source code</code></pre>

We could fix it with some JavaScript which would run on page load and replace <div class="prettify"> tags, which are recognized by Markdown, with <?prettify?>. This wouldn't be too hard, but there is a much easier solution!
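
For the curious, the JavaScript workaround could look roughly like the sketch below. It is only an illustration under a few assumptions: instead of re-creating the <?prettify?> instruction it adds the prettyprint class to the <pre> that directly follows each <div class="prettify"> marker, which should have the same effect. The function name markCodeBlocks is hypothetical, and the script would have to run before Prettify scans the page.

```javascript
// Hypothetical helper: for every <div class="prettify"> marker, add the
// "prettyprint" class to the <pre> element that directly follows it.
// Returns the number of <pre> blocks that were marked.
function markCodeBlocks(root) {
  let marked = 0;
  for (const marker of root.querySelectorAll("div.prettify")) {
    const next = marker.nextElementSibling;
    if (next && next.tagName === "PRE") {
      next.classList.add("prettyprint");
      marked += 1;
    }
  }
  return marked;
}

// In a browser, run it once the DOM is ready. The guard keeps the snippet
// harmless when it is loaded outside a browser.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", () => markCodeBlocks(document));
}
```

Passing the root in as a parameter keeps the helper easy to try out on a fragment of a page rather than the whole document.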

In prettify.js, around line 883, there is a very interesting comment about how <?tag?> tags are parsed by HTML 5. The part 'nt === 8' was just what we were looking for. It turns out that in HTML 5 browsers such a tag is interpreted as a normal comment node, just like <!--?tag?-->, so both node types have to be treated the same way by the library. This is a huge deal, especially for Markdown, because comments are supported there!

    var nt = preceder.nodeType;
    // <?foo?> is parsed by HTML 5 to a comment node (8)
    // like <!--?foo?-->, but in XML is a processing instruction
    var value = (nt === 7 || nt === 8) && preceder.nodeValue;
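
To make that check more concrete, here is a minimal sketch of how a marker node could be recognized. It is not the library's code: the function name isPrettifyMarker and the regular expression are assumptions chosen for illustration. Only the node types 7 (processing instruction) and 8 (comment) come from the excerpt.

```javascript
// Node type constants as defined by the DOM specification.
const PROCESSING_INSTRUCTION_NODE = 7;
const COMMENT_NODE = 8;

// Hypothetical helper: decides whether the node in front of a code block
// is a Prettify marker.
function isPrettifyMarker(node) {
  if (node.nodeType === PROCESSING_INSTRUCTION_NODE) {
    // XML: <?prettify ...?> is a processing instruction with target "prettify".
    return node.target === "prettify";
  }
  if (node.nodeType === COMMENT_NODE) {
    // HTML 5: <?prettify?> and <!--?prettify?--> both end up as comment
    // nodes whose text is "?prettify?". Illustrative pattern only.
    return /^\??prettify\b/.test(node.nodeValue || "");
  }
  return false;
}
```

With plain objects standing in for DOM nodes, isPrettifyMarker({ nodeType: 8, nodeValue: "?prettify?" }) returns true, while an ordinary comment or a text node is rejected.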

Conclusions

In order to mark a Markdown code block for processing by Prettify, one can add a <!--?prettify?--> comment before the block. Of course, there needs to be a single empty line between the comment and the block for everything to work.

The following markup:

<!--?prettify?-->

    class Program
    {
        static void Main(string[] args)
        {
            // This is comment
            var text = "This is text";
            var number = 12345;
            Console.WriteLine(text + number.ToString());
        }
    }

produces the following result:

class Program
{
    static void Main(string[] args)
    {
        // This is comment
        var text = "This is text";
        var number = 12345;
        Console.WriteLine(text + number.ToString());
    }
}
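
The recipe can also be scripted. The sketch below builds the Markdown for a highlighted code block from raw source text. The function name toPrettifyMarkdown is hypothetical, and the code simply assumes the 4-space indentation rule for Markdown code blocks.

```javascript
// Hypothetical helper: wraps source code in the Markdown needed for
// Prettify to pick it up: the comment marker, one blank line, and the
// code indented by 4 spaces. Blank lines are left empty.
function toPrettifyMarkdown(source) {
  const indented = source
    .split("\n")
    .map((line) => (line.trim() === "" ? "" : "    " + line))
    .join("\n");
  return "<!--?prettify?-->\n\n" + indented + "\n";
}
```

Calling toPrettifyMarkdown with the C# sample used in this post yields the markup shown earlier: the marker, an empty line and the indented code.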

Although this solution relies on some undocumented features, I think that it is a reliable one. I will use it to write my blog.


This post and all the resources are available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master

Sunday, 23 June 2013

Syntax highlighting with TeX4ht

When I was evaluating different options for blog development I spent some time on TeX4ht. Although I haven't chosen this technology, I found it very interesting and I would like to share its goodness.

One important aspect of every blog about programming is how it displays source code snippets. As always, there is no single answer on how to do it. Some people just wrap their code in <pre> and <code> tags. Others care more about the appearance of their posts and highlight the syntax according to the programming language they use. I wanted the code I share to look good. That's why I turned my attention to TeX4ht.

Listings package

In LaTeX there is a listings package which can be used to format source code. It offers an environment similar to verbatim but with many parameters to customize the output.

This is an example of how one can add a code block to a LaTeX article.

\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}

\usepackage{listings}
\lstset{
    language=[Sharp]C,
    basicstyle=\ttfamily\small,
    identifierstyle=\sffamily,
    keywordstyle=\sffamily\bfseries,
    commentstyle=\rmfamily,
    stringstyle=\rmfamily\itshape,
    numberstyle=\scriptsize,
    showstringspaces=false,
    tabsize=2,
    numbers=left,
}

\begin{document}
\begin{lstlisting}[float, caption={Sample code}]
class Program
{
    static void Main(string[] args)
    {
        // This is comment
        var text = "This is text";
        var number = 12345;
        Console.WriteLine(text + number.ToString());
    }
}
\end{lstlisting}
\end{document}

Once compiled to PDF it looks very nice. Even though everything is black and white, every part of the code has its own unique style.

TeX4ht

The LaTeX document presented in the previous listing can be compiled to HTML using TeX4ht with the following command:

>>htlatex Sample.tex

Unfortunately the output produced by default is not as pretty as it was in the PDF. The fonts have their style but the code is no longer aligned. There is no space between numbers and text. Comments are not aligned with the rest of the code.

Listing 1: Sample code
1class Program 
2{ 
3    static void Main(string[] args) 
4    { 
5        // This is comment 
6        var text = This is text; 
7        var number = 12345; 
8        Console.WriteLine(text + number.ToString()); 
9    } 
10}

The listings package supports four different modes of alignment. By default it uses a fixed mode where a character is a single unit of output and characters are aligned in columns. This mechanism does not port to HTML. In order to achieve a similar effect one should use monospace fonts. However, this has its own problems because in LaTeX this corresponds to a typewriter (\ttfamily) font which cannot be styled.

As I mentioned in the previous post, the best solution I found was on StackExchange.

Instead of trying to force TeX4ht to produce different styles for the listing generated with the listings package, it is easier to override the style used in the output. For this to work, every style used in the listing (e.g. basicstyle, identifierstyle, ...) should use a unique font. If you look at the lstset definition of the first listing, you will see that it satisfies this requirement.

The next step was to define the CSS configuration. In order to do it I used Internet Explorer Developer Tools to select elements and capture their classes. Then I was able to create a private configuration file for TeX4ht.

\Preamble{html} 
\begin{document} 
  % basicstyle
  \Css{div.lstlisting .cmtt-10 {font-family:monospace; color:DimGray}} 
  % identifierstyle
  \Css{div.lstlisting .cmss-10 {font-family:monospace; color:Black}} 
  % keywordstyle
  \Css{div.lstlisting .cmssbx-10 {font-family:monospace; color:Blue}} 
  % commentstyle
  \Css{div.lstlisting .cmr-10 {font-family:monospace; color:Green}} 
  % stringstyle
  \Css{div.lstlisting .cmti-10 {font-family:monospace; color:DarkRed}} 
  % numberstyle
  \Css{div.lstlisting .cmr-8 {display:inline-block; width:20px}} 
\EndPreamble 

Please notice the custom style for the line numbers (numberstyle) inside the div.lstlisting block. This hasn't been mentioned on StackExchange, but it is required for the line numbering to work.

In order to include the configuration file I used a slightly modified command line:

>>htlatex Sample.tex Sample.cfg

Finally it all worked. The listing produced has line numbering. All the elements of the syntax are highlighted and everything is aligned exactly the same way as in the source code.

Listing 1: Sample code
1class Program 
2{ 
3    static void Main(string[] args) 
4    { 
5        // This is comment 
6        var text = This is text; 
7        var number = 12345; 
8        Console.WriteLine(text + number.ToString()); 
9    } 
10}

This post with all the resources is available on GitHub:

https://github.com/StanislawSwierc/it-is-not-overengineering/tree/master

Sunday, 16 June 2013

Blog development plan

I'm quite new to blogging but I know a lot about software development. Are these two activities that different? They look alike to me:

  • You write down your ideas in a language of your choice.
  • You need to adhere to some rules like grammar.
  • When a post is done it is pushed to a public site.

Hey, that's exactly what software developers do all the time!

Let's have a look at how this idea can be put to use.

Syntax

One of the most important decisions is about the syntax used to write posts. Ideally the syntax should be very light so that the writer focuses on the content. Additionally, it should have some capabilities to organize the document.

I've taken into consideration the following options:

  • HTML
  • LaTeX + TeX4ht
  • Markdown

HTML

I know that my posts will need to be converted to HTML at some point, so why not just start with it? This language has great tooling with WYSIWYG editors. Moreover, it is the most powerful option. With pure HTML I should be able to write anything I like.

The only problem is that it sometimes can be hard to read with many different tags obscuring the picture. This is particularly visible when it comes to embedded source code. The problem is even worse because in HTML there are two characters that demand special treatment: < and &. Left angle brackets are used to start tags whereas ampersands are used to denote HTML entities. In order to use them as literal characters, it is necessary to escape them as entities, e.g. &lt;, and &amp;. Even if they are inserted by an editor they will exist in the source, thus making it harder to read.
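
As a small illustration of that escaping rule, the sketch below replaces the two special characters with their entities. The function name escapeHtml is hypothetical. The ampersand is handled first so that the & of a freshly inserted &lt; is not escaped a second time.

```javascript
// Hypothetical helper: escapes the two characters that need special
// treatment in HTML text. "&" must be replaced before "<", otherwise
// the "&" inside "&lt;" would be escaped again.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// "if (a < b && c)" becomes "if (a &lt; b &amp;&amp; c)"
const escaped = escapeHtml("if (a < b && c)");
```

Even in this one-line example the escaped version is noticeably harder to read than the original, which is exactly the problem described above.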

LaTeX + TeX4ht

An alternative solution, which should be attractive to all academics, is LaTeX. With tools like htlatex it is possible to compile documents to HTML.

I studied at a university where using LaTeX was not mandatory, but most of the instructors preferred this format. Therefore, I learned how to write LaTeX articles before I grasped HTML. I still had the right environment, which was a portable distribution of MiKTeX with some of my favourite packages. In order to evaluate this option I created a sample post which contained some source code listings and images.

During that process I found a few resources which were particularly helpful:

Although it worked, it wasn't easy. The biggest problem I had was with the source code formatting. The listings package I used produced beautiful listings in a PDF, but when I ran tex4ht they all looked much worse. Fonts no longer had constant width and nothing was aligned as it should be. It turned out that listings is quite advanced and has its own algorithm to organize everything into columns. That's how it works with any font you use, but this solution didn't work for HTML.

I fixed it by changing the base font to typewriter (\ttfamily). Everything was aligned again but I lost the bold style of the keywords. It made me think about using colors instead. After all, this document won't be printed!

I found a very interesting answer on StackExchange.

This is the key idea:

Imho a more simple approach is with fonts: if every style is connected to a different font then tex4ht surrounds the chars with classes which you can set through css.

-- Ulrike Fischer

It worked like a charm. I still had to compile the document a few times and inspect the HTML to capture all the class names that I needed to use in the CSS, but it was fast. Eventually I came up with a nicely colored source code listing.

In summary, the experience wasn't bad but I had a feeling that I had to search for solutions and workarounds too often. I decided to look for something different.

Markdown

The third option that I took into consideration is Markdown. Initially I had been using it on StackExchange without knowing its name. I would just write a question or an answer and discover the syntax accidentally by typing and looking at the 'Preview' section. It was possible because of the philosophy it was created with.

Philosophy

Markdown is intended to be as easy-to-read and easy-to-write as is feasible.

Readability, however, is emphasized above all else. A Markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions.

-- John Gruber

One of the biggest advantages of this syntax, which was probably also admired by the creators of StackOverflow, is the code block. You can just paste your code inline, indent it by 4 spaces or 1 tab and it will be properly formatted. No tags, no commands, just indentation - perfect!

Most of the code I write is in C#. Because classes are defined inside a namespace block they are indented by default so there is no extra action needed. The code can be copied as it is and it will be converted into HTML.

This was one of the reasons why I selected Markdown as the language for my blog.

Version Control

I believe that a version control system plays a very important role in every software project. Whenever I start something new and I know that it will stick around for longer than a day, I create a repository. A blog definitely falls into this category. The decision about which VC system to use was quite easy. For all my private work I use Git. I've got all the tools installed and an account on GitHub to back up my repository.

With a public repository there is theoretically a chance that somebody will send me a pull request to fix something in a post, but I don't think this will happen. Not because the code I write is flawless, but because a blog is a personal thing. Nevertheless, contributions are more than welcome.

Conclusions

In summary, I will write all my posts using Markdown syntax. They will be under version control and available in two different places. Sources will be saved on GitHub and the HTML version will be published on Blogger.

Sunday, 9 June 2013

The first post

I consider myself a software developer. I like to get things done but I'm most productive when I'm having fun as well. That's why from time to time I come up with solutions that, at first sight, seem like overengineering. Of course they fit the problem perfectly, but it takes time to understand why. Usually I discuss them within my team if they are specific to a given project, or post something on StackOverflow. However, there is hardly any record of that. This blog will solve this problem. It will act as a single place where I'll be publishing why I think that a given solution is not overengineering.

Hence the name.

I hope you find it both entertaining and useful!