octalzeroes

AWK: search and replace with incremented values

published on 21 Feb 2014, tagged with awk

A friend of mine came to me with a problem the other day. He had an HTML document that displayed lots of pictures, all with empty alt attributes. The alt attribute provides alternative text that is shown when an image cannot be displayed and is read out by screen readers (some older browsers also showed it as a tooltip on hover). What he wanted was to replace each empty attribute with an incrementing number, starting at 001 and continuing with 002, 003 and so on. I figured awk would be the best tool for this, since I needed to keep a counter while processing the file.

Instead of using the actual file he supplied, let's create a quick mockup similar to the one I did the initial work on:

<!doctype html>
<html>
<head>
    <title></title>
</head>
<body>
    
<ul class="gallery">
    <li><a class="colorbox" href="images/test213.jpg"><img src="images/thumbnails/test213.jpg" alt=""></a></li>
    <li><a class="colorbox" href="images/test158.jpg"><img src="images/thumbnails/test158.jpg" alt=""></a></li>
    <li><a class="colorbox" href="images/test2.jpg"><img src="images/thumbnails/test2.jpg" alt=""></a></li>
    <li><a class="colorbox" href="images/test6.jpg"><img src="images/thumbnails/test6.jpg" alt=""></a></li>
    <li><a class="colorbox" href="images/test90.jpg"><img src="images/thumbnails/test90.jpg" alt=""></a></li>
</ul>
    
</body>
</html>

The filenames were of no use. Had they followed a logical numbering, I could probably have used them as the source for the alt values. Instead I needed a counter that I would increment each time I came across one of these lines.

The one thing these lines have in common, and that appears nowhere else in the file, is class="colorbox". With this piece of information we can make sure that we only match lines that contain the word colorbox. The AWK syntax to match these lines and perform some action on them is simply /colorbox/ { ... }
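To see the pattern on its own before adding any substitution, here is a minimal sketch that feeds three lines to awk and prints only the one containing colorbox, prefixed with its line number. The sample input and the shell variable names are made up for illustration; NR is awk's built-in record counter.

```shell
# Hypothetical three-line sample; only the middle line contains "colorbox".
sample='<ul class="gallery">
    <li><a class="colorbox" href="images/test2.jpg"><img src="images/thumbnails/test2.jpg" alt=""></a></li>
</ul>'

# The action block runs only for lines that match /colorbox/.
matched=$(printf '%s\n' "$sample" | awk '/colorbox/ { print NR ": " $0 }')
printf '%s\n' "$matched"
```

Only the second line should come back, which confirms that the pattern skips the surrounding ul tags.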

What I ended up with was the following:

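#!/usr/bin/awk -f

# Runs only for lines containing "colorbox".
/colorbox/ {
  # i is stored as a string such as "001"; ++i converts it back to a number first.
  i = sprintf("%03d", ++i)
  sub("alt=\"\"", "alt=\""i"\"")
}; 1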
  • match lines containing the text "colorbox" and open a block
    • increment the variable i and format it as a zero-padded three-digit number (the first time the value will be 001)
    • replace alt="" with alt="[value of i]"
  • end the block and add ; to separate it from the next rule
  • awk processes one line at a time, and 1 is shorthand for {print}: a pattern that is always true, combined with the default action of printing the current line. Since it sits outside the block, it prints every line, including the ones where the substitution was performed
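The two less obvious parts of the script, the reused counter variable and the trailing 1, can be tried in isolation. The sketch below uses throwaway input (the letters a, b and c) and shell variable names invented for this example; it is not part of the original solution.

```shell
# Same counter idiom as the script: i holds "001", "002", ... between lines,
# and ++i converts that string back to a number before incrementing.
labels=$(printf 'a\nb\nc\n' | awk '{ i = sprintf("%03d", ++i); print i }')
printf '%s\n' "$labels"

# The trailing 1 is an always-true pattern with no action, so every line
# is printed, including the one the first rule modified.
printed=$(printf 'a\nb\nc\n' | awk '/b/ { sub("b", "B") }; 1')
printf '%s\n' "$printed"
```

The first command prints 001, 002 and 003 on separate lines; the second prints a, B and c, showing that unmatched lines pass through untouched.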

Let's run the script and write the output to a new file:

% awk -f script.awk gallery.html > gallery2.html

And the result:

<!doctype html>
<html>
<head>
    <title></title>
</head>
<body>

<ul class="gallery">
    <li><a class="colorbox" href="images/test213.jpg"><img src="images/thumbnails/test213.jpg" alt="001"></a></li>
    <li><a class="colorbox" href="images/test158.jpg"><img src="images/thumbnails/test158.jpg" alt="002"></a></li>
    <li><a class="colorbox" href="images/test2.jpg"><img src="images/thumbnails/test2.jpg" alt="003"></a></li>
    <li><a class="colorbox" href="images/test6.jpg"><img src="images/thumbnails/test6.jpg" alt="004"></a></li>
    <li><a class="colorbox" href="images/test90.jpg"><img src="images/thumbnails/test90.jpg" alt="005"></a></li>
</ul>

</body>
</html>
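For a longer file, reading through the output by eye gets tedious. One possible sanity check, sketched here under the assumption that every empty alt sits on a colorbox line, is to compare line counts and count the empty alt attributes that remain. The snippet recreates the script and a two-image stand-in for gallery.html in a temporary directory so it can be run anywhere; the stand-in content and the variable names are hypothetical.

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Recreate the script from the article.
cat > script.awk <<'EOF'
/colorbox/ {
  i = sprintf("%03d", ++i)
  sub("alt=\"\"", "alt=\""i"\"")
}; 1
EOF

# A two-image stand-in for gallery.html (hypothetical content).
cat > gallery.html <<'EOF'
<ul class="gallery">
    <li><a class="colorbox" href="images/test213.jpg"><img src="images/thumbnails/test213.jpg" alt=""></a></li>
    <li><a class="colorbox" href="images/test158.jpg"><img src="images/thumbnails/test158.jpg" alt=""></a></li>
</ul>
EOF

awk -f script.awk gallery.html > gallery2.html

# Line counts should match, and no empty alt attributes should remain.
before=$(wc -l < gallery.html | tr -d ' ')
after=$(wc -l < gallery2.html | tr -d ' ')
empty=$(grep -c 'alt=""' gallery2.html || true)
printf 'lines: %s -> %s, empty alts left: %s\n' "$before" "$after" "$empty"
```

If the number of empty alts is not zero, some image line lacks the colorbox marker and the pattern would need widening.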