Unix pioneer Brian Kernighan still loves AWK after all these years


Brian Kernighan is perhaps the closest thing computing has to a living legend. He coined the name “Unix” in 1970 and is credited with pioneering work at Bell Labs, where the operating system originated. As co-author of Unix’s AWK tool, Kernighan’s name even lives on in our development environments: his last initial provided both the “k” in the AWK name – and the “K” people use when citing the iconic 1978 “K&R” book about C programming.

Earlier this month, Kernighan gave an interview to the YouTube channel Computerphile (which has 2.18 million subscribers). Talking to David F. Brailsford, professor of computer science at the University of Nottingham, Kernighan weighed in on everything from which programming language is best to his memories of AWK’s “very short development cycle” in 1977.

Describing AWK’s usefulness succinctly and clearly, Kernighan becomes almost an accidental evangelist, calling AWK “an example of what is the right tool for the job.”

And what kind of job is that? “Something you know in your heart you could probably write in one line if you had the right language. AWK is the language that lets you write it in one line, because it handles so much of the baggage you would otherwise need in another language.”

Kernighan admits that Python is the language to pick if you could only choose one for the rest of your life. But with Python, “you need to know: how do you get the input? How do you break it up into separate components? How do you write the output? All of these things happen for free in AWK, and that’s one of the reasons why AWK programs tend to be very, very short compared to programs in other languages.

“They run about the same speed as they would in Python.”
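As a concrete illustration (an example of mine, not one from the interview, using a hypothetical file name): summing the second column of a file fits in a single line, because the input loop, field splitting, and variable initialization all come built in.

    # Sum the numbers in column 2 of data.txt; AWK reads each line,
    # splits it into whitespace-separated fields, and runs the block
    # for every line automatically.
    awk '{ total += $2 } END { print total }' data.txt

The equivalent Python script would have to open the file, loop over its lines, split each one, and manage the running total itself.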

Enter Unicode

But six minutes later, Brailsford asks a telling question: is Kernighan keeping AWK under active maintenance? Kernighan says the code has been on GitHub for “a good while” now, with no official release schedule, and credits longtime Unix programmer Arnold Robbins with “most of the active work.” Robbins is also the current maintainer of the GNU Project’s version of AWK, and Kernighan describes him as “incredibly good at this kind of stuff” and “a very good friend… I think of him as the person keeping an eye on it, for the most part.” Robbins even augmented Kernighan’s own test suites for AWK.

But Kernighan didn’t completely abandon AWK development.

“It’s always been annoying that AWK only works with ASCII, or maybe 8-bit, input – it doesn’t support Unicode at all. And so, a few months ago…” Kernighan said.

Unicode is the successor to the older, far more limited ASCII character set, incorporating the world’s languages and emoji.

With a laugh of anticipation, Kernighan said, “I’ve spent some time working with this incredibly old program – and I’ve got it to the point where it will actually handle UTF-8 [a Unicode encoding] input and output, so you can have regular expressions that, you know, take Japanese characters or something. And it seems to work fine.”
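In practice (a sketch of mine, assuming a build of AWK that includes the new UTF-8 support), that means regular expressions and string functions operate on characters rather than bytes:

    # A regular expression can match multibyte Japanese text directly.
    echo "こんにちは world" | awk '/こんにちは/ { print "matched" }'

    # length() counts characters rather than bytes once UTF-8 is
    # understood; a byte-oriented awk would print 9 here, not 3.
    echo "日本語" | awk '{ print length($0) }'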

Kernighan notes that Robbins has also worked on egrep, a tool whose pattern-matching machinery is “essentially the same” as AWK’s. “The code is pretty – what’s the correct tech word? – impenetrable,” Kernighan laughs. “But luckily I was able to understand enough of it to be able to integrate the UTF and Unicode processing inside.”

Kernighan describes his updates as “sort of an intermediate release” on GitHub. But when asked if he’s still working on AWK, 45 years later, the answer is yes: “It was real work, trying to understand the old code and inserting something into it. I think I’m right, but… more testing is needed.

“The other thing I did was just a quick and dirty thing to help manage CSV input – comma-separated values. Because that was never really done, and so now, if you have reasonably simple CSV input… it will handle that fine as input. That’s basically all the development I’ve done.”
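For what it’s worth, a hedged sketch of what that looks like: released versions of both the one-true-awk and gawk later exposed this through a --csv flag (check what your local build supports), which keeps quoted fields with embedded commas intact.

    # Without CSV mode, the embedded comma would split the name across
    # two fields; with --csv, $1 is the whole quoted value.
    printf '"Kernighan, Brian",Bell Labs\n' | awk --csv '{ print $1 }'
    # prints: Kernighan, Brian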

And from there he moves on to a discussion of how programs should be tested, calling the problem “difficult.”

And the cheers of the internet

It was fun to see the reactions. “Unix legend, who owes us nothing, keeps fixing foundational AWK code,” reads a headline at Ars Technica, which marveled at the text of an email Kernighan sent to Robbins in May (in place of a longer git commit message): “Brian Kernighan said hello, asked how their visit to the United States was going, and dropped off hundreds of lines of code that could add Unicode support to AWK, the text-parsing tool he helped create for Unix at Bell Labs in 1977.”

And the post drew 360 comments – more than one expressing relief that they’re not the only ones having problems with Git. “No one understands the git cli,” wrote one commenter. “Some people just memorize more commands than others.” (This comment got 249 upvotes.)

So the geekery keeps coming. Later in the interview, Kernighan even says he’s had conversations with AWK’s two other original authors – both now in their 80s – and with publisher Addison-Wesley about whether they should update their 1988 book.

Kernighan quips that the new version would “deal with things like, ‘Well, now we can represent Unicode characters in at least a plausible way.’”

“But I think more generally the computing environment is incredibly different today than it was 35 or 40 years ago. Machines are, you know, a hundred to a thousand times faster. Memories are a million times bigger. And it changes the way you think about things.

“Before, you couldn’t afford to run AWK programs on large data, and now that’s not true. It deals with megabytes in milliseconds. And so that changes the trade-offs that you might make.”

Kernighan also couldn’t help but notice how much our tools have changed since the first version of the book in 1988 – which was written during the heyday of the Unix document formatting tool troff.

At one point, Kernighan says he still has the original files for the 1988 AWK book – saved in the PostScript format, which even predates PDF.

Publish and engage

So what else has Kernighan been up to lately?

It turns out – a lot.

Brian Kernighan turned 80 in January – and he is still publishing regularly, according to his webpage at Princeton University.

  • Last year, Kernighan published a new book exploring “the social, political and legal issues created by new technologies”. The book’s title: Understanding the Digital World: What You Need to Know about Computers, the Internet, Privacy, and Security.
  • In 2019, Kernighan also self-published a Kindle ebook titled Unix: A History and a Memoir, exploring not only the origins of Unix but “how it came about and why it matters”.
  • In 2018, Kernighan also released Millions, Billions, Zillions: Defending Yourself in a World of Too Many Numbers, which its website describes as “an essential survival guide for a world overwhelmed with big and often bad data.”
  • In 2015, Kernighan even co-wrote a book on the Go programming language for Addison-Wesley.

And of course, in 1978 Kernighan authored what has been called the world’s very first “Hello, world” program – and in 1988 he co-authored a book on the AWK programming language.

Kernighan’s interests are surprisingly eclectic. In 2020, he co-authored a paper on the real-life challenges of applying optical character recognition to 180,000 pages of court records from 1674 to 1913. The “Old Bailey Proceedings” – the official records of the Central Criminal Court of England and Wales – offered “an ideal benchmark” for testing the performance of optical character recognition on historical documents, the paper notes.

And since human transcriptions already exist for all 180,000 pages, they were able to use them to test the accuracy of three major cloud-based OCR services: Textract from Amazon Web Services, Cognitive Services from Microsoft Azure, and Vision from Google Cloud Platform.

Their results revealed that AWS had the lowest median error rate, Azure had the lowest median round-trip time, and Google Cloud Platform offered the best combination of low error rate and short processing time.

Since 2000, Kernighan has been on the computer science faculty at Princeton University, where this spring he taught a “digital humanities” course exploring how digital representations (and other technologies) are used for everything from literature, languages, and history to music, art, and religion. (“Digital humanities data is inherently messy,” explains the course description, “and considerable effort is always devoted to cleaning it up before the study can even begin.”)

Kernighan’s class promises a seminar “aimed at creating tools and developing techniques that will help humanities scholars work more effectively with their data.” That could include machine learning, natural language processing, data visualization, data cleaning, and user interface design to make these processes accessible to researchers who are new to the technology.

So maybe it was inevitable that Kernighan started thinking about Unicode characters…


Featured image via Wikipedia.
