Copyright © 1996-2019,2022 by Thomas E. Dickey


C_COUNT – C/C++ Line Counter

Synopsis

c_count counts lines, statements, and other simple measures of C/C++ source programs. It isn't lex/yacc based, and is easily portable to a variety of systems.

History

I originally wrote c_count in mid-1983, calling it lincnt after an earlier metrics utility. The current name is easier to remember.

However, this copy dates to the end of 1985: I had moved, and though I had the program on tape, I had no tape drive. So I retyped it from a listing.

Design Issues

Background

In case someone wishes to remind me, I am already aware of various code-metrics. Early on (1977-1978), when I started to evolve the notion of a complexity measure for microprocessors, I had in mind the other side, e.g., development effort. I spent some time gathering numbers to show how much effort (time and steps) was needed to develop programs. What I found was

By the way, gathering the metrics took more time than developing the programs. While it might be possible to construct an environment which did the measurements, that was out of scope. For making spot-checks and assessments, a simple tool showing progress was needed. I did that for my project outside the scope of the research, starting in 1976.

To put it another way, tools that tell how high a mountain is are useful, even if there are various methods of climbing it which differ in cost.

Fancier tools (function points, McCabe, Halstead) all have their pitfalls. For instance, I computed data for Halstead's measure and found it strongly correlated with SLOC. This link is not talking about me (and the dates given are unlikely—Halstead published his work in 1977), but the conclusion does match mine.

Simple Measures

I wrote the 1983 version of lincnt to track progress on my project. I found that I was adding about 2000 lines per week. SLOC was then not part of my vocabulary.

To me, it was obvious that the way to count C was to count the semicolons used for delimiters. True, that gives two for a for-loop. But that is a minor inconsistency. And it is simple.
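
A minimal sketch of that counting rule follows. It is not c_count's own code: the function name count_statements and the small state machine are assumptions made for illustration. It counts semicolons used as delimiters and skips any that appear inside comments, string literals, or character literals.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from c_count): count semicolons that act
 * as delimiters, skipping those inside comments, strings and char literals. */
size_t count_statements(const char *src)
{
    enum { CODE, BLOCK_COMMENT, LINE_COMMENT, STRING, CHARLIT } state = CODE;
    size_t count = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        char next = src[i + 1];     /* at worst the terminating NUL */

        switch (state) {
        case CODE:
            if (c == '/' && next == '*') {
                state = BLOCK_COMMENT;
                i++;
            } else if (c == '/' && next == '/') {
                state = LINE_COMMENT;
                i++;
            } else if (c == '"') {
                state = STRING;
            } else if (c == '\'') {
                state = CHARLIT;
            } else if (c == ';') {
                count++;
            }
            break;
        case BLOCK_COMMENT:
            if (c == '*' && next == '/') {
                state = CODE;
                i++;
            }
            break;
        case LINE_COMMENT:
            if (c == '\n')
                state = CODE;
            break;
        case STRING:
        case CHARLIT:
            if (c == '\\' && next != '\0')
                i++;                /* skip the escaped character */
            else if (c == (state == STRING ? '"' : '\''))
                state = CODE;
            break;
        }
    }
    return count;
}
```

With this sketch, a line such as `for (i = 0; i < n; i++) x += i;` counts as three: two from the loop header and one from the body, which is the minor inconsistency mentioned above.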

My associates argued about that (it was not obvious to them), and were leery that management might use it as a measure of our performance. (That was not obvious to me, but I concede it could be a hazard.)

A little later (same program, early 1985), I encountered comments by someone talking about his "great programmer". That struck me as odd, since I had read some of that person's work, and was unimpressed. Just to check, I started by running my code-counter. It surprised me, saying that the program was about 40% comments (hinting that the programmer was doing thorough work). Going back to inspect the program, I realized that my code-counter was misled. Most of the comments were asterisk characters. After adjusting the count to exclude punctuation, the counter showed less than 3% comments. I refined that measure to compare comments to code (ignoring whitespace—and of course punctuation within "comments").
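
The refined measure can be sketched as below. This is an illustration under stated assumptions, not the c_count implementation: the names char_counts, count_chars and comment_code_ratio are hypothetical, "text" in a comment is taken to mean letters and digits, and only block comments are recognized (string literals and preprocessor lines get no special handling).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical tallies; the names are illustrative, not c_count's own. */
struct char_counts {
    size_t comment_text;   /* letters and digits inside comments */
    size_t code;           /* non-whitespace characters outside comments */
};

/* Simplified scan: only block comments are recognized, and string
 * literals or preprocessor lines get no special treatment. */
struct char_counts count_chars(const char *src)
{
    struct char_counts n = { 0, 0 };
    int in_comment = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char) src[i];

        if (!in_comment && c == '/' && src[i + 1] == '*') {
            in_comment = 1;
            i++;                        /* skip the '*' of the opener */
        } else if (in_comment && c == '*' && src[i + 1] == '/') {
            in_comment = 0;
            i++;                        /* skip the '/' of the closer */
        } else if (in_comment) {
            if (isalnum(c))
                n.comment_text++;       /* rows of asterisks add nothing */
        } else if (!isspace(c)) {
            n.code++;
        }
    }
    return n;
}

/* Ratio of comment text to code; zero when there is no code at all. */
double comment_code_ratio(struct char_counts n)
{
    return n.code ? (double) n.comment_text / (double) n.code : 0.0;
}
```

Applied to the sample chunk under Good Numbers, this sketch gives 60 comment characters and 114 code characters, the same figures as the report there (a ratio of about 0.53). A comment made only of asterisks contributes nothing.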

That is a simple measurement, which gives me a figure of merit for a program. For the same programmer, there are interesting stylistic flaws which would probably require a complex measurement. For example, the program which I was reading used preprocessor macros ineffectively. It defined a report's columns as a set of constants, but did not use arithmetic expressions. That detracted from its maintainability: if one wished to change the width of a column, that would require changing all of the #define's for the columns after the altered column. On reflection, that 3% comment:code ratio told me enough about that program.
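
For contrast, here is a sketch of the macro style that avoids that maintenance problem. The column names and widths are invented for illustration and are not from the program described. Each column's start is an arithmetic expression built from the preceding column, so changing one width moves every later column automatically.

```c
/* Hypothetical report layout: each column's start is derived from the
 * previous column, so changing one width shifts all later columns. */
#define COL_NAME_START   0
#define COL_NAME_WIDTH   20
#define COL_SIZE_START   (COL_NAME_START + COL_NAME_WIDTH)
#define COL_SIZE_WIDTH   8
#define COL_DATE_START   (COL_SIZE_START + COL_SIZE_WIDTH)
#define COL_DATE_WIDTH   10
#define REPORT_WIDTH     (COL_DATE_START + COL_DATE_WIDTH)
```

Widening the name column here means editing COL_NAME_WIDTH alone. With plain constants, every start position after it would have to be edited by hand.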

There are other simple measures which help to gauge code quality. In a different analysis, I was interested in how much of a program was simply pasted in multiple places rather than factored into suitable functions. My motivation was that I was working to undo this (calling it dump-truck code) in a program which was in two parts that should have shared data. I analyzed this by stripping comments and extra whitespace, sorting the lines, and measuring the number of duplicate lines. In my project I had reduced the duplication from about 30% to less than 20%. On the other hand, another program in the project (not mine) had 46% duplication.
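
The duplicate-line analysis can be approximated as in the following sketch. It is not the original script. It assumes comments have already been stripped from the lines, and it assumes that "duplicate" means every copy of a line after the first; the names squeeze, compare_lines and count_duplicates are hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Collapse runs of whitespace to one space and trim both ends, in place. */
static void squeeze(char *s)
{
    char *out = s;
    int pending = 0;                    /* whitespace seen since last text */

    for (const char *in = s; *in != '\0'; in++) {
        if (isspace((unsigned char) *in)) {
            pending = 1;
        } else {
            if (pending && out != s)
                *out++ = ' ';
            pending = 0;
            *out++ = *in;
        }
    }
    *out = '\0';
}

/* qsort comparator for an array of char pointers. */
static int compare_lines(const void *a, const void *b)
{
    return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Squeezes and sorts `lines` in place, then returns how many lines repeat
 * the line sorted just before them; empty lines are not counted. */
size_t count_duplicates(char **lines, size_t n)
{
    size_t dups = 0;

    for (size_t i = 0; i < n; i++)
        squeeze(lines[i]);
    qsort(lines, n, sizeof(lines[0]), compare_lines);
    for (size_t i = 1; i < n; i++) {
        if (lines[i][0] != '\0' && strcmp(lines[i], lines[i - 1]) == 0)
            dups++;
    }
    return dups;
}
```

Dividing the result by the number of non-empty lines gives a duplication percentage comparable in spirit to the figures quoted above, though the exact definition used for those figures is not stated.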

Adding Features

The newer version (starting at the end of 1985) evolved over several years, as I found new issues to deal with. The change-log, by the way, shows the first check-in in March 1986. That was using the SCCS wrappers which I wrote to support the project I was working on.

Part of that project was developing and maintaining a Unix kernel driver for a networking card. The person who had started the driver had written macros with strings that lacked the terminating quote. I added an option to make c_count deal with that, rather than always accept the odd syntax. (This feature does not work with standard C).

I added some features based on suggestions by others. Most of those were in a later project (starting at the end of 1987):

Shortly after, I added SLOC to my vocabulary, along with PSS (physical source lines) and LSS (logical source lines). We had some people doing metrics, and they had their own language. I encountered this while developing a.count. Like most metrics people, these did not write programs. Rather, they made models (such as an S-curve) and occasionally collected data to validate the models. I wrote a.count to satisfy my curiosity about the project that I was working on. They learned about the program, and after much discussion requested that I modify the report, changing

Not only that, but they requested that I do the same for lincnt (as it was then called). I did that, but made it optional (-j, for "jargon"). Doing that made my code-counters part of the establishment, so to speak, and they referred to the programs in the papers they were writing.

Publishing...

I renamed the program a few years later (May 1995), having left that project and begun publishing the programs I had written on my own during the previous decade. This was around the time that the comp.sources.misc newsgroup died, as I see in my email:

From dickey Wed Jul 12 06:13:07 1995
Subject: recent postings
To: comp-sources-misc@uunet.uu.net, sources-misc@uunet.uu.net,
        comp-sources-unix@uunet.uu.net (comp.sources.unix)
Date: Wed, 12 Jul 1995 06:13:07 -0400 (EDT)

Are you guys still there?  I sent a copy of

        diffstat 1.7 comp.sources.misc (may 21, 1995)
        c_count 7.0 comp.sources.misc (may 21, 1995)

and corrected up with a message to comp.sources.unix indicating that diffstat
should be in _that_ group. Aside from the auto-reply from comp.sources.unix,
I've seen no response.

-- 
Thomas E. Dickey
dickey@clark.net
  

diffstat showed up in the index for comp.sources.unix (volume 28, ending May 23, 1995) as the 42nd of 58 entries in that volume, but c_count did not show up in either newsgroup. For what it's worth, here is a list of successful postings for programs that I worked on during that era:

At the same time, I put a copy on Sunsite.

Good Numbers

A nonobvious aspect of counting C source is what to do about inline comments. For example, in this chunk:

/* set up a buffer for this file */
bp = getfile2bp(param, FALSE, TRUE);
if (bp) {
    bp->b_flag |= BFARGS;   /* treat this as an argument */
    make_current(bp);       /* pull it to the front */
    if (!havebp) {
        havebp = bp;
        havename = param;
    }
}

there are two inline comments. Some counters ignore them, some do not. c_count does both. In showing the sum of line-types to 100%, it counts inline comments as a negative value, since those lines are already counted as code:

    10     5   |/tmp/foo.c
----------------
    10     5    total lines/statements

     3  lines had comments        30.0 %
     2  comments are inline      -20.0 %
     0  lines were blank           0.0 %
     0  lines for preprocessor     0.0 %
     9  lines containing code     90.0 %
    10  total lines              100.0 %

    60  comment-chars             22.1 %
    12  nontext-comment-chars      4.4 %
    86  whitespace-chars          31.6 %
     0  preprocessor-chars         0.0 %
   114  statement-chars           41.9 %
   272  total characters         100.0 %

    18  tokens, average length 4.83

  0.53  ratio of comment:code

     2  top-level blocks/statements
     3  maximum blocklevel
  1.89  ratio of blocklevel:code

Doing it this way accounts for all of the categories. Incidentally, the format is chosen so that roundoff is accounted for. The numbers are supposed to add up exactly. When I was developing this around 1990, I used both Sun and Apollo workstations. The latter required adjustment, since it rounded differently from Sun. It turns out that rounding problems are far less common with standard C.
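
The bookkeeping for the line-type rows can be sketched as follows. The struct and function names are hypothetical, and the rounding adjustment mentioned above is not modelled. The sketch only shows why the inline comments enter with a negative sign.

```c
/* Hypothetical record of the line-type tallies shown in the report. */
struct line_counts {
    int with_comments;    /* lines having any comment */
    int inline_comments;  /* commented lines that also contain code */
    int blank;
    int preprocessor;
    int code;
};

/* Inline comments are subtracted because those lines are already
 * counted under code, so the categories add up to the physical total. */
int total_lines(const struct line_counts *c)
{
    return c->with_comments - c->inline_comments
         + c->blank + c->preprocessor + c->code;
}

/* Percentage of the total; the caller negates it for the inline row. */
double percent_of(int part, int total)
{
    return total ? 100.0 * part / total : 0.0;
}
```

For the report above the sum is 3 - 2 + 0 + 0 + 9 = 10 lines, and the percentages 30.0 - 20.0 + 0.0 + 0.0 + 90.0 add up to 100.0.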

What Next?

There are other interesting measures that I could add to c_count. Or I could develop a different tool.

In later metrics work, I have developed different tools. For example, in 2005 I developed two different tools, but (as in c_count and a.count) kept the same general reporting style:

Both of those dealt with a dozen or so file-types.

The latter was based on the syntax-highlighters which I have developed for vile (vi-like-emacs). Generalizing from code-counting, this tool also made the same measurements for a few data file-types such as HTML and XML.

One drawback to the way in which I developed it was that I could not reuse syntax highlighters easily enough. If I were to revisit this tool, I would use vile directly by parsing the colorized output from the syntax highlighters. I added an option (-F) to vile at the end of 2009 which makes this simple.

Not Useful...

By the way, I have noticed sloccount of course, but have no use for it:

For instance, this reported on things that I maintain. It shows a lot of differences. It also overlooked things like the M4-macros for autoconf (but likely found the M4 sources which made up 3719 lines of ncurses' Ada95 binding). In the comparison below, I have omitted the autoconf-generated "configure" script and the utilities "config.guess" and "config.sub". Also (since sloccount ignored those files), I have omitted counts for the various ".in" templates.

Note: c_count of course counts only C programs, and directly shows LSS (logical source statements). I wrote a script called lex-metrics which uses the -F option of vile to compute SLOCs for the various files.

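Actual lines:

Program                 ada   ansic  awk   cpp  lex  perl  sed    sh  yacc
cproto-4.6                -    7600    -     -  766     -    -   279   761
dialog-0.9a-20001217      -    5419    -     -    -   350    -  2316     -
diffstat-1.27             -     615    -     -    -     -    -   170     -
lynx2-8-4                 -  118534    -     -    -   583  107   231     -
ncurses-5.2           11354   47489  606  3727    -   124  137  1583     -

SlocCount lines:

Program                 ada   ansic  awk   cpp  lex  perl  sed    sh  yacc
cproto-4.6                -    7600    -     -  985     -    -   279   761
dialog-0.9a-20001217      -    5321    -     -    -   350    -  3145     -
diffstat-1.27             -     616    -     -    -     -    -     -     -
lynx2-8-4                 -  116438    -     -    -   583    -   206     -
ncurses-5.2           12937   48144  552  3726    -   126  136  2323     -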

The license information shown in the report also is misleading (unsurprising given the source). The MIT-X11 license is listed as "distributable".

The filename conflict is like the other problems noted. It is customary when designing a program to avoid conflict with existing programs. c_count had been on the main Linux ftp server (sunsite.unc.edu) for six years before sloccount was released in 2001.

Changes

See the changelog for details:

Documentation


Download

There are other metrics programs, of course.

This includes programs named ccount. The one that I referred to from 1988 is not freely available, so I will not cite it here.