Regex questions suggestions.
JoshRountree@gmail.com

2006-02-21, 7:03 pm

I would like to go through a java source file and determine how many
lines of code there are, and how many lines of comments there are,
throwing out all the whitespace lines. I'm having a tough time getting
my regular expressions right. Any help would be greatly appreciated.


import java.io.*;
import java.util.regex.*;

public final class RegexTest {

private static String SLASHSLASHREGEX = "[\\s]*\\A(//)[.]*";
private static String SLASHSTARREGEX = "\\s*^/\\Q*\\E[.]*";
private static String STARSLASHREGEX = "[.]*\\Q*\\E/\\s*";
private static String WHITESPACEREGEX = "\\s*[^.]+";
private static String CODEREGEX = "\\s*[^\\s]+";

private static String input;
private static BufferedReader br;

private static Pattern slashSlashPattern;
private static Pattern slashStarPattern;
private static Pattern starSlashPattern;
private static Pattern codePattern;
private static Pattern whiteSpacePattern;

private static Matcher slashSlashMatcher;
private static Matcher slashStarMatcher;
private static Matcher starSlashMatcher;
private static Matcher codeMatcher;
private static Matcher whiteSpaceMatcher;

private static boolean inComment;
private static boolean slashStarFound;

private static int numLinesCode;
private static int numLinesComment;
private static int numLinesWhiteSpace;

public static void main(String[] args) {
initResources(args[0]);
processFile();
}

private static void processFile() {
try {
while ((input = br.readLine()) != null) {
slashSlashMatcher = slashSlashPattern.matcher(input);
slashStarMatcher = slashStarPattern.matcher(input);
starSlashMatcher = starSlashPattern.matcher(input);
codeMatcher = codePattern.matcher(input);
whiteSpaceMatcher = whiteSpacePattern.matcher(input);

if (slashSlashMatcher.find()) {
System.out.println("Slash Slash " + input);
numLinesComment++;
}
else if (slashStarMatcher.find()) {
System.out.println("Slash Star " + input);
numLinesComment++;
inComment = true;
slashStarFound = true;
}
else if (starSlashMatcher.find()) {
if (inComment) {
System.out.println("Star Slash " + input);
numLinesComment++;
inComment = false;
}
else {
System.out.println("Code " + input);
numLinesCode++;
}
slashStarFound = false;
}
else if (codeMatcher.find()) {
if (inComment) {
System.out.println("Comment");
numLinesComment++;
}
else {
System.out.println("Code " + input);
numLinesCode++;
}
}
else {
System.out.println("White Space");
numLinesWhiteSpace++;
}
}

System.out.println("Number of lines of code: " + numLinesCode);
System.out.println("Number of lines of comment: " +
numLinesComment);
System.out.println("Number of lines of white space: " +
numLinesWhiteSpace);
} catch (IOException exp) {

}
}

private static void initResources(String inputFileName) {
try {
br = new BufferedReader(new FileReader(inputFileName));
} catch (FileNotFoundException fnfe) {
System.out.println("Cannot locate input file! " + fnfe.getMessage());
System.exit(0);
}

slashSlashPattern = Pattern.compile(SLASHSLASHREGEX);
slashStarPattern = Pattern.compile(SLASHSTARREGEX);
starSlashPattern = Pattern.compile(STARSLASHREGEX);
codePattern = Pattern.compile(CODEREGEX);
whiteSpacePattern = Pattern.compile(WHITESPACEREGEX);

numLinesCode = 0;
numLinesComment = 0;
numLinesWhiteSpace = 0;
inComment = false;
slashStarFound = false;
}
}

opalpa@gmail.com opalinski from opalpaweb

2006-02-21, 7:03 pm

Mastering Regular Expressions by Friedl has 8 pages devoted to parsing
out C comments. C comments closely resemble Java comments.

Also, a side note, I believe SEI (Software Engineering Institute) has
tools that perform line counting by a couple of different standards.
They have a measure called "effective line count" which takes account
of different styles.

I personally write condensed code -- which results in undercounting. I
also, however, write programs that produce source code, which results
in overcounting. The fact that there are no measures of productivity
more sophisticated than line count shows how young our field, the field
of programming, is.

Opalinski
opalpa@gmail.com
http://www.geocities.com/opalpaweb/

Hendrik Maryns

2006-02-22, 7:59 am

JoshRountree@gmail.com wrote:
> I would like to go through a java source file and determine how many
> lines of code there are, and how many lines of comments there are,
> throwing out all the whitespace lines. I'm having a tough time getting
> my regular expressions right. Any help would be greatly appreciated.
>
>
> import java.io.*;
> import java.util.regex.*;
>
> public final class RegexTest {
>
> private static String SLASHSLASHREGEX = "[\\s]*\\A(//)[.]*";


This will only match if there is no whitespace in front, because \A
looks for the beginning of the string, so the leading \s* can never
consume anything. And why do you put brackets around the .? Inside
brackets the dot is a literal dot; just .* is enough. No need for the
parentheses either, as you do nothing with the captured group. My guess
(untested, try for yourself): "\\s*//.*". Then again, you are only
interested in the beginning of the line, right? Then "\\s*//" is
enough; it doesn't care what comes after the //.
Used with find(), this will also catch a line like
private String myString; //use for X
which you probably don't want. So then you should prepend ^ to it.
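
A small sketch to check the three variants discussed above against sample lines. The class name SlashSlashDemo and the helper isLineComment are made up for illustration; only java.util.regex.Pattern is assumed.

```java
import java.util.regex.Pattern;

final class SlashSlashDemo {
    // The pattern from the question: \A pins the match to the string start.
    static final Pattern ORIGINAL = Pattern.compile("[\\s]*\\A(//)[.]*");
    // The simplified pattern without an anchor.
    static final Pattern UNANCHORED = Pattern.compile("\\s*//");
    // The simplified pattern with ^ prepended.
    static final Pattern ANCHORED = Pattern.compile("^\\s*//");

    // find() looks for the pattern anywhere in the line, so anchoring matters.
    static boolean isLineComment(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "// plain comment",
            "    // indented comment",
            "private String myString; //use for X"
        };
        for (String s : samples) {
            System.out.println(isLineComment(ORIGINAL, s) + " "
                    + isLineComment(UNANCHORED, s) + " "
                    + isLineComment(ANCHORED, s) + " | " + s);
        }
    }
}
```

Only the anchored version should accept the indented comment while rejecting the line that has code in front of the //.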

> private static String SLASHSTARREGEX = "\\s*^/\\Q*\\E[.]*";


You're making life difficult for yourself again. ^ looks for the start
of the line, so the \s* in front of it is useless. Just leave that ^
out. And you can just quote the star with a backslash, no need for
constructs like \Q and \E. Leave out the brackets. "\\s*/\\*.*" Same
thing here, probably "\\s*/\\*" is enough. Same comment as before
regarding prepending ^.

> private static String STARSLASHREGEX = "[.]*\\Q*\\E/\\s*";


You should be getting the clue now: ".*\\*/". I think you need a lazy
quantifier here, though, would have to read up on that. That would be
".*?\\*/".

> private static String WHITESPACEREGEX = "\\s*[^.]+";


Now this one does not do what its name says: inside brackets the dot
is a literal dot, so [^.]+ matches any run of characters other than
'.', and the pattern matches almost every non-empty line. It also
fails on an empty line, since [^.]+ needs at least one character.
"^\\s*$" is the way to go.

> private static String CODEREGEX = "\\s*[^\\s]+";


This will match just about anything except for lines that contain only
whitespace, so also comments. As such, I don't see a need for it. You
test on all other cases anyway. Then again, as you don't test for
beginning of line, you might as well do "[^\\s]+".

Please correct me if I'm wrong, all untested.
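
For what it is worth, here is a minimal sketch of how the simplified patterns might be combined into a per-line classifier. It uses "^\\s*$" for blank lines (with the *, so that empty lines and lines of several spaces both match). The names LineClassifier, Kind and classify are invented for this example. It assumes a block comment only ever opens at the start of a line, it counts every line inside a block comment as comment, and it does not look inside string literals.

```java
import java.util.regex.Pattern;

final class LineClassifier {
    enum Kind { CODE, COMMENT, BLANK }

    private static final Pattern BLANK = Pattern.compile("^\\s*$");
    private static final Pattern LINE_COMMENT = Pattern.compile("^\\s*//");
    private static final Pattern BLOCK_OPEN = Pattern.compile("^\\s*/\\*");
    private static final Pattern BLOCK_CLOSE = Pattern.compile("\\*/");

    // True while we are between /* and */ across lines.
    private boolean inComment = false;

    Kind classify(String line) {
        if (inComment) {
            // Everything up to and including the closing line is comment.
            if (BLOCK_CLOSE.matcher(line).find()) {
                inComment = false;
            }
            return Kind.COMMENT;
        }
        if (BLANK.matcher(line).find()) {
            return Kind.BLANK;
        }
        if (LINE_COMMENT.matcher(line).find()) {
            return Kind.COMMENT;
        }
        if (BLOCK_OPEN.matcher(line).find()) {
            // Stay in comment mode unless the comment closes on this line.
            int open = line.indexOf("/*");
            inComment = line.indexOf("*/", open + 2) < 0;
            return Kind.COMMENT;
        }
        return Kind.CODE;
    }

    public static void main(String[] args) {
        String[] lines = {
            "/* header", " * text", " */", "", "int a = 1; // note", "// done"
        };
        LineClassifier c = new LineClassifier();
        for (String line : lines) {
            System.out.println(c.classify(line) + " | " + line);
        }
    }
}
```

A line with code followed by a trailing // comment is reported as CODE here; whether that is the right call depends on the counting standard you want.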

H.
--
Hendrik Maryns

==================
www.lieverleven.be
http://aouw.org
Roedy Green

2006-02-23, 3:57 am

On 21 Feb 2006 14:00:02 -0800, "opalpa@gmail.com opalinski from
opalpaweb" <opalpa@gmail.com> wrote, quoted or indirectly quoted
someone who said :

>Mastering Regular Expressions by Friedl has 8 pages devoted to parsing
>out C comments. C comments closely resemble Java comments.


I have written a number of bits of ordinary Java code that deal with
/* /** // comments, also "...". They are very quick, and fairly
straightforward to code with a little finite state automaton.

You have your states.

various non comment states
seen /
seen /*
seen //
seen /* *
etc

You then examine the next char and, depending on what state you are
in, you choose the next state.

I find this easiest to write with the following structure:

1. a char categoriser. Then your states only need to deal with
character categories e.g. slash star eol space letters digits quote
tick otherpunctuation

2. an enum with a value for each state.

3. a custom next method for enum state that can look at the next
char/category and returns what state to go into next.

You sometimes add a little lookahead ability to simplify.

I find this sort of code very easy to write and debug. Once you have
your states clearly defined, you only have to focus on one tiny slice
of the problem. If I am in this state and THIS comes next what state
comes next? You proofread by comparing similar states.
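
The three-part structure described above (categoriser, state enum, per-state next method) could look roughly like the following. This is only an illustrative sketch, not the code referred to in this post: the names CommentScanner, Category, State and count are invented, char literals such as '"' are not handled, and a line holding both code and a comment is counted as code.

```java
final class CommentScanner {
    enum Category { SLASH, STAR, EOL, QUOTE, BACKSLASH, SPACE, OTHER }

    // 1. The char categoriser: states only ever see categories.
    static Category categorise(char c) {
        switch (c) {
            case '/':  return Category.SLASH;
            case '*':  return Category.STAR;
            case '\n': return Category.EOL;
            case '"':  return Category.QUOTE;
            case '\\': return Category.BACKSLASH;
            case ' ': case '\t': case '\r': return Category.SPACE;
            default:   return Category.OTHER;
        }
    }

    // 2. One enum value per state, 3. each with its own next method.
    enum State {
        CODE {
            State next(Category c) {
                return c == Category.SLASH ? SEEN_SLASH
                     : c == Category.QUOTE ? IN_STRING : CODE;
            }
        },
        SEEN_SLASH {
            State next(Category c) {
                switch (c) {
                    case SLASH: return LINE_COMMENT;
                    case STAR:  return BLOCK_COMMENT;
                    case QUOTE: return IN_STRING;
                    default:    return CODE;
                }
            }
        },
        LINE_COMMENT {
            State next(Category c) { return c == Category.EOL ? CODE : LINE_COMMENT; }
        },
        BLOCK_COMMENT {
            State next(Category c) { return c == Category.STAR ? SEEN_STAR : BLOCK_COMMENT; }
        },
        SEEN_STAR {
            State next(Category c) {
                switch (c) {
                    case SLASH: return CODE;
                    case STAR:  return SEEN_STAR;
                    default:    return BLOCK_COMMENT;
                }
            }
        },
        IN_STRING {
            State next(Category c) {
                switch (c) {
                    case QUOTE: case EOL: return CODE;
                    case BACKSLASH:       return STRING_ESCAPE;
                    default:              return IN_STRING;
                }
            }
        },
        STRING_ESCAPE {
            State next(Category c) { return IN_STRING; }
        };

        abstract State next(Category c);

        boolean inComment() {
            return this == LINE_COMMENT || this == BLOCK_COMMENT || this == SEEN_STAR;
        }
    }

    /** Returns {codeLines, commentLines, blankLines} for the given source text. */
    static int[] count(String text) {
        int code = 0, comment = 0, blank = 0;
        boolean sawCode = false, sawComment = false;
        State state = State.CODE;
        // Make sure the last line is terminated so it gets tallied.
        String src = text.isEmpty() || text.endsWith("\n") ? text : text + "\n";
        for (int i = 0; i < src.length(); i++) {
            Category cat = categorise(src.charAt(i));
            State prev = state;
            state = prev.next(cat);
            // A slash that did not open a comment was code (e.g. division).
            if (prev == State.SEEN_SLASH && !state.inComment()) sawCode = true;
            boolean visible = cat != Category.SPACE && cat != Category.EOL;
            if (visible && state != State.SEEN_SLASH) {
                boolean closing = prev == State.SEEN_STAR && state == State.CODE;
                if (state.inComment() || closing) sawComment = true; else sawCode = true;
            }
            if (cat == Category.EOL) {
                if (sawCode) code++; else if (sawComment) comment++; else blank++;
                sawCode = sawComment = false;
            }
        }
        return new int[] { code, comment, blank };
    }

    public static void main(String[] args) {
        String src = "/* header\n * more\n */\n\nint x = 1; // trailing\n"
                   + "// whole line\nString s = \"// not a comment\";\nint y = x / 2;\n";
        int[] r = count(src);
        System.out.println("code=" + r[0] + " comment=" + r[1] + " blank=" + r[2]);
    }
}
```

Because the string state swallows the // inside the literal, the sample in main comes out as 3 code lines, 4 comment lines and 1 blank line.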

I get a child-like delight watching these things churn away when I
trace them to debug them.

Regexes are such frustrating things in comparison. Granted they are
terse, but when they don't work, you have no clue why.

Then there are parser generators. These are easier than you might
suppose, but I still find myself going back to hand coding my parsers
because:

1. I am not just parsing, I am DOING things. It is clumsy to add your
own code to a custom generated parser.

2. I am usually dealing with imperfect syntax. I want to soldier on
anyway, not just throw up my hands in disgust the way a parser
typically does when it encounters a syntax error.

3. they are an order of magnitude slower than my hand-coded ones.

See http://mindprod.com/jgloss/parser.html

http://mindprod.com/finitestate.html


--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.