A Perl Mystery

I’m trying to write a script to consolidate my book index, and I’ve run into a problem that’s driving me nuts. Can anyone see what’s going on here?

Here’s the relevant code:

if ($pagenumber) { #6
$numbers = $_;
$i = 0;

foreach (@numberarray) {
$numberarray[$i] = "";
$i = $i + 1;
} # End foreach
$i = 0;

foreach (@sorted_numbers) {
$sorted_numbers[$i] = "";
$i = $i + 1;
} # End foreach
$i = 0;

print $DEBUG2 "\$numbers is $numbers before entering while.\n";

while ($numbers =~ /(\d+)(.*)/) {
print $DEBUG2 "\$1 is |$1|, \$pagenumber is |$pagenumber|.\n";
if (!($1 eq $pagenumber)) {
$numberarray[$i] = $1;
$i = $i + 1;
} # End if
print $DEBUG2 "In while loop, \$numberarray[$i] is $numberarray[$i].\n";
$numbers = $2;
print $DEBUG2 "Point 10: \$numbers is $numbers, \$i is |$i|.\n";
} # End while
$numberarray[$i] = $pagenumber;
print $DEBUG2 "After while loop, \$numberarray[$i] is |$numberarray[$i]|, \$sorted_numbers[$i] is |$sorted_numbers[$i]|\n";
@sorted_numbers = sort { $a <=> $b } @numberarray;
@numberarray = @sorted_numbers;
print $DEBUG2 "After sort, \$sorted_numbers[$i] is |$sorted_numbers[$i]|.\n";

print $DEBUG2 "After sort \@numberarray is @numberarray, \$numberarray[$i] is $numberarray[$i].\n";
$i = 0;
print $DEBUG2 "About to enter foreach (\@sorted_numbers).\n";
$next = $i + 1; #Just for diagnostics
print $DEBUG2 "Before foreach, \$totalname is |$totalname|, \$numberarray[$i] is |$numberarray[$next]|, \$numberarray[$next] is |$numberarray[$i+1]|.\n";
foreach (@numberarray) {
print $DEBUG2 "Got inside the loop.\n";
print $DEBUG2 "**\$totalname is $totalname for \$i = $i.\n";
$totalname = $totalname . " " . "$numberarray[$i],";
$i = $i + 1;
}
print $DEBUG2 "Before chop condition, \$totalname is $totalname.\n";
if ($totalname =~ /.*\,$/) {chop $totalname}
print $DEBUG2 "After chop condition, \$totalname is $totalname.\n";

And here’s the debug output for two different cases. One works and the other doesn’t, and I can’t figure out what’s happening, but whatever it is, it seems to be happening in the sort. They should both give similar output (the name and a single page number), but as you can see, they don’t.

CASE 1

$numbers is Michael Adams 19, before entering while.
$1 is |19|, $pagenumber is |19|.
In while loop, $numberarray[0] is .
Point 10: $numbers is ,, $i is |0|.
After while loop, $numberarray[0] is |19|, $sorted_numbers[0] is ||
$sorted_numbers[0] is |19|.
after exiting @numberarray is 19, $numberarray[0] is 19.
About to enter foreach (@sorted_numbers).
Before foreach, $totalname is |Adams, Michael|, $numberarray[0] is ||, $numberarray[1] is ||.
Got inside the loop.
**$totalname is Adams, Michael for $i = 0.
Before chop condition, $totalname is Adams, Michael 19,.
After chop condition, $totalname is Adams, Michael 19.

CASE 2

$numbers is Bill Anders 26, before entering while.
$1 is |26|, $pagenumber is |26|.
In while loop, $numberarray[0] is .
Point 10: $numbers is ,, $i is |0|.
After while loop, $numberarray[0] is |26|, $sorted_numbers[0] is ||
$sorted_numbers[0] is ||.
after exiting @numberarray is 26, $numberarray[0] is .
About to enter foreach (@sorted_numbers).
Before foreach, $totalname is |Anders, Bill|, $numberarray[0] is ||, $numberarray[1] is ||.
Got inside the loop.
**$totalname is Anders, Bill for $i = 0.
Got inside the loop.
**$totalname is Anders, Bill , for $i = 1.
Got inside the loop.
**$totalname is Anders, Bill , , for $i = 2.
Before chop condition, $totalname is Anders, Bill , , 26,.
After chop condition, $totalname is Anders, Bill , , 26.

22 thoughts on “A Perl Mystery”

  1. I think what you’re trying to do here is take an input string $numbers which is a bunch of page numbers separated by nonnumeric characters, then extract the page numbers, sort them, and then list them in order after the subject heading which is originally in $totalname.

    Your code snippet is not quite the same version as used to produce the outputs–e.g., the code prints two “After sort” lines, not seen in the two cases you give. So I’m not sure of this answer. But I think what’s probably going on is that @numberarray is getting filled up with empty string values by your array-clearing loops (the first two foreach (…) loops near the top of your code), which your sort places before all actual page numbers. This could happen if e.g. this code is in a loop over all index entries and your “CASE 2” is following an entry with several page numbers. Then you end up with a @numberarray something like ( 26, “”, “” ); sorting this gives ( “”, “”, 26 ), and after you concatenate all of the page numbers you end up with those extra leading commas.

    I would replace those two clearing loops with just
    @numberarray = ();
    (you don’t need to clear @sorted_numbers since it’s just being reassigned later).
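
    A minimal sketch of that failure mode, assuming the previous entry left three slots in @numberarray (the sample values are made up for illustration, and the real script's state may differ):

```perl
use strict;
use warnings;

# Hypothetical leftover state: the previous index entry had three page numbers.
my @numberarray = (3, 4, 67);

# Like the original clearing loop, this blanks each slot but keeps the length at 3.
$_ = "" for @numberarray;

# The next entry stores its single page number in slot 0.
$numberarray[0] = 26;

my @sorted;
{
    # "" counts as 0 in a numeric comparison (with a warning under
    # "use warnings", silenced here), so the blanks sort ahead of 26.
    no warnings 'numeric';
    @sorted = sort { $a <=> $b } @numberarray;
}
print join("|", @sorted), "\n";    # two blanks first, then 26

# Emptying the array instead leaves only the new page number.
@numberarray = ();
$numberarray[0] = 26;
my @clean = sort { $a <=> $b } @numberarray;
print scalar(@clean), " element(s): @clean\n";
```

    The blanked version has the same shape as the “Anders, Bill , , 26” line in CASE 2: two empty slots ahead of the real page number.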

    There are much more Perl-y ways of doing this with split and join, involving a lot fewer explicit loops. Here’s something that might not quite work (I don’t know what your actual input specification is) but is close:

    if ($pagenumber) { #6
    # extract all digit substrings (by splitting on all nonempty nondigit substrings)
    @numberarray = split /\D+/, $_;

    # the first element will be "" if $_ starts with a nondigit; drop this element
    # (string comparison needs eq, not ==)
    shift @numberarray if @numberarray && $numberarray[0] eq "";

    # sort the elements numerically
    @numberarray = sort { $a <=> $b } @numberarray;

    # put them in a comma-separated list, following the string in $totalname
    $totalname .= " " . join(", ", @numberarray);

    print "\$totalname = $totalname\n";
    }

    If you describe more precisely what you expect to be in $pagenumber, $_, and $totalname (or whatever the inputs are) at the top of this code segment I can probably help more.

    1. I think what you’re trying to do here is take an input string $numbers which is a bunch of page numbers separated by nonnumeric characters, then extract the page numbers, sort them, and then list them in order after the subject heading which is originally in $totalname.

      Right.

      Your code snippet is not quite the same version as used to produce the outputs–e.g., the code prints two “After sort” lines, not seen in the two cases you give. So I’m not sure of this answer.

      The only change, AFAIK, was to delete a diagnostic printout, but the point is taken.

      The input for the outer loop is “Lastname, Firstname 1” (or any page number).
      The input for the inner loop might be something like “Firstname Lastname 3, 4, 67, 80”

      Desired output would be “Lastname, Firstname 1, 3, 4, 67, 80”
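
      For illustration, here is a rough sketch of that merge. It assumes each line ends in a comma-separated run of page numbers and that the two names have already been matched up; merge_pages is a hypothetical helper, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: append the page numbers found in $inner to the
# heading in $outer, sorted numerically and without duplicates.
sub merge_pages {
    my ($outer, $inner) = @_;

    # Split "Lastname, Firstname 1" into the heading and its trailing numbers.
    my ($heading, $outer_pages) = $outer =~ /^(.*?)\s+(\d+(?:\s*,\s*\d+)*)\s*,?\s*$/
        or return $outer;    # no trailing page numbers: leave the line alone

    # Only the trailing run of numbers on the inner line is used.
    my ($inner_pages) = $inner =~ /(\d+(?:\s*,\s*\d+)*)\s*,?\s*$/;
    $inner_pages = "" unless defined $inner_pages;

    my %seen;
    my @pages = sort { $a <=> $b }
                grep { !$seen{$_}++ }
                ("$outer_pages,$inner_pages" =~ /(\d+)/g);

    return "$heading " . join(", ", @pages);
}

print merge_pages("Adams, Michael 1", "Michael Adams 3, 4, 67, 80"), "\n";
```

      Building the list fresh inside the sub means nothing carries over from one index entry to the next.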

  2. Did you run the two cases independently, or one after another with the same state? I’m wondering if your array initializations are really doing what you want – an array with a few extra “” at the end might sort those to the front, causing the confusion.
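
    One quick way to check, sketched here with a hypothetical dump_array helper and made-up sample state, is to print the array's length and delimited contents just before the sort:

```perl
use strict;
use warnings;

# Hypothetical diagnostic helper: report how many slots an array has and
# what is in each one, with delimiters so empty strings are visible.
sub dump_array {
    my ($label, @values) = @_;
    return sprintf "%s has %d element(s): %s",
        $label, scalar(@values),
        join(" ", map { defined $_ ? "|$_|" : "|undef|" } @values);
}

# Made-up state: slots blanked by an earlier entry, plus one new page number.
my @numberarray = (26, "", "");
print dump_array('@numberarray', @numberarray), "\n";
```

    If the count is larger than the number of pages on the current line, the extra slots are left over from an earlier entry.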

  3. Missing a close brace at the top:

    if ($pagenumber) { #6
    $numbers = $_;
    $i = 0;

    Might I suggest indentation? Helps make the code more readable.

    1. Having posted code snippets before, I don’t think most blog software is very indentation friendly and it often even strips redundant white space.

    2. I do indent the code, just not the diagnostic printing statements, which will come out eventually. But you’re right, it would make it easier to read.

      1. As a possible quick solution, the <pre> tag preserves whitespace, and usually renders the result in a monospaced font, unless WordPress is somehow monkeying with it.

  4. After studying your code line by line, I think I see what’s going on. You’re trying to write code to do what TExtract, among other commercial products, already does.

    Why???

  5. As someone who likes Python, I would say the first bug is: you are using Perl. 🙂

    I’m surprised you haven’t gotten shift-arthritis yet (the inflammation you get in your joints having to constantly type special characters).

    I kid, I kid.

    1. I enjoy studying different languages, probably because I got an early start with APL back in the late 70s. I’m still looking for the right language and have concluded I’d have to write it myself some day. Many others have come to this same conclusion, which is why there are probably more computer languages than spoken ones.

      1000 years from now they will look back and say, “they should have just thrown a bunch of symbols against the wall” the way monkeys used to do modern art.

      Code should be like reading a book. You should be able to look at it ten years later and understand it as well as you did on the first day. This means not doing the thing programmers love to do: getting tricky. Terse is not the wonderful thing it seems to non-typists.

      Maintenance is your highest cost. I’m too dumb to be a Perl programmer (the epitome of the tricky language) but I can maintain and upgrade millions of lines of code all by my lonesome. I tried, but can’t help with Perl. Perl is not the worst. I used to work for a guy in NY who overwrote his own BAL code (self-modifying). Try maintaining something that isn’t there!

      C starts out elegant, then they screw it up with the libraries. Programmers need to keep a good grade school English dictionary on hand. Malloc is not a word.

      1. There’s a joke I’ve heard that says that Perl is the only language that looks the same encrypted as it does in plaintext.

        1. Found it:

          Perl: The only language that looks the same before and after RSA encryption

  6. Ex-C programmer, Rand? Or (shudder) Fortran?

    I think you might be able to replace a lot of those lines with:

    my @numberarray = sort {$a <=> $b} ($numbers =~ /(\d+)/g);

    Reading right-to-left: greedily extract all digit substrings from “$numbers”, sort into ascending numeric order, store in “@numberarray”. Kind of what you wanted?
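
    As a small usage sketch (the sample input and the $entry variable are assumptions for illustration), the same line combined with join builds the whole index entry:

```perl
use strict;
use warnings;

# Sample input in the shape shown in the post (an assumption).
my $numbers   = "Michael Adams 80, 3, 67, 4";
my $totalname = "Adams, Michael";

# In list context, /(\d+)/g returns every captured run of digits.
my @numberarray = sort { $a <=> $b } ($numbers =~ /(\d+)/g);

# join() puts separators only between elements, so there is no trailing comma to chop.
my $entry = "$totalname " . join(", ", @numberarray);
print "$entry\n";    # Adams, Michael 3, 4, 67, 80
```

    Because @numberarray is declared with my and assigned in one step, it cannot pick up empty slots from an earlier entry.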

    1. No, actually, my first language was Algol (I later preferred Pascal). And I do FORTRAN, BASIC and other non-structured languages only under duress. The problem is that I don’t do much programming at all these days, and only do Perl every few years or so, so I always have to relearn it. And this was supposed to be quick and dirty, but turned out to be not-so-quick and dirty. I should have flowcharted it and broken it up into subroutines.
