Problem 529

Summary: G4NuclearLevelManager: missing essential "custom" copy constructor, and sorting of containers of pointers
Product: Geant4 Reporter: novikova
Component: processes/hadronic/modelsAssignee: Hans-Peter.Wellisch
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 5.2   
Hardware: All   
OS: All   

Description novikova 2003-09-12 12:46:36 CEST
The bug I am about to describe has a long history.  Apparently, it first came
about when the package was switched from RogueWave to STL.  It evolved over
time, since the root of the problem was never fixed, only band-aid cure was
applied.

I first observed the bug in the version 4.1p01, since this is the version we
consider best suited for our needs of modeling Silicon and Germanium gamma-ray
Compton strip detectors.  Then I checked what happened to it in version 5.0p01,
since this is the version used by Dr. Fan Lei when he was writing his
radioactivity code, which we are starting to use.  Then I saw yet another
reincarnation of this bug in version 5.2p01, which is the "current version" as
of the moment of this writing.

I did a pretty thorough investigation of the bug, which is summarized in this
lengthy bug filing.  The reason I am writing all the details here is because I
sincerely hope that my notes will save somebody a whole lot of trouble in
chasing this bug, and, hopefully, some other bugs.  Please, please please, read
it to the end.

The code involved in the bug is located in objects G4NuclearLevelManager,
G4PtrLevelVector and G4ContinuumGammaTransition.  All this code is engaged when
the radioactivity code is working - it is used for handling of the excited
nucleus.

Here is what was happening in version 4.1. Again, never mind that this is an
old version. In my mind, it is important to know where things came from if we
want to fix them properly.

******************************************
****** Part I: version 4.1p01 ************
******************************************

After observing pretty weird results coming out of tests of radioactive decay,
I started my investigation. My investigation revealed the following.

G4NuclearDecayChannel::FillDaughterNucleus() is being called from
G4NuclearDecayChannel::G4NuclearDecayChannel().  Inside this function, the
G4NuclearLevelManager constructor is called:

  // Determine the excitation state corresponds to an actual level in the
  // photo-evaporation data.  Flag an error if the difference is too large.
  //
  if (theDaughterExcitation > 0.0) {
    G4NuclearLevelManager levelManager = G4NuclearLevelManager
      (daughterZ, daughterA);

This constructor looks as follows:

G4NuclearLevelManager::G4NuclearLevelManager(G4int Z, G4int A): _nucleusA(A),
_nucleusZ(Z)
{
  if (A <= 0 || Z <= 0 || Z > A )
    G4Exception("==== G4NuclearLevelManager ==== (Z,A) <0, or Z>A");
  _levels = 0;
  MakeLevels();
}

As you see, it is calling G4NuclearLevelManager::MakeLevels() function. This
function diligently reads the data off the disk (from the file z5.a11), and
fills out the _levels member variable, which is declared as G4PtrLevelVector*
_levels.  The important thing is that the class G4PtrLevelVector is defined as

typedef G4std::vector<G4NuclearLevel *> G4PtrLevelVector;

i.e., it is defined as a container of **pointers** to objects of the class
G4NuclearLevel, and not as a container of G4NuclearLevel objects themselves.
Why is it important? Because function G4NuclearLevelManager::MakeLevels(),
which fills out the member variable G4NuclearLevelManager._levels, performs a
sorting operation on the object _levels points to:

G4std::stable_sort(_levels->begin(), _levels->end());

This sorting operation is performed on the container of pointers, NOT on the
container of objects!  The sorting operation is a well known function from the
STL, and it needs to have the "<" predicate to be defined.  Although the
operator "<" IS defined for the objects of the G4NuclearLevel class, it is
never used (I did check this in the debugger).  What's happening instead,
the "<" operator on POINTERS G4NuclearLevel* is used.  IN simple words, the
sorting routine called from within G4NuclearLevelManager::MakeLevels() function
sorts the levels according to the address of the objects, and NOT according to
the energy level, as intended.

Obviously, when the table is sorted according to the memory addresses of the
items instead of their value, mess comes out.

The algorithm that deals with the _levels variable in the level manager assumes
that the table of levels is sorted in increasing order.  For example,
functions "G4double G4NuclearLevelManager::MinLevelEnergy()" and "G4double
G4NuclearLevelManager::MaxLevelEnergy()" are based on that assumption. Thus,
the sorting performed on the addresses of the items screws up all following
calculations.

Fortunately, this sorting operation was never needed, because the data was read
from the disk "in order,"  i.e., already sorted according to the energy
levels.  That is why the simple fix of the problem was to comment out the
sorting function.

[Note:  the whole bug would have never happened if the vector of G4NuclearLevel
was organized as a vector of objects, and not pointers to objects. Or if there
is a life-or-death need for it to be a container of pointers, then the
operator "<" for the POINTERS to G4NuclearLevel should have been redefined.]

******************************************
****** Part II: version 5.0p01 ***********
******************************************

The situation I described so far takes place in version 4.1p01, with happens to
be the one we here use most.  The version 5.0p01 has the line with the sorting
function commented out in the function G4NuclearLevelManager::MakeLevels().
This takes care of this part of the problem.

Unfortunately, this is not the whole problem.  The fact that the container of
G4PtrLevelVector is a container of pointers didn't go away, and thus, any sort
operation (with the default sorting predicate) performed on it has a potential
to screw up code.  Unfortunately, this sorting operation is called in the copy
constructor G4NuclearLevelManager::G4NuclearLevelManager(const
G4NuclearLevelManager &right).

[Note on the side that a copy constructor has to make a diligent god-honest
copy of the object, and should never sort anything, according to good
programming practices.  A VERY good book, and a very easy-to-read book on good
programming practices of C++ is "Effective C++ (second edition) 50 Specific
Ways to Improve Your Programs and Designs" by Scott Meyers.  There are probably
newer editions out by now, and there are other books by the same author (on the
same subject) - all of them excellent.]

This sorting operation (it is stable_sort in 4.1, and just sort in 5.0) is
present in both versions 4.1 and 5.0.  I did check, this copy constructor IS
being called during program operation, and the place that calls it is the
constructor G4ContinuumGammaTransition::G4ContinuumGammaTransition().  Here is
the beginning of the code of this constructor:

G4ContinuumGammaTransition::G4ContinuumGammaTransition(
                            const G4NuclearLevelManager& levelManager,
			    G4int Z, G4int A,
			    G4double excitation,
			    G4int verbose):
  _nucleusA(A), _nucleusZ(Z), _excitation(excitation), _levelManager
(levelManager)
{
  const G4PtrLevelVector* levels = levelManager.GetLevels();
  G4double eTolerance = 0.;
  if (levels != 0)
  {
    G4int lastButOne = levelManager.NumberOfLevels() - 2;
    if (lastButOne >= 0)
    {
      eTolerance = levelManager.MaxLevelEnergy() - levels->operator[]
(lastButOne)->Energy();
      if (eTolerance < 0.) eTolerance = 0.;
    }
  }

Note two things here.
1. The copy constructor (of the G4NuclearLevelManager class) is called when the
member variable _levelManager is initialized using the parameter levelManager
as the right hand side.  This means that _levelManager (the variable WITH the
underscore) is screwed up, since the sorting function was called during its
creation.  The parameter levelManager (i.e., the variable WITHOUT underscore)
is a "good one,"  since no sorting function was called upon it.
2. Inside the constructor G4ContinuumGammaTransition::G4ContinuumGammaTransition
() - see above - the following thing happens. Instead of using the member
variable that was just initialized, i.e. _levelManager, the parameter variable
levelManager (without underscore) is used in the first several lines.

[Side NOTE: this is against good programming practices!].

Lucky for the authors of the code, this little error (not following good
programming practices of using the member variables after their initialization,
and not the parameters passed in) saves the code!  Remember, the variable
_levelManager is "bad" (since it was sorted), the variable levelManager
is "good."  In the eights line of the code of the constructor, the function
MaxLevelEnergy() is called. This function, as I already mentioned above, relies
on the fact that the items in the levels table are sorted in increasing order.
By calling the MaxLevelEnergy() function (see the code of the constructor shown
above) on the levelManager instead of _levelManager, they get away with having
a WRONG member variable!

Thus, bad results are avoided here, but avoided by pure luck, not good
programming. In fact, one little error (using the parameter instead of the
member variable) let the authors get away with a much bigger error (calling
sorting function in the copy constructor of G4NuclearLevelManager).

That's not the end of the story.  I still needed to check whether the variable
_levelManager (the bad one) is used anywhere in the code of
G4ContinuumGammaTransition class. There is one more such place: function
G4ContinuumGammaTransition::SelectGamma().  Fortunately for the authors, the
function called on here is _levelManager.NearestLevel().  The NearestLevel()
code is not using the assumed fact that the levels table is sorted - it simply
goes through the whole table and looks for the nearest value of energy level.
[Another side note - IF the table were really sorted, many other, much faster
algorithms to find the nearest value could have been used.]

Thus, again, by pure luck, the disaster is avoided once again:  even though
the "bad" variable _levelManager is used, it is used in such a way that the
fact that the table of levels is not properly sorted is irrelevant.

So, here.  Twice the disaster was avoided by pure luck.  The whole thing was
started by the wrong (in my mind), design:  using a container of pointers where
the container of objects would do just fine.  The real bug was introduced by
using the sorting function on the container of pointers.  This bug was
introduced twice, and the most harmful entry was commented out in version 5.0.
The second entry is still in the code, and has potential of producing
disaster.  I traced the effect of the presence of the sorting function into the
class G4ContinuumGammaTransition, and it doesn't do any harm there (mostly, by
pure luck).  However, if the copy constructor of G4NuclearLevelManager (the
place where the second entry of the sorting function is present) is called from
somewhere else in the code, the disaster may happen very easily.  It will be a
hard bug to chase, since the results depend on memory allocations (after all,
sorting of memory addresses is what's happening during the invocation of the
infamous sorting function!), and will differ from computer to computer, from
operating system to operating system, from compiler to compiler.

******************************************
****** Part III: version 5.2p01 **********
******************************************

The bug was transformed (not fixed) yet again in version 5.2p01.

I found the following message (comment) in the code of G4NuclearLevelManager.hh:

//        02 May 2003,   Vladimir Ivanchenko remove rublic copy constructor

So the offending copy constructor - the one that had the "sort" in it - was
removed. I checked - it is indeed removed by commenting out.  Thus, the
compiler will create the default copy constructor for this object, which will
call the copy constructors of all nontrivial member variables, and will make a
straight copy of trivial members.

Unfortunately, there is one *very dangerous* trivial member of the class:
G4PtrLevelVector* _levels

The old copy constructor, commented out by Ivantchenko, had the lines
	_levels = new G4PtrLevelVector;
	G4int n = right._levels->size();
	G4int i;
	for (i=0; i<n; i++)
	  {
	    _levels->push_back(new G4NuclearLevel(*(right._levels->operator[]
(i))));
	  }
These were wonderful lines: they assured that the NEW object of the type of
G4PtrLevelVector is created, and all the members of this vector (the pointers
to the nuclear levels) are copied.  There is no "custom" copy constructor in
version 5.2p01, there is only the one created by the compiler.  That
constructor is NOT going to make a NEW G4PtrLevelVector. That constructor is
going to copy the value of the pointer G4PtrLevelVector*, and that's it.  Thus,
the Original and the Copy of the object G4NuclearLevelManager will both have
the member pointer G4PtrLevelVector* pointing to one and the same object.  When
the Copy of G4NuclearLevelManager is deleted, the pointer G4PtrLevelVector*
_levels will be wiped out (see the destructor of the object
G4NuclearLevelManager). This means, that the pointer _levels **of the
Original** object G4NuclearLevelManager will be zeroed also.  Which, shall we
say, may lead to unintended consequences, such as dereferencing a zero pointer
later on in the code.

******************************************
****** Part IV: Bottom line **************
******************************************

Because the class G4PtrLevelVector was organized as a container of pointers and
not objects, the sorting functions in G4NuclearLevelManager were screwing it
up. This was version 4.1.  Then, in version 5.0, one of the offending "sorts"
(the one in  G4NuclearLevelManager::MakeLevels()) was commented out.  The other
one (in copy constructor) was left in.  In version 5.2, instead of commenting
out the second offending sort, the whole function that had it (the copy
constructor) was commented out.  As a result, a new BAD bug was introduced,
which potentially may lead to dereferencing a zero pointer.

My recommendations:

(1) If it is not feasible to rewrite the container of G4NuclearLevel* (i.e.
pointers) into a container of G4NuclearLevel (i.e., objects), then the
predicate ">" should be redefined for the pointers G4NuclearLevel*. This will
allow to use the sorting operations.  Of course, there is a band-aid method of
simply commenting out the offending "sort" functions, but this leaves us to the
possibility that one day the data format of the energy levels stored on hard
drive will change, and the proper ordering won't be preserved.  If such data
are read from the disk and used later on in the code - the bug will be very
hard to catch.  Thus, the ">" for G4NuclearLevel* should be redefined, and the
sorting operation in G4NuclearLevelManager::MakeLevels() turned on again.

(2) In the present design, the copy constructor of the G4NuclearLevelManager
should definitely be enabled, and it should be performing a god-honest copying
of the stuff from &right into "this."  No sorting operation should be there.
The copy constructor is essential to avoid having Siamese twins that are
present now in version 5.2p01:  the Copy-Object and the Original-Object.  Here
is what I mean.  After the default copy constructor, created by the compiler
(in version 5.2p01) did its job, every variable, except for
the "G4PtrLevelVector* _levels" member variable, of the G4NuclearLevelManager
class exits in completely different memory location in Copy-Object and in the
Original-Object.  If you change one of these member variables in Copy-Object,
the Original-Object "won't know" about this fact.  If you change something in
the Original-Object, the Copy-Object "won't know" about it.  This is true for
every member variable except for G4PtrLevelVector* _levels.  Since this is a
pointer, it will be pointing to the same location in both the Copy-Object and
the Original-Object.  If this pointer is deleted in one object, be it the Copy-
Object or the Original-Object, the other one **will** know about it.  Thus, we
have Siamese twins.  They have two separate independent copies of almost every
organ (i.e. member variable), but only one and the same pointer to the
container G4PtrLevelVector.
Two solutions are possible here.
In my mind, the BEST solution is to change "G4PtrLevelVector* _levels"
to "G4PtrLevelVector _levels."  There is no point in having a pointer here,
since every time the new object of the class G4NuclearLevelManager is created,
an operator "new" is called, and "delete" is called in the destructor.  If the
member variable is changed to from a pointer to an object, then there is no
need in the custom copy constructor - the one created by the compiler will work
just fine.
If there is a definite need to have G4PtrLevelVector member in a form of a
pointer, and not an object, then the "custom" copy-constructor should be
enabled.  However, this constructor should not engage into any sorting at all -
it should just make a carbon copy of the Original-Object.

(3)  Also, it would be good to change the constructor of
G4ContinuumGammaTransition to use the member variables of the class after they
are initiated, instead of using the outside parameter.  This is useful practice
to follow with every class, since it leads to a much faster identification of
bugs.

Many greetings,
respectfully
-- Elena Novikova.

P.S. I am a professional software developer with years and years of experience
in designing and implementing C++ code of mathematical nature (not GUI, but the
actual math). For the last year and a half my only job is to use Geant4 in
Monte-Carlo modeling of Compton gamma-ray detectors.  I'll be glad to help
anyone from the Geant4 team with this bug or any other bug filed by me (I filed
quite a few) or anyone else, as long as it is from the math part of the code
(not visualization or data bases).  I am in Washington DC, my phone number is
(202) 404 1482. E-mail: novikova@ssd5.nrl.navy.mil
Comment 1 Hans-Peter.Wellisch 2003-09-15 04:35:59 CEST
Dear Elena,

  thank you for the very detailed analyis. The change towards 5.0P1, I had
actually done myself, since the person (M.G.Pia) who was responsible for the
code had abandoned ship. Not having the time to redo the design of the
photon-evaporation code (which should be redone in more than this one respect) I
had done a code review to make sure it actually works, and verified 'black box'.

I'll look into Vladimir's latest changes. You are clearly right that this is
risky, and can result in undefined behaviour. If the error materalizes, it
should actually cause the occational crash.

Many greetings,

Hans-Peter.
Comment 2 Hans-Peter.Wellisch 2003-11-14 05:28:59 CET
will be fixed in 6.0. The sorting is done using a g4 stl extention for pointer
vectors...

Many greetings,

Hans-Peter.