| Summary: | G4NuclearLevelManager: missing essential "custom" copy constructor, and sorting of containers of pointers | ||
|---|---|---|---|
| Product: | Geant4 | Reporter: | novikova |
| Component: | processes/hadronic/models | Assignee: | Hans-Peter.Wellisch |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | ||
| Priority: | P2 | ||
| Version: | 5.2 | ||
| Hardware: | All | ||
| OS: | All | ||
Dear Elena, thank you for the very detailed analyis. The change towards 5.0P1, I had actually done myself, since the person (M.G.Pia) who was responsible for the code had abandoned ship. Not having the time to redo the design of the photon-evaporation code (which should be redone in more than this one respect) I had done a code review to make sure it actually works, and verified 'black box'. I'll look into Vladimir's latest changes. You are clearly right that this is risky, and can result in undefined behaviour. If the error materalizes, it should actually cause the occational crash. Many greetings, Hans-Peter. will be fixed in 6.0. The sorting is done using a g4 stl extention for pointer vectors... Many greetings, Hans-Peter. |
The bug I am about to describe has a long history. Apparently, it first came about when the package was switched from RogueWave to STL. It evolved over time, since the root of the problem was never fixed, only band-aid cure was applied. I first observed the bug in the version 4.1p01, since this is the version we consider best suited for our needs of modeling Silicon and Germanium gamma-ray Compton strip detectors. Then I checked what happened to it in version 5.0p01, since this is the version used by Dr. Fan Lei when he was writing his radioactivity code, which we are starting to use. Then I saw yet another reincarnation of this bug in version 5.2p01, which is the "current version" as of the moment of this writing. I did a pretty thorough investigation of the bug, which is summarized in this lengthy bug filing. The reason I am writing all the details here is because I sincerely hope that my notes will save somebody a whole lot of trouble in chasing this bug, and, hopefully, some other bugs. Please, please please, read it to the end. The code involved in the bug is located in objects G4NuclearLevelManager, G4PtrLevelVector and G4ContinuumGammaTransition. All this code is engaged when the radioactivity code is working - it is used for handling of the excited nucleus. Here is what was happening in version 4.1. Again, never mind that this is an old version. In my mind, it is important to know where things came from if we want to fix them properly. ****************************************** ****** Part I: version 4.1p01 ************ ****************************************** After observing pretty weird results coming out of tests of radioactive decay, I started my investigation. My investigation revealed the following. G4NuclearDecayChannel::FillDaughterNucleus() is being called from G4NuclearDecayChannel::G4NuclearDecayChannel(). Inside this function, the G4NuclearLevelManager constructor is called: // Determine the excitation state corresponds to an actual level in the // photo-evaporation data. Flag an error if the difference is too large. // if (theDaughterExcitation > 0.0) { G4NuclearLevelManager levelManager = G4NuclearLevelManager (daughterZ, daughterA); This constructor looks as follows: G4NuclearLevelManager::G4NuclearLevelManager(G4int Z, G4int A): _nucleusA(A), _nucleusZ(Z) { if (A <= 0 || Z <= 0 || Z > A ) G4Exception("==== G4NuclearLevelManager ==== (Z,A) <0, or Z>A"); _levels = 0; MakeLevels(); } As you see, it is calling G4NuclearLevelManager::MakeLevels() function. This function diligently reads the data off the disk (from the file z5.a11), and fills out the _levels member variable, which is declared as G4PtrLevelVector* _levels. The important thing is that the class G4PtrLevelVector is defined as typedef G4std::vector<G4NuclearLevel *> G4PtrLevelVector; i.e., it is defined as a container of **pointers** to objects of the class G4NuclearLevel, and not as a container of G4NuclearLevel objects themselves. Why is it important? Because function G4NuclearLevelManager::MakeLevels(), which fills out the member variable G4NuclearLevelManager._levels, performs a sorting operation on the object _levels points to: G4std::stable_sort(_levels->begin(), _levels->end()); This sorting operation is performed on the container of pointers, NOT on the container of objects! The sorting operation is a well known function from the STL, and it needs to have the "<" predicate to be defined. Although the operator "<" IS defined for the objects of the G4NuclearLevel class, it is never used (I did check this in the debugger). What's happening instead, the "<" operator on POINTERS G4NuclearLevel* is used. IN simple words, the sorting routine called from within G4NuclearLevelManager::MakeLevels() function sorts the levels according to the address of the objects, and NOT according to the energy level, as intended. Obviously, when the table is sorted according to the memory addresses of the items instead of their value, mess comes out. The algorithm that deals with the _levels variable in the level manager assumes that the table of levels is sorted in increasing order. For example, functions "G4double G4NuclearLevelManager::MinLevelEnergy()" and "G4double G4NuclearLevelManager::MaxLevelEnergy()" are based on that assumption. Thus, the sorting performed on the addresses of the items screws up all following calculations. Fortunately, this sorting operation was never needed, because the data was read from the disk "in order," i.e., already sorted according to the energy levels. That is why the simple fix of the problem was to comment out the sorting function. [Note: the whole bug would have never happened if the vector of G4NuclearLevel was organized as a vector of objects, and not pointers to objects. Or if there is a life-or-death need for it to be a container of pointers, then the operator "<" for the POINTERS to G4NuclearLevel should have been redefined.] ****************************************** ****** Part II: version 5.0p01 *********** ****************************************** The situation I described so far takes place in version 4.1p01, with happens to be the one we here use most. The version 5.0p01 has the line with the sorting function commented out in the function G4NuclearLevelManager::MakeLevels(). This takes care of this part of the problem. Unfortunately, this is not the whole problem. The fact that the container of G4PtrLevelVector is a container of pointers didn't go away, and thus, any sort operation (with the default sorting predicate) performed on it has a potential to screw up code. Unfortunately, this sorting operation is called in the copy constructor G4NuclearLevelManager::G4NuclearLevelManager(const G4NuclearLevelManager &right). [Note on the side that a copy constructor has to make a diligent god-honest copy of the object, and should never sort anything, according to good programming practices. A VERY good book, and a very easy-to-read book on good programming practices of C++ is "Effective C++ (second edition) 50 Specific Ways to Improve Your Programs and Designs" by Scott Meyers. There are probably newer editions out by now, and there are other books by the same author (on the same subject) - all of them excellent.] This sorting operation (it is stable_sort in 4.1, and just sort in 5.0) is present in both versions 4.1 and 5.0. I did check, this copy constructor IS being called during program operation, and the place that calls it is the constructor G4ContinuumGammaTransition::G4ContinuumGammaTransition(). Here is the beginning of the code of this constructor: G4ContinuumGammaTransition::G4ContinuumGammaTransition( const G4NuclearLevelManager& levelManager, G4int Z, G4int A, G4double excitation, G4int verbose): _nucleusA(A), _nucleusZ(Z), _excitation(excitation), _levelManager (levelManager) { const G4PtrLevelVector* levels = levelManager.GetLevels(); G4double eTolerance = 0.; if (levels != 0) { G4int lastButOne = levelManager.NumberOfLevels() - 2; if (lastButOne >= 0) { eTolerance = levelManager.MaxLevelEnergy() - levels->operator[] (lastButOne)->Energy(); if (eTolerance < 0.) eTolerance = 0.; } } Note two things here. 1. The copy constructor (of the G4NuclearLevelManager class) is called when the member variable _levelManager is initialized using the parameter levelManager as the right hand side. This means that _levelManager (the variable WITH the underscore) is screwed up, since the sorting function was called during its creation. The parameter levelManager (i.e., the variable WITHOUT underscore) is a "good one," since no sorting function was called upon it. 2. Inside the constructor G4ContinuumGammaTransition::G4ContinuumGammaTransition () - see above - the following thing happens. Instead of using the member variable that was just initialized, i.e. _levelManager, the parameter variable levelManager (without underscore) is used in the first several lines. [Side NOTE: this is against good programming practices!]. Lucky for the authors of the code, this little error (not following good programming practices of using the member variables after their initialization, and not the parameters passed in) saves the code! Remember, the variable _levelManager is "bad" (since it was sorted), the variable levelManager is "good." In the eights line of the code of the constructor, the function MaxLevelEnergy() is called. This function, as I already mentioned above, relies on the fact that the items in the levels table are sorted in increasing order. By calling the MaxLevelEnergy() function (see the code of the constructor shown above) on the levelManager instead of _levelManager, they get away with having a WRONG member variable! Thus, bad results are avoided here, but avoided by pure luck, not good programming. In fact, one little error (using the parameter instead of the member variable) let the authors get away with a much bigger error (calling sorting function in the copy constructor of G4NuclearLevelManager). That's not the end of the story. I still needed to check whether the variable _levelManager (the bad one) is used anywhere in the code of G4ContinuumGammaTransition class. There is one more such place: function G4ContinuumGammaTransition::SelectGamma(). Fortunately for the authors, the function called on here is _levelManager.NearestLevel(). The NearestLevel() code is not using the assumed fact that the levels table is sorted - it simply goes through the whole table and looks for the nearest value of energy level. [Another side note - IF the table were really sorted, many other, much faster algorithms to find the nearest value could have been used.] Thus, again, by pure luck, the disaster is avoided once again: even though the "bad" variable _levelManager is used, it is used in such a way that the fact that the table of levels is not properly sorted is irrelevant. So, here. Twice the disaster was avoided by pure luck. The whole thing was started by the wrong (in my mind), design: using a container of pointers where the container of objects would do just fine. The real bug was introduced by using the sorting function on the container of pointers. This bug was introduced twice, and the most harmful entry was commented out in version 5.0. The second entry is still in the code, and has potential of producing disaster. I traced the effect of the presence of the sorting function into the class G4ContinuumGammaTransition, and it doesn't do any harm there (mostly, by pure luck). However, if the copy constructor of G4NuclearLevelManager (the place where the second entry of the sorting function is present) is called from somewhere else in the code, the disaster may happen very easily. It will be a hard bug to chase, since the results depend on memory allocations (after all, sorting of memory addresses is what's happening during the invocation of the infamous sorting function!), and will differ from computer to computer, from operating system to operating system, from compiler to compiler. ****************************************** ****** Part III: version 5.2p01 ********** ****************************************** The bug was transformed (not fixed) yet again in version 5.2p01. I found the following message (comment) in the code of G4NuclearLevelManager.hh: // 02 May 2003, Vladimir Ivanchenko remove rublic copy constructor So the offending copy constructor - the one that had the "sort" in it - was removed. I checked - it is indeed removed by commenting out. Thus, the compiler will create the default copy constructor for this object, which will call the copy constructors of all nontrivial member variables, and will make a straight copy of trivial members. Unfortunately, there is one *very dangerous* trivial member of the class: G4PtrLevelVector* _levels The old copy constructor, commented out by Ivantchenko, had the lines _levels = new G4PtrLevelVector; G4int n = right._levels->size(); G4int i; for (i=0; i<n; i++) { _levels->push_back(new G4NuclearLevel(*(right._levels->operator[] (i)))); } These were wonderful lines: they assured that the NEW object of the type of G4PtrLevelVector is created, and all the members of this vector (the pointers to the nuclear levels) are copied. There is no "custom" copy constructor in version 5.2p01, there is only the one created by the compiler. That constructor is NOT going to make a NEW G4PtrLevelVector. That constructor is going to copy the value of the pointer G4PtrLevelVector*, and that's it. Thus, the Original and the Copy of the object G4NuclearLevelManager will both have the member pointer G4PtrLevelVector* pointing to one and the same object. When the Copy of G4NuclearLevelManager is deleted, the pointer G4PtrLevelVector* _levels will be wiped out (see the destructor of the object G4NuclearLevelManager). This means, that the pointer _levels **of the Original** object G4NuclearLevelManager will be zeroed also. Which, shall we say, may lead to unintended consequences, such as dereferencing a zero pointer later on in the code. ****************************************** ****** Part IV: Bottom line ************** ****************************************** Because the class G4PtrLevelVector was organized as a container of pointers and not objects, the sorting functions in G4NuclearLevelManager were screwing it up. This was version 4.1. Then, in version 5.0, one of the offending "sorts" (the one in G4NuclearLevelManager::MakeLevels()) was commented out. The other one (in copy constructor) was left in. In version 5.2, instead of commenting out the second offending sort, the whole function that had it (the copy constructor) was commented out. As a result, a new BAD bug was introduced, which potentially may lead to dereferencing a zero pointer. My recommendations: (1) If it is not feasible to rewrite the container of G4NuclearLevel* (i.e. pointers) into a container of G4NuclearLevel (i.e., objects), then the predicate ">" should be redefined for the pointers G4NuclearLevel*. This will allow to use the sorting operations. Of course, there is a band-aid method of simply commenting out the offending "sort" functions, but this leaves us to the possibility that one day the data format of the energy levels stored on hard drive will change, and the proper ordering won't be preserved. If such data are read from the disk and used later on in the code - the bug will be very hard to catch. Thus, the ">" for G4NuclearLevel* should be redefined, and the sorting operation in G4NuclearLevelManager::MakeLevels() turned on again. (2) In the present design, the copy constructor of the G4NuclearLevelManager should definitely be enabled, and it should be performing a god-honest copying of the stuff from &right into "this." No sorting operation should be there. The copy constructor is essential to avoid having Siamese twins that are present now in version 5.2p01: the Copy-Object and the Original-Object. Here is what I mean. After the default copy constructor, created by the compiler (in version 5.2p01) did its job, every variable, except for the "G4PtrLevelVector* _levels" member variable, of the G4NuclearLevelManager class exits in completely different memory location in Copy-Object and in the Original-Object. If you change one of these member variables in Copy-Object, the Original-Object "won't know" about this fact. If you change something in the Original-Object, the Copy-Object "won't know" about it. This is true for every member variable except for G4PtrLevelVector* _levels. Since this is a pointer, it will be pointing to the same location in both the Copy-Object and the Original-Object. If this pointer is deleted in one object, be it the Copy- Object or the Original-Object, the other one **will** know about it. Thus, we have Siamese twins. They have two separate independent copies of almost every organ (i.e. member variable), but only one and the same pointer to the container G4PtrLevelVector. Two solutions are possible here. In my mind, the BEST solution is to change "G4PtrLevelVector* _levels" to "G4PtrLevelVector _levels." There is no point in having a pointer here, since every time the new object of the class G4NuclearLevelManager is created, an operator "new" is called, and "delete" is called in the destructor. If the member variable is changed to from a pointer to an object, then there is no need in the custom copy constructor - the one created by the compiler will work just fine. If there is a definite need to have G4PtrLevelVector member in a form of a pointer, and not an object, then the "custom" copy-constructor should be enabled. However, this constructor should not engage into any sorting at all - it should just make a carbon copy of the Original-Object. (3) Also, it would be good to change the constructor of G4ContinuumGammaTransition to use the member variables of the class after they are initiated, instead of using the outside parameter. This is useful practice to follow with every class, since it leads to a much faster identification of bugs. Many greetings, respectfully -- Elena Novikova. P.S. I am a professional software developer with years and years of experience in designing and implementing C++ code of mathematical nature (not GUI, but the actual math). For the last year and a half my only job is to use Geant4 in Monte-Carlo modeling of Compton gamma-ray detectors. I'll be glad to help anyone from the Geant4 team with this bug or any other bug filed by me (I filed quite a few) or anyone else, as long as it is from the math part of the code (not visualization or data bases). I am in Washington DC, my phone number is (202) 404 1482. E-mail: novikova@ssd5.nrl.navy.mil