GIS3 CPU trouble
An e-mail from Dr. Maxima and the GIS team on Apr 20, 1994
From maxima@miranda.phys.s.u-tokyo.ac.jp Wed Apr 20 06:02:18 1994
Received: by heasarc.gsfc.nasa.gov (5.65/DEC-Ultrix/4.3) id AA22983; Wed, 20 Apr 1994 06:09:42 -0400
Date: Wed, 20 Apr 94 19:05:34 +0900
From: maxima@miranda.phys.s.u-tokyo.ac.jp (Kazuo Makishima)
Message-Id: <9404201005.AA25436@miranda.phys.s.u-tokyo.ac.jp>
To: astrodteam@astro.isas.jaxa.jp
Status: RO

Dear ASCA Colleagues:

************************************************************************
We regret to inform you of a trouble in one of the two GIS CPU memories, of which you may already have heard from Nagase-san via e-mail. The trouble caused the pulse height data of GIS-S3 to be edited in a somewhat wrong form. This lasted from February 10 till April 8, a period of two months. Below we describe the nature of the trouble, its solution, its cause, its impact upon the GO data acquired in this period, and proposed actions for the prevention of similar problems in the future.
************************************************************************

1. The Trouble

On April 9, we were notified by the KSC duty scientists that the routine software for monitoring the GIS gain was failing to fit the GIS-S3 isotope line. A quick investigation at U. Tokyo revealed that the S3 pulse height spectrum had a wrong shape: normally the spectrum is output in 1024 channels (10 bits) covering up to 12 keV, but the spectrum obtained exhibited events only in every 8th channel. Apparently the lower three bits of the pulse height information were fixed at [101] (or 5 in decimal) for all the S3 events. A backward search indicated that the problem had persisted for a considerable length of time without being noticed by anybody. We now understand that the trouble happened on February 10, between 22:05 and 22:35 UT, after a passage through the SAA. The target being observed was AWM7.

The reason why the problem was not noticed for two months is twofold.
One is that the event rate is rather low, so that the spectrum obtained in the KSC quick look usually contains too few counts to reveal the problem. The other is, of course, that the trouble happened in the GO phase, when the data could not be accessed by the hardware team.

2. Solution

Such a trouble can be caused either by a hardware failure of the ADC (analog-to-digital converter) or related circuit elements in GIS-E, or by a software error. To discriminate between these two possibilities, on April 8 (the first pass occurred at JST 21:59) we performed a GIS memory check (MEMCHK) and telemetered the contents of the GIS memory (two 32 kB RAMs) down to the ground. Note that the GIS MEMCHK can be done by a single discrete command and takes 32 seconds. We found one erroneous word in the CPU3 memory. The word was immediately corrected from KSC by a block command (called the RAM-patch action), and the S3 pulse height spectrum returned to normal.

We admit that another problem happened during these operations, namely that the rise time (RT) spectrum of S3 became strange. However, this has nothing to do with the hard-wired processing, and it was solved by the 5th contact (in the early morning of April 9). For simplicity we will skip this subject today.

3. Cause of the Trouble

On analyzing the memory-check results, we found that the faulty 8-bit word was in the program area of the CPU3 memory. It corresponds exactly to the address pointer that points to the address where the lower 5 bits of the hard-wire processed 12-bit PH information and the lower 2 bits of the S3 event timing are stored. (Note that the GIS uses 12-bit ADCs, but we always discard the lower 2 bits to get 10-bit PH information.) As this word was destroyed, the CPU was reading in irrelevant information from a wrong address (one that is harmless as far as the CPU operation is concerned). This completely explains what had happened for the two months.
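As an illustration of the bit layout described above, the following Python sketch shows why a fixed pattern in the lower three bits of the 10-bit pulse height leaves events only in every 8th channel. It is a simplified model built from the description in this message, not GIS on-board software; the function names are hypothetical, and the assumption that the upper 7 bits of the pulse height remain correct is taken from the statement that only the lower bits were read from the wrong address.

```python
STUCK_LOW_BITS = 0b101  # pattern reported in the message (5 in decimal)


def adc12_to_ph10(adc12):
    """Drop the lower 2 bits of a 12-bit ADC value to get a 10-bit PH channel."""
    return (adc12 & 0xFFF) >> 2


def corrupt_ph10(ph10):
    """Assumed fault model: the lower 3 bits of the 10-bit PH are replaced
    by the fixed pattern, while the upper 7 bits stay correct."""
    return (ph10 & 0x3F8) | STUCK_LOW_BITS


# Feed every possible 12-bit ADC value through the model and collect
# the distinct 10-bit channels that can still receive events.
populated = sorted({corrupt_ph10(adc12_to_ph10(adc)) for adc in range(4096)})

print(populated[:4])   # [5, 13, 21, 29]
print(len(populated))  # 128
```

In this model the lower 5 bits of the 12-bit value reduce to the lower 3 bits of the 10-bit channel once the 2 least significant bits are discarded, so only 128 of the 1024 channels are populated, spaced 8 channels apart.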

The GIS memory has a 4-bit error correction code (Hamming code), and 1-bit errors should be corrected by the CPU. However, if a 2-bit error occurs in a single word, the CPU will no longer operate normally. Actually the faulty word discovered in the memory check was [C230] in hex, while it should normally be [C201]. Therefore this is a 3-bit error in a word. It is very likely to have been caused initially by a 2-bit error, i.e. [C201]-->[C231]. Then, after an automatic Hamming-code correction, [C231] was modified into [C230].

In summary, a 2-bit error happened in a word in the CPU3 memory on February 10, probably within the SAA. This error did not cause a CPU hang-up, but instead made CPU3 read wrong information for the lower 5 (effectively lower 3) bits of the S3 PH information, as well as for the lower 2 bits of the S3 timing information.

4. Impact upon the GO Data

The impact of the trouble on the GO data acquired between February 10 and April 9 is twofold. First, the lower 3 bits of the PH output information for all the S3 events are wrong. However, if we bin up the 1024 channels (10 bits) of PH into 128 channels (7 bits), the resulting spectrum is completely normal. We therefore believe that the S3 data can still be utilized, especially for faint sources or sources without significant spectral features.

The other impact is on the timing information. The GIS events can be tagged with up to 10 bits of timing information in non-standard operation, and the lower 2 bits are lost due to the trouble. This does not affect any time information in normal-mode operation, or high-time-resolution data with 8 or fewer timing bits. As far as we are aware, a few observations during the period were conducted with 10-bit time information. It has yet to be examined whether the loss of the lower 2 bits is fatal to these observations or not.

5. Future Prevention of Similar Troubles

It seems that 2-bit errors occur more frequently than is inferred from the occurrence rate of 1-bit errors, although we do not yet know whether this is specific to the GIS or not. Since the 2-bit error occurred on logically adjacent bits, we suspect that it was not due to a chance coincidence of two 1-bit errors, but rather to a single impact by an energetic particle.

For the prevention of similar problems in the future, we will carry out the following two actions. Firstly, we have asked the operation team to conduct the GIS memory check typically once every day. If errors are found, the KSC duty scientists and operators will immediately perform the RAM-patch action to correct them. We believe that this has already started at KSC. The other is to conduct a quick look at the University of Tokyo, typically once or twice a week, to check for any strange behavior in the data. This means that a very small fraction of the GO data will be stored at Univ. Tokyo, but of course we will not do any scientific research with it.

Finally we, the GIS team, apologize to all the colleagues for this problem and any inconvenience that might be caused by it. We would appreciate your thoughtful understanding.

Regards,

April 20

K. Makishima and M. Tashiro : University of Tokyo (maxima@miranda.phys.s.u-tokyo.ac.jp, tashiro@miranda.phys.s.u-tokyo.ac.jp)
T. Ohashi : Tokyo Metropolitan University (ohashi@phys.metro-u.ac.jp)
M. Ishida : ISAS (ishida@astro.isas.jaxa.jp)
and the GIS team
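The bit arithmetic quoted in Sections 3 and 4 of the message can be checked with a short Python sketch. It only counts which bits differ between the hex words given in the message and tests the 1024-to-128 channel rebinning; it does not implement the Hamming code itself, and the sequence [C201], [C231], [C230] is the GIS team's proposed explanation rather than something the code can confirm. All function names are hypothetical, and the stuck-bit model assumes that only the lower 3 bits of the 10-bit pulse height are affected.

```python
def flipped_bits(a, b):
    """Return the positions (0 = least significant) of the bits in which a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def rebin_1024_to_128(ph10):
    """Bin a 10-bit PH channel into 128 channels by dropping its lower 3 bits."""
    return (ph10 & 0x3FF) >> 3


def stuck_low_bits(ph10, pattern=0b101):
    """Assumed fault model: lower 3 bits of the 10-bit PH replaced by a fixed pattern."""
    return (ph10 & 0x3F8) | pattern


# Hex values quoted in the message: the correct word, the word after the
# suspected 2-bit upset, and the word actually found by the memory check.
CORRECT, AFTER_HIT, FOUND = 0xC201, 0xC231, 0xC230

print(flipped_bits(CORRECT, AFTER_HIT))  # [4, 5]
print(flipped_bits(AFTER_HIT, FOUND))    # [0]
print(flipped_bits(CORRECT, FOUND))      # [0, 4, 5]

# Rebinning to 128 channels discards exactly the corrupted bits.
unchanged = all(
    rebin_1024_to_128(stuck_low_bits(ch)) == rebin_1024_to_128(ch)
    for ch in range(1024)
)
print(unchanged)  # True
```

The suspected initial upset touches bits 4 and 5, which are adjacent, and the word found in the memory check differs from the correct one in three bits, as stated in the message. Under the assumed fault model, every event falls in the same 128-channel bin it would have occupied without the fault.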