Biblio
One of the adverse effects of shrinking transistor sizes is that processors have become increasingly prone to hardware faults. At the same time, the number of cores per die rises. Consequently, core failures can no longer be ruled out, and future operating systems for many-core machines will have to incorporate fault tolerance mechanisms. We present CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems. Our approach overcomes surprise removal of faulty cores, and also tolerates cascading core failures. When a core fails in user mode, CSR terminates the process executing on that core and migrates the remaining processes in its run-queue to other cores. We further show how hardware transactional memory may be used to overcome failures in critical kernel code. Our solution is scalable, incurs low overhead, and is designed to integrate into modern operating systems. We have implemented it in the Linux kernel, using Haswell's Transactional Synchronization Extension, and tested it on a real system.
System administrators have unlimited access to system resources. As the Snowden case shows, these permissions can be exploited to steal valuable personal, classified, or commercial data. In this work we propose a strategy that increases the organizational information security by constraining IT personnel's view of the system and monitoring their actions. To this end, we introduce the abstraction of perforated containers – while regular Linux containers are too restrictive to be used by system administrators, by "punching holes" in them, we strike a balance between information security and required administrative needs. Our system predicts which system resources should be accessible for handling each IT issue, creates a perforated container with the corresponding isolation, and deploys it in the corresponding machines as needed for fixing the problem. Under this approach, the system administrator retains his superuser privileges, while he can only operate within the container limits. We further provide means for the administrator to bypass the isolation, and perform operations beyond her boundaries. However, such operations are monitored and logged for later analysis and anomaly detection. We provide a proof-of-concept implementation of our strategy, along with a case study on the IT database of IBM Research in Israel.