Biblio
This work seeks to advance the state of the art in HPC I/O performance analysis and interpretation. In particular, we demonstrate effective techniques to: (1) model output performance in the presence of I/O interference from production loads; (2) build features from write patterns and key parameters of the system architecture and configuration; (3) employ suitable machine learning algorithms to improve model accuracy. We train models with five popular regression algorithms and conduct experiments on two distinct production HPC platforms. We find that the lasso and random forest models predict output performance with high accuracy on both target systems. We also explore the use of the models to guide adaptation in I/O middleware systems, and show potential for improvements of at least 15% from model-guided adaptation on 70% of samples, with improvements of up to 10x on some samples, on both target systems.
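As a rough illustration of the modeling step this abstract describes, the sketch below trains lasso and random forest regressors with scikit-learn on synthetic data; the feature names and the bandwidth formula are hypothetical stand-ins, not the paper's actual feature set.

```python
# Sketch: predicting write bandwidth from I/O-pattern and platform features.
# Features and target are synthetic placeholders; the paper builds its real
# features from write patterns and system architecture/configuration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(1, 128, n),      # hypothetical feature: stripe count
    rng.uniform(1, 1024, n),      # hypothetical feature: write size (GiB)
    rng.integers(0, 50, n),       # hypothetical feature: concurrent jobs
])
# Synthetic "write bandwidth" target: degrades as interference grows.
y = 5000.0 / (1.0 + X[:, 2]) + 10.0 * np.log1p(X[:, 1]) + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "R^2 on held-out samples:", round(model.score(X_te, y_te), 3))
```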
Science gateways open the possibility of reproducible science because they integrate reusable techniques, data and workflow management systems, security mechanisms, and high performance computing (HPC). We introduce BioinfoPortal, a science gateway that integrates a suite of bioinformatics applications using HPC and data management resources provided by the Brazilian National HPC System (SINAPAD). BioinfoPortal follows the Software as a Service (SaaS) model, and its web server is freely available for academic use. The goal of this paper is to describe the science gateway and its usage, addressing the challenges of designing a multiuser computational platform for parallel/distributed execution of large-scale bioinformatics applications on Brazilian HPC resources. We also present a study of the performance and scalability of selected bioinformatics applications executed in the HPC environments, and perform machine learning analyses to predict the HPC allocation/usage features that best serve the bioinformatics applications run via BioinfoPortal.
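The machine learning analysis mentioned at the end could, for instance, be framed as learning which allocation suits a job; the sketch below is a hypothetical illustration of that framing (the features, labels, and core-count choices are invented for the example, not taken from BioinfoPortal).

```python
# Sketch: learning which HPC allocation best fits a job, framed as
# classification. All feature/label names here are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.uniform(0.1, 50, n),   # hypothetical feature: input size (GB)
    rng.integers(1, 5, n),     # hypothetical feature: application id
])
# Hypothetical label: best-performing core count among {16, 32, 64}.
y = np.where(X[:, 0] > 25, 64, np.where(X[:, 0] > 5, 32, 16))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("suggested cores for a 30 GB run of app 2:", clf.predict([[30.0, 2]])[0])
```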
This paper presents a computational platform for dynamic security assessment (DSA) of large electricity grids, developed as part of the iTesla project. It leverages High Performance Computing (HPC) to analyze large power systems, with many scenarios and possible contingencies, thus paving the way for pan-European operational stability analysis. The results of the DSA are summarized by decision trees built over 11 stability indicators. The platform's workflow and parallel implementation architecture are described in detail, including the way commercial tools are integrated into a plug-in architecture. A case study of the French grid is presented, with over 8000 scenarios and 1980 contingencies. Performance data from the case study (using 10,000 parallel cores) is analyzed, including task timings and data flows. Finally, the generated decision trees are compared with test data to quantify the functional performance of the DSA platform.
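A minimal sketch of the decision-tree summarization idea, assuming scikit-learn and synthetic data; the 11 indicator columns mirror the abstract, but the indicator values, the secure/insecure labels, and the tree depth are placeholders.

```python
# Sketch: summarizing DSA outcomes with a decision tree over stability
# indicators. Indicator values and labels below are synthetic placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n_scenarios, n_indicators = 800, 11   # the platform uses 11 stability indicators
X = rng.normal(0, 1, (n_scenarios, n_indicators))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # synthetic secure/insecure label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"indicator_{i}" for i in range(n_indicators)]))
```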
Cloud computing is the major paradigm in today's IT world, offering security management, high performance, flexibility, and scalability. Customers valuing these features can benefit further if they use a cloud environment built on an HPC fabric architecture. However, security remains a major concern, on the hardware side as well as the software side. Multiple studies show that malicious users can affect regular customers through the hardware if they are co-located on the same physical system. Solving these security concerns at the HPC fabric level would therefore give fabric vendors a clear lead in this area. In this paper, we propose an autonomic HPC fabric architecture that leverages both resilient computing capabilities and adaptive anomaly analysis for improved security.
In this paper we document our approach to service discovery and configuration of Apache Hadoop and Spark frameworks with dynamic resource allocations in a batch-oriented Advanced Research Computing (ARC) High Performance Computing (HPC) environment. ARC efforts have produced a wide variety of HPC architectures. A common HPC architectural pattern is multi-node compute clusters with low-latency, high-performance interconnect fabrics and shared central storage. This pattern enables processing of workloads with high data co-dependency, frequently solved with message passing interface (MPI) programming models and executed as batch jobs. Unfortunately, many HPC programming paradigms are not well suited to big data workloads, which are often easily separable. Our approach lowers the barriers to entry of HPC environments by enabling end users to run Apache Hadoop and Spark frameworks, which support big data programming paradigms appropriate for separable workloads, in batch-oriented HPC environments.
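One common pattern for this kind of integration (a sketch under assumptions, not necessarily the paper's mechanism) is to resolve the hosts the scheduler allocated to a batch job and then start a standalone Spark master and workers on them; SLURM_JOB_NODELIST and `scontrol show hostnames` are standard Slurm facilities, and 7077 is Spark's default master port.

```python
# Sketch: discovering the hosts Slurm allocated to this batch job so a
# standalone Spark cluster can be launched inside the allocation.
import os
import subprocess

nodelist = os.environ["SLURM_JOB_NODELIST"]          # e.g. "node[001-004]"
hosts = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],     # expands the compressed list
    capture_output=True, text=True, check=True,
).stdout.split()

master, workers = hosts[0], hosts[1:]
print(f"spark://{master}:7077 with {len(workers)} workers: {workers}")
# A launcher would then start the Spark master on `master` and one worker per
# entry of `workers` (via srun or ssh), pointing each at the master URL.
```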
Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must scale to handle the increasing number of expected errors in an energy-efficient manner. Since the purpose of running applications on HPC systems is to perform large-scale computations as quickly as possible, resilience methods should not add a large delay to the application's time to completion. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One feature of our technique is that the energy budget allocated to resilience can be adjusted to the operator's resilience needs. For example, on synthetic data, the technique detects about 50% of transient errors while using only 20% of the dynamic energy required to run the application. With a 60% energy budget, an application that uses 10k cores and takes 128 hours to run requires only 10% longer (roughly 141 hours) to complete.
Graphics processing units (GPUs) have been applied successfully in many scientific computing realms due to their superior floating-point performance and memory bandwidth, and they have great potential in power system applications. N-1 static security analysis (SSA) is a candidate application in which massive numbers of alternating current power flow (ACPF) problems need to be solved. However, when existing GPU-accelerated algorithms are applied to the N-1 SSA problem, the degree of parallelism is limited, because prior research has focused on accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of the GPU-accelerated sequential-ACPF solution is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages a massive number of QR tasks into a new larger-scale problem, achieving a higher level of parallelism and better-coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch Jacobian-matrix generation and contingency screening step is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show that a dramatic reduction in computation time is achieved compared with all reported GPU-accelerated methods. In comparison to a UMFPACK-library-based single-CPU counterpart using an Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using an NVIDIA K20C achieves a speedup of up to 57.6x. It even achieves a 4x speedup compared to one of the fastest multi-core CPU parallel computing solutions, which uses the KLU library. The proposed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte Carlo simulation and probabilistic power flow.
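The batching idea at the heart of the proposed solver can be illustrated on the CPU with NumPy's stacked-array QR (NumPy >= 1.22); this is only a sketch of the batching structure, with random matrices standing in for contingency Jacobians, and it captures none of the custom GPU kernel or coalesced-memory work the paper contributes.

```python
# Sketch: batch QR over many contingency "Jacobians" at once. One stacked
# call factorizes the whole batch, mirroring the batch-solver idea.
import numpy as np

rng = np.random.default_rng(3)
batch, n = 1980, 64                     # one random matrix per contingency
jacobians = rng.normal(0.0, 1.0, (batch, n, n))

Q, R = np.linalg.qr(jacobians)          # stacked QR: shape (batch, n, n) each
b = rng.normal(0.0, 1.0, (batch, n))

# Solve J x = b per contingency via R x = Q^T b (a dedicated triangular
# solve would be used in practice; np.linalg.solve keeps the sketch short).
Qt_b = np.swapaxes(Q, -1, -2) @ b[..., None]
x = np.linalg.solve(R, Qt_b)[..., 0]

residual = np.abs(jacobians @ x[..., None] - b[..., None]).max()
print("max residual over the batch:", residual)
```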
We propose that, to address the growing problems with complexity and data volumes in HPC security, we need to refactor how we look at data by creating tools that not only select data but also analyze and represent it in a manner well suited to intuitive analysis. We propose a set of rules describing what this means, and provide a number of production-quality tools that represent our current best effort at implementing these ideas.
In this paper, parallelization and high performance computing are utilized to enable ultrafast transient stability analysis that can be used in a real-time environment to quickly perform “what-if” simulations involving system dynamics phenomena. EPRI's Extended Transient Midterm Simulation Program (ETMSP) is modified and enhanced for this work. The contingency analysis is scaled to large-scale contingency analysis using Message Passing Interface (MPI) based parallelization. Simulations of thousands of contingencies on a high performance computing machine are performed, and the results show that parallelizing over contingencies with MPI provides good scalability and computational gains. Different ways to reduce the Input/Output (I/O) bottleneck are explored, and the findings indicate that architecting a machine with a larger local disk and maintaining a local file system significantly improves the scaling results. Thread parallelization of the sparse linear solve is also explored using the SuperLU_MT library.
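A minimal mpi4py sketch of the parallelization-over-contingencies pattern the paper uses; run_contingency is a hypothetical stand-in for one ETMSP transient stability simulation.

```python
# Sketch: MPI parallelization over contingencies. Each rank takes a slice of
# the contingency list, simulates it independently, and rank 0 gathers results.
from mpi4py import MPI

def run_contingency(c):
    """Hypothetical placeholder for one transient stability simulation."""
    return (c, f"contingency {c} simulated")

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

contingencies = list(range(1000))            # global contingency list
mine = contingencies[rank::size]             # cyclic distribution across ranks
local_results = [run_contingency(c) for c in mine]

gathered = comm.gather(local_results, root=0)   # collect at rank 0
if rank == 0:
    results = [r for part in gathered for r in part]
    print(f"{len(results)} contingencies simulated on {size} ranks")
```

Launched with, e.g., `mpiexec -n 64 python contingencies.py`, each rank runs its slice with no inter-rank communication until the final gather, which is why parallelizing over contingencies scales well.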