Biblio

Filters: Keyword is datacenter
2021-04-27
Banakar, V., Upadhya, P., Keshavan, M. 2020. CIED - rapid composability of rack scale resources using Capability Inference Engine across Datacenters. 2020 IEEE Infrastructure Conference. :1–4.
There are multiple steps involved in transitioning a server from the factory to being fully provisioned for an intended workload. These steps include finding the optimal slot for the hardware and composing the required resources on the hardware for the intended workload. Many different factors influence the placement of server hardware in the datacenter, such as physical limitations on connecting to Ethernet or storage networks, power requirements, temperature/cooling considerations, and physical space. In addition, there may be custom requirements driven by workload policies (such as security, data privacy, and power redundancy). Once the server has been placed in the right slot, it needs to be configured with the appropriate resources for the intended workload. CIED provides a ranked list of locations for server placement based on the intended workload and the connectivity and physical requirements of the server. Once the server is placed in the suggested slot, the solution automatically discovers the server and composes the required resources (compute, storage, and networks) for running the appropriate workload. CIED reduces the overall time taken to move hardware from factory to production and maximizes server hardware utilization while minimizing downtime by placing resources optimally. In the case study that was undertaken, the time taken to transition a server from factory to being fully provisioned was proportional to the number of devices in the datacenter; with CIED this time is constant irrespective of the complexity or the number of devices in the datacenter.
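The placement step described in this abstract lends itself to a simple illustration. The sketch below (Python, with hypothetical slot attributes and weights; not the actual CIED engine) shows the general shape of such a ranking: filter candidate slots by hard constraints such as power, rack space, and storage-network reach, then order the survivors by a soft score.

```python
# Illustrative sketch only (not the CIED implementation): rank candidate rack
# slots for a new server. Attribute names, weights, and the scoring formula
# are hypothetical.

from dataclasses import dataclass

@dataclass
class Slot:
    rack_id: str
    free_power_watts: int
    free_rack_units: int
    has_storage_fabric: bool
    cooling_headroom_c: float   # degrees C of thermal headroom

@dataclass
class ServerRequest:
    power_watts: int
    rack_units: int
    needs_storage_fabric: bool

def feasible(slot: Slot, req: ServerRequest) -> bool:
    """Hard constraints: the slot must physically accommodate the server."""
    return (slot.free_power_watts >= req.power_watts
            and slot.free_rack_units >= req.rack_units
            and (slot.has_storage_fabric or not req.needs_storage_fabric))

def score(slot: Slot, req: ServerRequest) -> float:
    """Soft preferences: favor spare power and cooling headroom."""
    power_margin = (slot.free_power_watts - req.power_watts) / slot.free_power_watts
    return 0.6 * power_margin + 0.4 * min(slot.cooling_headroom_c / 10.0, 1.0)

def rank_slots(slots, req):
    """Return the feasible slots, best placement first."""
    candidates = [s for s in slots if feasible(s, req)]
    return sorted(candidates, key=lambda s: score(s, req), reverse=True)
```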
2020-12-02
Islam, S., Welzl, M., Gjessing, S. 2019. How to Control a TCP: Minimally-Invasive Congestion Management for Datacenters. 2019 International Conference on Computing, Networking and Communications (ICNC). :121–125.

In multi-tenant datacenters, the hardware may be homogeneous but the traffic often is not. For instance, customers who pay an equal amount of money can get an unequal share of the bottleneck capacity when they do not open the same number of TCP connections. To address this problem, several recent proposals try to manipulate the traffic that TCP sends from the VMs. VCC and AC/DC are two new mechanisms that let the hypervisor control traffic by influencing the TCP receiver window (rwnd). This avoids changing the guest OS, but has limitations (it is not possible to make TCP increase its rate faster than it normally would). Seawall, on the other hand, completely rewrites TCP's congestion control, achieving fairness but requiring significant changes to both the hypervisor and the guest OS. There seems to be a need for a middle ground: a method to control TCP's sending rate without requiring a complete redesign of its congestion control. We introduce a minimally-invasive solution that is flexible enough to cater for needs ranging from weighted fairness in multi-tenant datacenters to potentially offering Internet-wide benefits from reduced interflow competition.
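The rwnd-based control that VCC and AC/DC rely on can be pictured with a small calculation. The sketch below is a simplified model, not code from either system, and its parameter names are hypothetical: it caps a flow's in-flight data at the bandwidth-delay product of a target rate by shrinking the window value a hypervisor would write back into ACKs. The min() makes the limitation mentioned in the abstract visible: the window can only be reduced, never enlarged.

```python
# Illustrative sketch of receive-window (rwnd) based rate control in the
# spirit of hypervisor-level schemes such as VCC and AC/DC. Simplified;
# names and parameters are hypothetical.

def clamp_rwnd(advertised_rwnd_bytes: int,
               target_rate_bps: float,
               rtt_seconds: float,
               window_scale: int = 0) -> int:
    """Return the 16-bit window value to write back into the ACK header."""
    # Bandwidth-delay product for the tenant's target rate, in bytes.
    cap_bytes = int(target_rate_bps * rtt_seconds / 8)
    # Never grow the window beyond what the guest actually advertised;
    # rwnd rewriting can only slow a sender down, not speed it up.
    new_rwnd = min(advertised_rwnd_bytes, cap_bytes)
    # Encode with the connection's window-scale factor, clipped to 16 bits.
    return min(new_rwnd >> window_scale, 0xFFFF)

# Example: cap a tenant at 1 Gbit/s on a 100 microsecond datacenter RTT.
print(clamp_rwnd(advertised_rwnd_bytes=1 << 20,
                 target_rate_bps=1e9,
                 rtt_seconds=100e-6,
                 window_scale=7))
```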

2019-03-04
Lin, F., Beadon, M., Dixit, H. D., Vunnam, G., Desai, A., Sankar, S. 2018. Hardware Remediation at Scale. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). :14–17.
Large-scale services have automated hardware remediation to maintain infrastructure availability at a healthy level. In this paper, we share the current remediation flow at Facebook and how it is monitored. We discuss a class of hardware issues that are transient and typically have higher rates during heavy load. We describe how our remediation system was enhanced to be efficient in detecting this class of issues. As hardware and systems change in response to advances in technology and scale, we have also utilized machine learning frameworks for hardware remediation to handle the introduction of new hardware failure modes. We present an ML methodology that uses a set of predictive thresholds to monitor remediation efficiency over time. We also deploy a recommendation system based on natural language processing to recommend repair actions for efficient diagnosis and repair. Finally, we describe current areas of research that will enable us to improve hardware availability further.
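As a rough illustration of the predictive-threshold idea mentioned in the abstract, the sketch below (hypothetical names and constants; not Facebook's production system) computes a lower control limit from a trailing window of daily remediation-efficiency samples and flags a day that falls below it.

```python
# Illustrative sketch only: monitor remediation efficiency against a
# predictive threshold learned from recent history (mean minus k standard
# deviations). All names and constants are hypothetical.

from statistics import mean, stdev

def remediation_efficiency(fixed: int, attempted: int) -> float:
    """Fraction of remediation attempts that actually fixed the host."""
    return fixed / attempted if attempted else 1.0

def predictive_threshold(history, k: float = 3.0) -> float:
    """Lower control limit derived from recent efficiency samples."""
    return mean(history) - k * stdev(history)

def efficiency_anomaly(history, today: float) -> bool:
    """Return True if today's efficiency is anomalously low."""
    return today < predictive_threshold(history)

# Example: efficiency has hovered around 0.95, then drops to 0.80.
past = [0.96, 0.95, 0.94, 0.95, 0.97, 0.93, 0.96]
print(efficiency_anomaly(past, today=0.80))   # True -> flag for investigation
```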
2018-02-21
Hu, Y., Hara, H., Fujiwara, I., Matsutani, H., Amano, H., Koibuchi, M. 2017. Towards Tightly-coupled Datacenter with Free-space Optical Links. Proceedings of the 2017 International Conference on Cloud and Big Data Computing. :33–39.

Clean-slate design of computing systems is an emerging topic for the continuing growth of warehouse-scale computers. A well-known custom design is rackscale (RS) computing, which treats a single rack as a computer consisting of a number of processors, storage devices and accelerators customized to a target application. In RS, each user is expected to occupy one or more racks. However, new users frequently appear, and users often change their application scales and parameters, which may require different numbers of processors, storage devices and accelerators in a rack. Reconfiguring the interconnection network among these components is potentially needed to support this demand in RS. In this context, we propose the inter-rackscale (IRS) architecture, which disaggregates hardware resources into different racks according to their own areas. The heart of IRS is to use free-space optics (FSO) for tightly-coupled connections between processors, storage devices and GPUs distributed in different racks, swapping the endpoints of FSO links to change network topologies. Through a large IRS system simulation, we show that by utilizing FSO links for interconnection between racks, the FSO-equipped IRS architecture can provide communication latency between heterogeneous resources comparable to that of the counterpart RS architecture. Using 3 FSO terminals per rack improves inter-CPU/SSD(GPU) communication by at least 87.34% over Fat-tree and by at least 92.18% over 2-D Torus. We also verify the advantages of IRS over RS in job scheduling performance.
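The benefit of steerable FSO links in IRS can be pictured with a toy hop-count model. The sketch below (a hypothetical model, not the paper's simulator) compares the shortest path between a CPU rack and a GPU rack over a small electrical fabric with and without a direct FSO link between the two racks.

```python
# Illustrative sketch only: hop count between disaggregated resources in two
# racks, with and without a steerable free-space optical (FSO) link.
# The topology and names are hypothetical.

from collections import deque

def hops(adjacency: dict, src: str, dst: str) -> int:
    """Breadth-first search for the minimum hop count between two nodes."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    raise ValueError("unreachable")

# A tiny two-level electrical fabric: racks -> aggregation -> core.
fabric = {
    "cpu_rack": ["agg1"], "gpu_rack": ["agg2"],
    "agg1": ["cpu_rack", "core"], "agg2": ["gpu_rack", "core"],
    "core": ["agg1", "agg2"],
}
print(hops(fabric, "cpu_rack", "gpu_rack"))   # 4 hops via the electrical fabric

# Steer one FSO terminal on each rack to point at the other rack.
fabric["cpu_rack"].append("gpu_rack")
fabric["gpu_rack"].append("cpu_rack")
print(hops(fabric, "cpu_rack", "gpu_rack"))   # 1 hop over the FSO link
```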