Data Center Infrastructure Management: A Comprehensive Guide
A data center is the combination of the infrastructure required to house, transmit, or process data in a facility and the team of employees that operates it. Data center infrastructure management (DCIM) is the combination of installation, maintenance, and business best practices associated with the operations of that facility.
In the installation and design phase of the data center, DCIM is functionally a method for understanding the scope and needs of the facility via simulation or other forecasting methods. After installation, when a maintenance plan is set in place, DCIM represents the means to break down maintenance requirements and help craft an effective support model that mitigates risks and addresses problems as they arise. Finally, with the installation and a maintenance plan complete, DCIM is largely about the management structure and business practices associated with the DCIM team.
The term “DCIM” is often associated with the tools that support the delineation and monitoring of the operational capacity of a facility. However, it’s important to note that DCIM is about more than just that: It is fundamentally about a specific way of thinking that positively impacts every phase of a data center’s creation and operation.
Summary of key data center infrastructure concepts
Data center infrastructure management captures three core aspects of a data center that often overlap but can be separated as follows:
- Installation is the process of using DCIM tools to evaluate a system architecture and formulate a plan for how a data center should be built.
- Maintenance, which encompasses both preventative and responsive methodologies, is a set of guidelines or principles that govern how DCIM systems keep the data center running.
- Management generally relates to the business operations of a facility and also covers the overall infrastructure organization necessary to support the data center.
The table below provides more detail on these key concepts.
Concept | Description |
What is DCIM? | DCIM encompasses the design, installation, maintenance, and business management processes intended to meet facility infrastructure and operational needs. |
Installation | Installation includes facility design, interfaces with external stakeholders, and the planning associated with data center development. |
Maintenance | Maintenance includes the support structures, plans, and best practices needed to meet customer and facility requirements. |
Management | Management, in the context of DCIM, includes all tasks associated with meeting data center management goals; it centers on business acquisition, management hierarchies, and overall site standards of operation. |
What is DCIM?
Data center infrastructure management is often described in terms of the specific software or tools used to manage the electrical, thermal, and network infrastructures of a data center. However, DCIM software varies widely in its capabilities and limitations, and it is only one piece of the DCIM puzzle. DCIM as a whole combines the lengthy process of planning and building a data center, the facility's maintenance model, and the overall management of the business and its component systems. Under this definition, DCIM is a framework for organizing and operating a data center.
The three main topics associated with DCIM are installation, maintenance, and management. Within each of these topics, there are a number of important facets that will determine the scope and key concerns involved in data center execution.
Installation includes building a data center, integrating equipment into a system, and coordinating stakeholders:
- Floorplan and physical elements: Walls, flooring, ceiling, and volumetric capacity
- External stakeholders: Power utilities, construction teams, and network access providers
- Integration process/plan: Determining how racks and other hardware are installed and ensuring compliance with specific site standards
Maintenance covers upkeep and the support model after the center is built, including the following:
- Best practices for infrastructure-specific maintenance models
- Best practices for tiered levels of support associated with data center use cases
- Environmental management systems associated with DCIM structures
Management deals with the business model of the data center, incorporating infrastructure usage and needs:
- Customer acquisition and support based on data center use cases
- Staffing and resource allocation to match infrastructure design
- Best practices for management models and DCIM organization plans
Installation
Installing a data center is a non-trivial process that requires thorough planning of the physical layout of the space, the internal infrastructure, and both present and future facility capacity. It involves working with external stakeholders to arrange the power, thermal, and network capabilities. It also means working out how the data center will support hardware that the operator owns and runs itself, as well as rack components owned by tenants.
All of this can be described, in general, as the use case of the data center: how the facility will be operated as either an extension of a single company or within a host/tenant agreement. Across these specific concerns, DCIM is used to understand the goals of the system architecture and finalize a design that fulfills them.
Physical design
The physical design of the space can be loosely delineated in terms of the floorplan, wall structures, flooring, ceiling, and finally, the volumetric capacity.
The floorplan tells contracted builders where rack space (white space) will sit relative to the support areas: the gray space that typically holds back-end equipment such as power and cooling gear, plus any office space. Following the floorplan, the builders then establish composite wall structures, floor panels, and ceiling construction based on DCIM specifications. Office walls and other walls that are not adjacent to hardware may be built without thermal insulation and may omit the conduit or other fittings used to run electrical and network cables.
The floor and ceiling are designed around the thermal infrastructure, typically using a raised floor and a ceiling plenum: cool air is pushed up from the floor, passes through the hardware where it absorbs heat, and exits into the void space in the ceiling. The air is then chilled and returned to the floor, completing the cycle.
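As a rough worked example of the cooling loop described above, the volume of air that must move through the hardware can be estimated from the sensible-heat relation P = rho * V * cp * dT. The sketch below is illustrative only: the air properties, the 10 kW heat load, and the 12 K temperature rise are assumed values, and the function name is hypothetical rather than part of any DCIM product.

```python
# Sensible-heat estimate: P = rho * V * cp * dT  =>  V = P / (rho * cp * dT)
AIR_DENSITY_KG_M3 = 1.2            # assumed: air near 20 degrees C at sea level
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0  # assumed: specific heat of air


def required_airflow_m3_per_s(heat_load_w: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to remove heat_load_w with a delta_t_k air temperature rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_load_w / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_k)


# Assumed example: a 10 kW row of racks with a 12 K rise from floor to ceiling.
airflow = required_airflow_m3_per_s(heat_load_w=10_000, delta_t_k=12)
print(f"{airflow:.3f} m^3/s")  # about 0.691 m^3/s
```

A real design would also account for air leakage, bypass airflow, and humidity, so a figure like this is only a starting point for sizing.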
External stakeholders
The design and installation of a data center depend largely on the services provided by external stakeholders, such as the power utility, the water utility, and network service providers. During installation, DCIM requires working with these stakeholders to determine how much power is available and what networking bandwidth and drops are provided (such as fiber optic or coaxial cable).
Data center infrastructure management teams responsible for the design and installation of the data center must work with external stakeholders to pass along key requirements based on facility needs and service provider limitations.
Integration planning
One of the most important considerations in the design and installation of a data center is the integration step, where key infrastructure and operational hardware are installed so that they operate as a cohesive whole. This includes generating installation timelines, accounting for dependencies in schedule and system requirements, and finally physically installing system components and testing them to confirm operability.
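The schedule dependencies mentioned above can be sketched as a dependency graph that is sorted into a valid installation order. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9+); the task names are hypothetical and are not drawn from any particular project plan.

```python
from graphlib import TopologicalSorter

# Hypothetical integration tasks, each mapped to the tasks it depends on.
dependencies = {
    "install_racks": {"raised_floor_complete"},
    "run_power_to_racks": {"install_racks", "utility_power_live"},
    "run_network_cabling": {"install_racks", "fiber_drop_delivered"},
    "acceptance_test": {"run_power_to_racks", "run_network_cabling"},
}

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
for step, task in enumerate(order, start=1):
    print(step, task)
```

If two tasks ever depend on each other, `TopologicalSorter` raises `graphlib.CycleError`, which is a useful signal that the plan itself needs rework.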
Examples of operational hardware include hardware racks, network routing/storage equipment, and electrical components used for power management and allocation. In the transition from a paper plan to the physical installation of hardware, DCIM will be used to ensure that all specifications are met and that capacity planning has been done in accordance with external stakeholder capabilities.
Similarly, once a data center is operational, installing new rack components or hardware will depend on DCIM support to ensure that all infrastructure requirements are met. DCIM will also be used to test installed hardware so that it meets all operational specifications and has been effectively integrated into the facility. This is the case for any ownership or usage model of a data center: Whether the installed hardware is owned by a separate entity (tenant in a host/tenant arrangement) or the data center proprietor, integration is directly connected to DCIM.
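One way to picture the capacity check that precedes integrating a new rack is a simple power-budget comparison. The sketch below is an assumption-laden illustration: the `Rack` type, the function name, the 30 kW feed, and the 80% headroom factor are all hypothetical, and a real facility would take these figures from its own site standards and utility agreements.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    power_kw: float  # expected peak draw of the rack


def fits_power_budget(installed, new_rack, feed_capacity_kw, headroom=0.8):
    """Return True if total draw stays within headroom * feed_capacity_kw."""
    total_kw = sum(r.power_kw for r in installed) + new_rack.power_kw
    return total_kw <= feed_capacity_kw * headroom


# Hypothetical row with two racks already installed on a 30 kW feed.
installed = [Rack("A1", 6.0), Rack("A2", 8.5)]
ok = fits_power_budget(installed, Rack("A3", 7.0), feed_capacity_kw=30.0)
too_big = fits_power_budget(installed, Rack("A4", 12.0), feed_capacity_kw=30.0)
print(ok, too_big)  # True False
```

The same pattern could be repeated for cooling capacity, floor loading, or network ports, with each check using its own limit.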
Maintenance
Data center maintenance focuses on the support required for upkeep and problem resolution to maintain the overall operation of the facility. In terms of DCIM, maintenance of a data center will be functionally connected to the types of infrastructure present in the system, as well as the model or approach used by the management team.
In general, a tiered approach is best practice, as this helps organize the levels of support by difficulty, urgency, or required technical skill. Tiers represent the importance and difficulty of maintenance and are described further below.
Finally, the tool sets associated with DCIM, typically software solutions, revolve around monitoring key infrastructure. A maintenance team uses these monitoring tools to prioritize maintenance work and to notice when a problem arises.
Infrastructure maintenance
The three foundational infrastructures associated with a data center (electrical, thermal, and network) each present unique maintenance challenges. DCIM depends on individualized maintenance approaches to ensure continued operation without fault or issue in each area.
Response-based electrical maintenance may take the form of repairing broken cables, conduit, or outlets. Preventative maintenance is equally important and may include performing electrical harmonic analysis on new racks before integration or carrying out general upkeep of power distribution equipment.
In terms of thermal infrastructure, maintenance may involve cleaning air filters, assessing air damper performance in the ceiling, and monitoring ambient air temperature.
Finally, on the network side, maintenance may include the testing of fiber and Ethernet cables, performing software or firmware upgrades, and conducting security scans on installed hardware to detect potential breaches.
Tiered maintenance
A tiered approach is often used to match maintenance issues to the right resources, based on the needs of specific hardware components and of the individual tenants/customers with racks installed in the data center. Tiers are rated on a scale of 1-5 (sometimes 0-4), ranging from superficial maintenance of shared spaces, such as cleaning, debris removal, or air filter replacement, all the way to highly technical troubleshooting. For example, Tier 3 (or 2 in a 0-4 scheme) maintenance may require specific standard operating procedures (SOPs) to resolve an issue, such as instructions for replacing faulty equipment or for moving data from one workstation to another while a system is being repaired. In contrast, Tier 1 (or 0) support could be as simple as changing a warning/alarm lightbulb or a localized rack air filter. Tier 5 (or 4) almost always involves an external vendor or the designer of a hardware component.
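The tier scheme above could be modeled as a small lookup that routes an issue to a responder. This is only a sketch: the text describes Tiers 1, 3, and 5, so the labels for Tiers 2 and 4 and all of the responder names below are hypothetical placeholders.

```python
from enum import IntEnum


class Tier(IntEnum):
    ROUTINE = 1      # cleaning, filter or indicator-bulb swaps
    BASIC = 2        # placeholder label; not defined in the text
    SOP_DRIVEN = 3   # resolved by following a documented SOP
    SPECIALIST = 4   # placeholder label; not defined in the text
    VENDOR = 5       # external vendor or hardware designer


# Hypothetical responders for each tier.
ASSIGNEE = {
    Tier.ROUTINE: "facilities staff",
    Tier.BASIC: "on-site technician",
    Tier.SOP_DRIVEN: "on-site technician (following SOP)",
    Tier.SPECIALIST: "infrastructure engineer",
    Tier.VENDOR: "external vendor",
}


def route(tier: Tier) -> str:
    """Return the responder assigned to a maintenance tier."""
    return ASSIGNEE[tier]


def to_zero_based(tier: Tier) -> int:
    """Convert a 1-5 tier to the equivalent level in a 0-4 scheme."""
    return int(tier) - 1


print(route(Tier.SOP_DRIVEN), to_zero_based(Tier.SOP_DRIVEN))
```

Keeping the mapping in one table makes it easy to adjust responders per tenant agreement without touching the routing logic.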
Environmental monitoring
Regardless of the tier level of support or the specific infrastructure that may be the subject of maintenance, monitoring tools are an effective way to collate data and assess situations. In the context of DCIM, system maintainers would likely be responsible for the installation of these monitors as well as the overall management of potential flags or warnings. One example could be network monitoring tools used to identify high-traffic conditions or security breaches. A DCIM network environment monitoring tool would be used to identify the issue, in this case either reduced capacity or a critical security risk, and then help the maintenance team determine the next steps.
Root cause analysis will likely depend on the data archived by the monitoring network, meaning that an effective information system will help decrease the impact of the issue by highlighting potential solutions. In the example presented above—high-traffic conditions or a security breach—this might mean shifting network traffic from one router to another or removing compromised hardware from the system.
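At its simplest, the monitoring described above amounts to comparing readings against limits and keeping every reading for later analysis. The sketch below assumes hypothetical sensor names, metric names, and threshold values; it does not reflect the interface of any specific DCIM tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    sensor: str
    metric: str
    value: float


# Hypothetical alert limits; real values come from site standards.
LIMITS = {"inlet_temp_c": 27.0, "link_utilization_pct": 85.0}


def check(reading: Reading) -> Optional[str]:
    """Return an alert message if the reading exceeds its limit, else None."""
    limit = LIMITS.get(reading.metric)
    if limit is not None and reading.value > limit:
        return f"{reading.sensor}: {reading.metric}={reading.value} exceeds {limit}"
    return None


# The full list of readings doubles as the archive used for root cause analysis.
archive = [
    Reading("rack-A1", "inlet_temp_c", 24.5),
    Reading("router-1", "link_utilization_pct", 92.0),
    Reading("router-2", "link_utilization_pct", 40.0),
]
alerts = [msg for r in archive if (msg := check(r)) is not None]
print(alerts)
```

Here only `router-1` raises an alert, which in the high-traffic example would be the cue to consider shifting traffic toward a less loaded router.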
Management
DCIM is a collection of installation and maintenance structures, but it also covers business operations. Generally speaking, data center management can be described as a process where the use case of the data center sets requirements for how a facility is operated.
The management of infrastructure and the various hardware elements of a system is shaped by the overall mission of the data center. If, for example, a data center is a shared workspace or colocation (COLO) facility, then data center management also includes handling the individual needs of tenants/customers. In that case, the DCIM team will also need to provide the business functions of customer acquisition and support.
Finally, a wide variety of standards and management styles can be leveraged to increase operational efficiency and decrease the downtime associated with critical hardware failures. Implementing one of these management models will affect both the design and the operation of a data center, and it may add overhead either at initial deployment or as an ongoing sustainment cost. Each management system therefore has its pros and cons, and it is up to the DCIM team and parent organization to determine which one best fits their operational goals.
Customer acquisition and support
Data center management often requires the navigation of business concepts such as customer acquisition, supply chain management, and overall support. This may take the form of finding new tenants to occupy space in the data center and managing various rent/maintenance agreements based on customer needs.
Management also includes organizing an adequate sparing posture to ensure that critical infrastructure components can be replaced without incident. This means a strong handle on the supply chain, spanning the processes of ordering, shipping, receiving, and inventory management.
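A sparing posture is often expressed as a reorder point: the stock level at which replacements should be ordered so that spares do not run out during the supplier's lead time. The sketch below uses a common textbook rule (expected demand over the lead time plus a safety stock); the rule, the function names, and all of the numbers are assumptions for illustration rather than a method prescribed by the text.

```python
import math


def reorder_point(failures_per_month: float, lead_time_months: float,
                  safety_stock: int) -> int:
    """Spares expected to be consumed during the lead time, plus a safety buffer."""
    return math.ceil(failures_per_month * lead_time_months) + safety_stock


def needs_reorder(on_hand: int, on_order: int, point: int) -> bool:
    """True when spares on hand plus spares already ordered are at or below the reorder point."""
    return on_hand + on_order <= point


# Hypothetical part: 2 failures per month, a 1.5-month lead time, 1 spare as buffer.
point = reorder_point(failures_per_month=2, lead_time_months=1.5, safety_stock=1)
print(point, needs_reorder(on_hand=3, on_order=0, point=point))  # 4 True
```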
Finally, when problems arise with the services being provided, the data center management team will have to take on the role of customer service and help resolve them, which requires services such as a call center and ticketing systems.
DCIM team
As mentioned before, the DCIM team is a broad, cross-functional group that can include technical advisors, business representatives, and even supply chain managers. Of course, the team must also include the specialists or managers who are responsible for critical infrastructure.
The ideal makeup of the broader DCIM team depends on the infrastructure used in the system and the maintenance expectations for service. For example, it will likely include thermal engineers who are equipped to design and maintain the thermal layout of a facility, provide support when new systems are being integrated, and help organize a maintenance model that suits all occupants in the data center.
Management best practices
There is no single management model that captures all the potential issues in the operation of a data center. However, relying on international standards, such as those published by ISO, can help establish a benchmark for security and safety.
In addition, having a strong organizational structure with well-defined roles and responsibilities will help escalate potential problems up the chain of command. Having boards or technical exchange meetings (TEMs) on a regular basis will help the team organize new integration efforts and may help mitigate risks before they are realized as issues.
Finally, hiring the right people and having a clear understanding of the statement of work will help prevent design/maintenance oversights that might otherwise occur in the course of operations.
Conclusion
DCIM is a broad concept that encompasses the needs of a data center from design through full operation. Critical infrastructure such as power, thermal/HVAC systems, and networking components must be designed and maintained based on a model suited to the end goals of the facility. Environmental monitoring tools for safety, security, and problem-solving expediency are also necessary features of DCIM support. Finally, having a strong organizational structure that allows for sustainable growth is fundamental to the success of the DCIM team and the data center as a whole.