A faulty software update, delivered through a seemingly routine IT operating procedure, has once again caused a global meltdown of IT systems, disrupting millions of airline and rail passengers, telecom users, banking customers, patients waiting for healthcare procedures, and many others who rely on technology for their everyday needs.
According to Forbes, Microsoft and the cybersecurity firm CrowdStrike have issued fixes for the outage that impacted Windows devices starting on July 19, 2024, though lingering issues continue to cause major disruptions to IT operations globally.*
In an IT world moving steadily toward hyper-automation and AI-driven operations, a software defect, a newly discovered vulnerability, or a faulty software update like this one earns the vendor, CrowdStrike in this case, a spot on the front pages of most technology and business media outlets, accompanied by quotes from angry customers and users affected by the update that brought down Windows machines globally. The reputational damage extends well beyond the vendor, reaching the cascading list of applications and environments that rely on the faulty software component.
The CrowdStrike error is related to the Falcon Sensor running on Windows hosts.** The impact is amplified by the fact that, although the resolution is simple and straightforward in some cases, it requires finding the impacted machines and devices across the IT landscape. In many cases rebooting and reinstalling cannot be done remotely, so IT staff need to physically access the device and its operating system.
A forecast by International Data Corporation (IDC) estimates that there will be 41.6 billion Internet of Things (IoT) devices in 2025,*** and the expansion of devices of all sorts (IoT, servers, desktops, mobile, embedded, hybrid cloud, and on premises) is forecast to continue at an accelerated pace.
As technology spreads to every domain of our lives and IoT-connected devices multiply, often in remote and inaccessible locations, the ability to quickly identify and assess the impact of such incidents becomes critical to the rapid recovery and restoration of services, ranging from essential infrastructure and safety-critical environments to revenue-generating, digital entertainment, and customer services.
This is where comprehensive IT Asset Management (ITAM) best practices become essential for technology vendors and enterprise organizations. ITAM provides oversight and control over the breadth of IT assets, including inventory management, automated discovery, location mapping, and isolation of resources deployed to specific environments. Advanced ITAM solutions cover the broad spectrum of critical assets, including diverse device types deployed in hybrid clouds or on premises, and both the hardware and software characteristics of these devices.
By maintaining accurate asset records and relationships between assets (or configuration items) in the Device42 Configuration Management Database (CMDB), companies can manage IT proactively and react quickly to failures regardless of the source, thereby minimizing downtime and ensuring that business-critical and essential services operate with minimal interruption.
The CMDB enables efficient ITAM and ITSM workflows and provides a centralized repository of the organization's IT assets, along with attached discovered and augmented information, so IT teams can access the data they need while an incident, like the faulty upgrade, affects their organization. Mapping virtual to physical assets and tracking the location of those assets significantly speeds up the work of on-site IT technicians, who must find impacted devices and perform the manual restarts and administrative tasks required to resolve the Falcon Sensor error.
Advanced ITAM platforms such as Device42 also allow the enterprise to quickly map the potential impact of an outage by examining the dependency maps between business services and the underlying network, hardware, and software components. This covers both the impact of the ongoing device outage and the potential impact of any change request required for remediation.
To see how effective ITAM practices and solutions in the enterprise IT environment accelerate problem resolution and incident response, let’s take an in-depth look at the July 19 update outage. According to the CrowdStrike incident analysis, crashes on Windows hosts related to the Falcon Sensor can be identified based on the following:**
- Hosts experiencing a bug check/blue screen error related to the Falcon Sensor (in many cases this results in inaccessible and unresponsive devices)
- Windows hosts that were brought online after 0527 UTC, or that run Windows 7/2008 R2, are not impacted
- Mac- and Linux-based hosts are not impacted
- Channel file “C-00000291*.sys” with a timestamp of 0409 UTC is the problematic version
- The file is located in the %WINDIR%\System32\drivers\CrowdStrike directory; navigate there to delete it
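The channel file criterion above can be sketched as a small check. This is an illustrative sketch only: the function name is hypothetical, the date is taken from the July 19, 2024 incident date, and matching the timestamp to the minute is an assumption rather than part of the vendor guidance.

```python
from datetime import datetime, timezone
from fnmatch import fnmatch

# Pattern and time of day come from the criteria listed above.
# Assumption: the timestamp is compared at minute precision.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
BAD_TIMESTAMP_UTC = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)


def is_problematic_channel_file(file_name: str, modified_utc: datetime) -> bool:
    """Return True if the name matches the pattern and the timestamp is 04:09 UTC."""
    # Lowercase both sides so the match behaves the same on every platform.
    name_matches = fnmatch(file_name.lower(), CHANNEL_FILE_PATTERN.lower())
    # Drop seconds and microseconds before comparing against 04:09 UTC.
    same_minute = modified_utc.replace(second=0, microsecond=0) == BAD_TIMESTAMP_UTC
    return name_matches and same_minute
```

The function expects a timezone-aware timestamp; a naive one never compares equal and is therefore reported as not problematic.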
CrowdStrike published a fix and resolution steps for on-premises and cloud environments such as AWS and Azure. These steps are fairly simple once the impacted devices are identified. However, in incident management workflows like this one, precious time is usually spent locating what is impacted, determining the root cause, and identifying the business services that depend on the affected components.
An advanced CMDB and ITAM solution such as Device42 offers several ways to quickly find affected devices, given the detailed classification available in this specific case. The same queries and similar techniques can be used together or independently when fewer details are available from vendors, media, and user communities, or very early in the response cycle, when the first IT service requests arrive before incident details are broadly published.
Identifying specific software components, running services or vendors:
The report below lists devices that contain the impacted CrowdStrike software component from the July 19 incident.
The report is generated by running the following Device42 CMDB query to identify devices impacted by the CrowdStrike Falcon Sensor update error running on Windows hosts:
select
    d.name as device_name,
    d.type,
    d.os_name,
    to_char(d.os_last_edited, 'YYYY-MM-DD') as os_last_edited,
    to_char(d.last_discovered, 'YYYY-MM-DD') as last_discovered,
    s.name as software_name,
    v.name as vendor_name,
    siu."version"
from view_softwareinuse_v1 siu
left join view_software_v1 s on s.software_pk = siu.software_fk
left join view_vendor_v1 v on v.vendor_pk = s.vendor_fk
left join view_device_v2 d on d.device_pk = siu.device_fk
where 1=1
    -- Windows hosts only ('%win%' would also match names such as 'darwin')
    and lower(d.os_name) like '%windows%'
    -- Windows 7 and 2008 R2 are not impacted
    and lower(d.os_name) not like '%windows 7%'
    and lower(d.os_name) not like '%2008 r2%'
    -- CrowdStrike Falcon Sensor software
    and lower(v."name") like '%crowdstrike%'
    and lower(s."name") like '%sensor%';
While this query specifically identifies devices impacted by the July 19 incident, it provides an example for queries that could be used in similar incidents to identify an impacted software component.
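When query results have already been exported, for example to a spreadsheet or CSV file, the same kind of filter can be applied outside the CMDB. The sketch below is illustrative only: it assumes rows keyed by the column aliases used in the query (device_name, os_name, vendor_name, software_name), applies the Windows 7 and 2008 R2 exclusions from the incident criteria, and uses made-up sample rows.

```python
def is_candidate(row: dict) -> bool:
    """Return True if the row looks like a Windows host running the CrowdStrike sensor."""
    os_name = (row.get("os_name") or "").lower()
    vendor = (row.get("vendor_name") or "").lower()
    software = (row.get("software_name") or "").lower()
    if "windows" not in os_name:
        return False
    # Windows 7 and 2008 R2 are listed as not impacted.
    if "windows 7" in os_name or "2008 r2" in os_name:
        return False
    return "crowdstrike" in vendor and "sensor" in software


# Hypothetical sample rows, for illustration only.
rows = [
    {"device_name": "host-a", "os_name": "Windows Server 2019",
     "vendor_name": "CrowdStrike", "software_name": "Falcon Sensor"},
    {"device_name": "host-b", "os_name": "Windows 7 Professional",
     "vendor_name": "CrowdStrike", "software_name": "Falcon Sensor"},
    {"device_name": "host-c", "os_name": "Ubuntu 22.04",
     "vendor_name": "CrowdStrike", "software_name": "Falcon Sensor"},
    {"device_name": "host-d", "os_name": "Windows 10",
     "vendor_name": "Other Vendor", "software_name": "Backup Agent"},
]

candidates = [r["device_name"] for r in rows if is_candidate(r)]
print(candidates)
```

Only host-a passes every condition in this sample; the other rows are dropped for operating system, excluded version, or vendor.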
Identifying hosts that are no longer responding in the latest discovery:
The report below lists devices that are no longer responsive or available. This type of report allows IT professionals to quickly identify changes in their infrastructure, isolate the problem, and respond accordingly.
The report is generated by running the following Device42 CMDB query. This query can be used for similar use cases where the IT team needs to quickly identify devices that are no longer available or experience some other change that may be related to an incident.
select
    d.name as device_name,
    d.type,
    d.os_name,
    to_char(d.last_discovered, 'YYYY-MM-DD') as device_last_discovered
from view_device_v2 d
where 1=1
    -- seen by discovery within the last two days, but not yet today
    and d.last_discovered >= current_date - interval '2 days'
    and d.last_discovered < current_date;
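The date window in the query above can be easier to reason about when written out as a small function. The sketch below is a simplified illustration that works on whole dates rather than timestamps; the function name is hypothetical, and the approach assumes discovery jobs run at least daily, so a device not seen today but seen in the previous two days is worth checking.

```python
from datetime import date, timedelta


def went_quiet(last_discovered: date, today: date, window_days: int = 2) -> bool:
    """Return True if the device was seen within the window but not yet today."""
    window_start = today - timedelta(days=window_days)
    return window_start <= last_discovered < today


incident_day = date(2024, 7, 19)
print(went_quiet(date(2024, 7, 18), incident_day))  # seen yesterday, not today
print(went_quiet(date(2024, 7, 19), incident_day))  # already seen today
print(went_quiet(date(2024, 7, 10), incident_day))  # outside the window
```

Widening window_days catches devices on less frequent discovery schedules, at the cost of including hosts that were already offline before the incident.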
Identifying hosts on which specific operating systems are deployed while excluding devices with operating systems not affected:
The dashboard below displays a breakdown of devices by operating system and version. Because the operating systems and versions impacted by the CrowdStrike failure are fully documented (Windows versions other than Windows 7 and Windows Server 2008 R2), this report allows IT administrators to identify devices that need to be examined further for remediation. It is one of many dashboards that provide quick access to reports addressing specific use cases. The reports, which can be customized further if needed, are available under Insights+ in the Device42 main appliance user interface.
Evaluating all the dependencies of devices and business services of an impacted device:
The example below displays a specific Windows device dependency which impacts a business application. The dependency maps are displayed in the Device42 main appliance user interface.
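Conceptually, evaluating the impact of one failed device means walking the dependency map outward from that device. The sketch below illustrates the idea with a breadth-first traversal over a plain dictionary; it is not how Device42 computes its dependency maps, and all device, service, and application names are hypothetical.

```python
from collections import deque


def impacted_by(start: str, dependents: dict[str, list[str]]) -> list[str]:
    """List everything that depends, directly or indirectly, on start (breadth-first)."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Hypothetical map: each key lists the items that depend on it.
dependents = {
    "win-host-01": ["sql-service"],
    "sql-service": ["billing-app", "reporting-app"],
    "billing-app": ["Customer Billing"],
}

print(impacted_by("win-host-01", dependents))
```

Tracking visited nodes keeps the walk safe on maps that contain cycles, which are common when services depend on each other.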
Utilizing these common CMDB queries and ITAM best practices allows the enterprise IT organization to standardize on Device42 to respond quickly and manage incidents effectively, including not only the July 19 CrowdStrike Windows outage but also other incidents ranging from security vulnerabilities and cyber attacks to software defects and human errors.
Sources:
**https://www.crowdstrike.com/blog/statement-on-windows-sensor-update/