Transactions on Digital Practice Vol.6 No.1 (Jan. 2025)

Efficient Curation of ICS Cybersecurity Information Using Large Language Models

Wataru Matsuda1  Mariko Fujimoto2  Takuho Mitsunaga2  Kenji Watanabe1

1Nagoya Institute of Technology, Nagoya, Aichi 466–8555, Japan  2Toyo University, Kita, Tokyo 115–8650, Japan 

In recent years, control systems have advanced rapidly and are increasingly connected to IT networks and the Internet. In environments where IT and Industrial Control Systems (ICS) are interconnected, there is a risk of intrusion via the IT network. Because IT technologies are now integrated into ICS, it is crucial to consider IT attack risks in ICS environments in addition to ICS-specific attacks. A vast amount of information on attack tools and cyberattack reports has been published. Security analysts must analyze or meticulously read this information to determine whether the attacks are relevant to their organization and how to defend against them, which necessitates a curation process. However, properly understanding the content of all published attack methods and reports requires significant resources, including costs and skills based on experience. Therefore, this research investigates the practical use of Large Language Models (LLMs) for efficiently extracting information beneficial to an organization's security measures. Specifically, we examined whether it is possible to identify, from public information, protocols and ports that could be exploited in attacks. This information is helpful in preventing or monitoring these attacks using tools such as firewalls, even if timely security updates are difficult. This examination was conducted from the following two perspectives:
・Extracting port numbers to be protected and monitored against attacks targeting IT networks, especially Windows environments, based on Proof of Concept (PoC) information on the Internet.
・From the perspective of ICS networks, extracting exploited protocols, port numbers, and product names from past ICS-related reports.
The goal of the research is to prepare for attacks in advance by identifying exploitable products and protocols. The results obtained from the proposed method can be utilized for mitigation and enhanced monitoring. Furthermore, they can also be applied to risk assessment and penetration testing. Using the proposed method, we were able to extract port numbers with a potential for misuse in IT attacks with a 60.0% correct response rate. For ICS, we achieved an 81.8% correct response rate in extracting potentially exploited port numbers and protocol names, and a 72.7% correct response rate in identifying target products.

curation, industrial control system, LLM

1. Introduction

ICS were traditionally considered to pose a low cybersecurity risk because their networks were isolated from other systems. However, control systems have changed rapidly in recent years and are increasingly connected to IT networks and the Internet. Furthermore, generic industrial protocols implemented over Ethernet are becoming common. The use of generic technologies makes a wealth of information and tools openly available, which facilitates attack activities. When ICS networks are connected to other networks, there is a security risk of intrusion into the ICS network through the adjacent networks. To identify and prevent such cyber intrusions, the SANS Institute has defined an intelligence-led defense model known as the “Cyber Kill Chain of ICS” [1]. According to this model, attack activities are categorized into attacks on IT systems and attacks on ICS. For instance, attackers might first infiltrate an employee's PC and then penetrate the ICS network through servers accessible from both networks. Once inside the ICS network, attackers attempt to compromise ICS devices, with controllers and PLCs being typical targets [2], [3]. Because IT technologies are now integrated into ICS, it is crucial to consider IT attack risks in ICS environments in addition to ICS-specific attacks. To assess the risk that newly published attack tools and techniques pose to an organization, it is effective to collect information from attack tools and from reports published by security vendors. This necessitates a process of curation. However, a vast amount of information on attack tools and reports is published daily. Understanding all of it requires significant resources, including costs and skills based on experience. Therefore, this study examines efficient curation methods using Large Language Models (LLMs) from both IT and ICS perspectives, detailed in Sections 3 and 4, respectively.
This research primarily aims to automate the collection of security information for deployed products, specifically focusing on the automatic collection of affected port numbers, protocols, and product names. This approach is considered useful for monitoring cyber attacks and mitigating their impact, particularly in ICS environments where long life cycles and high availability are prioritized, making early updates challenging.

1.1 Necessity of Automating Curation

To defend control systems from cyber attacks, it is crucial to understand the vulnerabilities and malware that attackers exploit, necessitating a process of curation. The manner of writing vulnerability reports varies by vendor, making it challenging to identify exploited protocol names, port numbers, and product names by simple mechanical processing or manually. Analyzing information sources to extract protocols, port numbers, and targeted products exploited in attacks would enable personnel to review firewall settings and assess the risks of products used in each organization. This study proposes a method using LLMs to analyze vulnerability information and reports, automatically extracting information useful for enhancing defense and monitoring. By leveraging LLMs, we can streamline the curation process, allowing for more efficient identification of potential security threats and facilitating proactive cybersecurity measures.

1.2 Characteristics of Attacks Targeting ICS

In recent years, there has been a trend for control systems to connect with IT networks and the Internet. Many IT networks are composed of Windows systems. In addition, IT and Operational Technology (OT) are converging within ICS environments, and Windows systems are becoming common there as well [4]. Consequently, attacks targeting ICS often aim at Windows systems first and then spread the infection to compromise the ICS environment. According to the “V. INDUSTRIAL CONTROL SYSTEMS AND CRITICAL INFRASTRUCTURE INCIDENTS” section in [5], malware such as Stuxnet, Duqu, Shamoon, Havex, BlackEnergy, Industroyer, Triton, VPNFilter, WannaCry, and NotPetya is considered historically significant to ICS. Our investigation into the campaigns utilizing this malware revealed that nine of the ten attack campaigns exploited Windows systems; the exception was VPNFilter, which targets IoT devices [6], [7]. Moreover, attacks targeting ICS also aim at vulnerabilities in ICS products in addition to Windows systems. For instance, Stuxnet exploited vulnerabilities in Siemens products [8], and Triton exploited Schneider Electric products [9]. However, attacks on ICS are not limited to exploiting product vulnerabilities; they also exploit the specifications of industrial protocols or ICS products. Industroyer, for example, has capabilities that abuse the legitimate IEC104 protocol [10]. Thus, responses to attacks targeting ICS must encompass a broad perspective, from exploitation of Windows systems to the vulnerabilities and specifications of ICS products.

2. Previous Research

Takeru Naito et al. conducted research that uses IT asset information and vulnerability data as inputs and ChatGPT to output potential attack routes. While similar in inputs and approach to this study, their research focuses on identifying attack routes and threats, whereas the present study emphasizes gathering information for defense [11].

Sam Hays et al. have proposed the idea of using ChatGPT to support actions during incident responses. They input threat scenarios and obtain general overviews of responses, but unlike this study, their goal is not to derive defensive measures against specific vulnerabilities or malware [12].

Dipayan Saha et al. have published a paper investigating the use of LLMs in the Security Operations Center. It cites examples such as assessing and amending source code primarily for developers, with fewer case studies aimed at operations [13].

Yulia Cherdantseva et al. have published a review paper on risk assessment methods for ICS. The paper introduces studies that utilize attack trees, penetration testing, and attack scenarios for assessments, as well as methods for scoring risks and elucidating attack paths. Some studies introduce the development of measures to reduce risks specific to each ICS. In contrast, the objective of the proposed method is to identify targeted ports and products [14].

Uchenna Daniel Ani et al. proposed a method to calculate the impact of vulnerabilities based on the threat level of vulnerabilities, the likelihood of attacks, and the dependencies among system components [15]. Attiq Ur-Rehman et al. have proposed a CVSS framework suitable for ICS. While the threat level of vulnerabilities and the CVSS are crucial for triaging vulnerability information, they do not determine which ports and products are exploited. This study complements those on risk assessment by providing a perspective on measures [16].

3. Curation of Information on Cyber Attacks Targeting IT Systems

The purpose of this proposal is to defend IT systems and control systems from cyber attacks by extracting exploited port numbers and targeted products from information sources. This study examines which information sources are necessary for defending ICS and how to curate them. As noted earlier, attacks targeting ICS also tend to target Windows systems in addition to ICS-specific products. Therefore, this research first analyzes what information is effective against IT-related attacks, particularly those targeting Windows OS.

3.1 Evaluation of Information Sources on IT Attacks

When IT-related vulnerabilities are disclosed, Proof of Concept (PoC) code is sometimes published on the Internet relatively quickly. These vulnerabilities may then be exploited in actual attacks within just a few days [17]. With numerous vulnerabilities being disclosed, it is challenging to analyze all vulnerability information and respond accordingly. Furthermore, if no PoC is publicly available on the Internet, creating exploit tools or purchasing them from the dark web incurs additional costs. Consequently, at least in IT environments, vulnerabilities without a public PoC cannot be exploited by a large number of people. Therefore, this study focuses on vulnerabilities whose PoC codes and attack techniques are disclosed on the Internet, because these are particularly at risk of exploitation.

This research focuses on public reports for vulnerabilities and PoC codes published on public repositories like GitHub [18]. In addition to identifying exploited port numbers, this research also investigates the following open-source tools. These tools have modules or signatures that conform to a format; thus, we also consider the possibility of extracting port numbers without LLMs:

・Metasploit: A penetration testing framework where attack code is written as modules.
・Nuclei: A vulnerability scanner that publishes templates relatively quickly for a wide range of vulnerabilities [19].
・Snort: A widely used IDS/IPS [20]. The Community Edition is free, but it takes 30 days to receive the latest rule set.
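
As an illustration of extracting port numbers without an LLM from a source that follows a fixed format, the following minimal sketch reads the protocol and destination port from the header of a Snort rule. The regular expression only covers simple headers of the form "action protocol source-address source-port direction destination-address destination-port (options)", and the example rule is hypothetical, written for this sketch rather than taken from the community rule set.

```python
import re

# Simple Snort rule header:
#   action proto src_addr src_port (-> or <>) dst_addr dst_port (options)
# Assumption: address and port fields contain no spaces (no bracketed lists with spaces).
HEADER = re.compile(
    r"^\s*(?P<action>\w+)\s+(?P<proto>\w+)\s+\S+\s+(?P<src_port>\S+)"
    r"\s+(?:->|<>)\s+\S+\s+(?P<dst_port>\S+)\s*\("
)


def snort_destination(rule: str):
    """Return (protocol, destination port field) from a rule header, or None."""
    match = HEADER.match(rule)
    if match is None:
        return None
    return match.group("proto").lower(), match.group("dst_port")


# Hypothetical rule used only to exercise the parser.
example_rule = (
    'alert tcp $EXTERNAL_NET any -> $HOME_NET 445 '
    '(msg:"EXAMPLE SMB probe"; sid:1000001; rev:1;)'
)
print(snort_destination(example_rule))
```

The destination port field may also be a variable or a range in real rules, so a practical version would need to resolve those cases before the value can be used in a firewall policy.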

This research investigates which sources of information are effective in extracting port numbers for defending against and monitoring IT attacks. The investigation focuses on Windows systems and specifically targets vulnerabilities that allow attacks to be carried out easily over the network without user interaction. We referred to the Known Exploited Vulnerabilities catalog published by CISA on July 27, 2023 [21], and the CVSS Version 3.1 specification document provided by FIRST [22]. We extracted the vulnerabilities that met all of the following criteria, resulting in a total of 25 vulnerabilities. The details of these 25 vulnerabilities are documented in Table A-1.

・In Known Exploited Vulnerabilities, the 'vendorProject' is 'Microsoft'
・In CVSS 3.1, the Attack Vector is 'NETWORK'
・In CVSS 3.1, User Interaction is 'NONE'
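
The three criteria can be expressed as a small filter. This is a minimal sketch, assuming each catalog entry is a dictionary with a 'vendorProject' key and that the CVSS 3.1 vector string for the same CVE has been looked up separately; the function names are hypothetical.

```python
def parse_cvss_vector(vector: str) -> dict:
    """Split a CVSS v3.1 vector string into a metric -> value mapping."""
    parts = vector.split("/")
    # parts[0] is the "CVSS:3.1" prefix; the rest are "METRIC:VALUE" pairs.
    return dict(part.split(":", 1) for part in parts[1:])


def meets_criteria(kev_entry: dict, cvss_vector: str) -> bool:
    """True if the entry is a Microsoft vulnerability with AV:N and UI:N."""
    metrics = parse_cvss_vector(cvss_vector)
    return (
        kev_entry.get("vendorProject") == "Microsoft"
        and metrics.get("AV") == "N"   # Attack Vector: NETWORK
        and metrics.get("UI") == "N"   # User Interaction: NONE
    )


# Placeholder entry and vector for illustration only.
entry = {"cveID": "CVE-0000-0000", "vendorProject": "Microsoft"}
print(meets_criteria(entry, "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
```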

We evaluate the effectiveness of the information sources used as input for this study from two aspects: comprehensiveness, that is, how many of the targeted vulnerabilities each source covers, and immediacy, that is, how soon each source publishes information after a vulnerability is made public.

3.1.1 Comprehensiveness of Information Sources on IT Attacks

This study investigates whether the following freely available and widely utilized information sources are suitable for researching vulnerabilities:

・Microsoft Security Advisory: We search Microsoft's Security Advisory [23].
・Vendor Report: We use the Google search engine to determine whether security vendors have published reports explaining the target vulnerabilities. We count only reports that explain the vulnerability itself. Simple descriptions, such as the summaries of the vulnerability and affected systems found in the National Vulnerability Database (NVD), are excluded.
・GitHub: We use the Google search engine, enclosing the CVE number in double quotes and adding “site:github.com” as a keyword, to see whether any relevant information appears.
・Metasploit Module: We use a program capable of searching modules [24].
・Exploit-db: We conduct searches by enclosing the CVE number in double quotes and adding “site:exploit-db.com” as a keyword to see whether any relevant information appears.
・Nuclei: We search for the relevant CVE numbers in the nuclei templates [25].
・Snort: We search for the relevant CVE numbers in the Snort community rules [26].
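
The site-restricted searches used for GitHub and Exploit-db can be assembled as follows. This is only a sketch of how the query strings are formed; the helper names and the search URL prefix are assumptions, and the sketch does not perform the search itself.

```python
from urllib.parse import quote_plus


def site_query(cve_id: str, site: str) -> str:
    """Quote the CVE number and restrict the search to one site."""
    return f'"{cve_id}" site:{site}'


def search_url(query: str) -> str:
    """URL-encode the query for use in a search URL (assumed URL prefix)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


for site in ("github.com", "exploit-db.com"):
    query = site_query("CVE-2021-38647", site)
    print(query, "->", search_url(query))
```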

The results were as follows:

・Microsoft Security Advisory: 25/25 (100%)
・Vendor Report: 24/25 (96%)
・GitHub: 23/25 (92%)
・Metasploit Module: 16/25 (64%)
・Exploit-db: 7/25 (28%)
・Nuclei: 3/25 (12%)
・Snort: 7/25 (28%)

Considering that the coverage rates for Exploit-db, Nuclei, and Snort were below 50%, we decided to exclude these sources from further investigations in this study.
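
The coverage rates and the 50% cut-off can be reproduced from the counts above with a few lines. The counts are those reported in this section; the variable names are illustrative.

```python
TOTAL = 25  # number of vulnerabilities investigated

found = {
    "Microsoft Security Advisory": 25,
    "Vendor Report": 24,
    "GitHub": 23,
    "Metasploit Module": 16,
    "Exploit-db": 7,
    "Nuclei": 3,
    "Snort": 7,
}

# Coverage in percent for each source.
coverage = {source: 100 * count / TOTAL for source, count in found.items()}

# Sources with coverage below 50% are excluded from further investigation.
kept = [source for source, rate in coverage.items() if rate >= 50]
excluded = [source for source, rate in coverage.items() if rate < 50]

print(kept)
print(excluded)
```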

3.1.2 Immediacy of Information Sources on IT Attacks

This study examines how quickly various information sources publish their findings on publicly disclosed vulnerabilities. The sources investigated include:

・Vulnerability Publish Date (Microsoft): The date Microsoft published the vulnerability. The URL of each vulnerability's web page consists of “https://msrc.microsoft.com/update-guide/en-US/vulnerability/” followed by the CVE number (example: https://msrc.microsoft.com/update-guide/en-US/vulnerability/CVE-2021-38647).
・Vendor Report: The publish date stated in the vendor report. For reports without a stated publish date, we use the earliest observed date in the Internet Archive [27].
・GitHub: The date detailed information was posted, primarily the date of the initial commit obtained from the commit history. If the useful information was not in the initial commit but was added in a later update, we use the date of that update instead.
・Metasploit: The creation date of the module, obtained from the website of Rapid7 [28], the organization behind Metasploit development.

The results are shown in Table 1. As all investigated vulnerabilities relate to Microsoft products, the Vulnerability Publish Date (Microsoft) served as the reference point. The numbers in the table represent the number of days it took for each source to publish information relative to this reference date. “None” indicates no information was available, and “No information” indicates that no publication date could be found. The vendor report was the fastest to publish, followed by GitHub and then Metasploit. The average times were 39 days, 57 days, and 108 days, respectively, with median times of 11 days, 18 days, and 61 days, respectively.
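
The delay calculation can be sketched as follows: each delay is the number of days between the Microsoft publish date and the date a source published, and the mean and median are taken over the sources that have a date. The dates in the usage lines are made up for illustration and are not values from Table 1; the function names are hypothetical.

```python
from datetime import date
from statistics import mean, median

MSRC_BASE = "https://msrc.microsoft.com/update-guide/en-US/vulnerability/"


def msrc_url(cve_id: str) -> str:
    """Build the Microsoft page URL whose publish date is the reference point."""
    return MSRC_BASE + cve_id


def delay_days(reference: date, published):
    """Days from the reference date to publication; None if no date is known."""
    if published is None:
        return None
    return (published - reference).days


def summarize(delays):
    """Mean and median over the delays that are available."""
    known = [d for d in delays if d is not None]
    return mean(known), median(known)


# Illustrative dates only.
reference = date(2021, 1, 1)
delays = [
    delay_days(reference, date(2021, 1, 12)),
    delay_days(reference, None),
    delay_days(reference, date(2021, 2, 1)),
]
print(delays, summarize(delays))
```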

Table 1 Immediacy of information sources.

3.2 Extraction of Information on IT Attacks

Our investigation thus far has determined that Microsoft Security Advisory, vendor reports, and GitHub are effective sources in terms of comprehensiveness and immediacy. Using OpenAI's LLM GPT-4-0613 (hereinafter referred to as GPT-4), we explore and verify the possibility of extracting port numbers from these sources. GitHub repositories commonly provide PoC code along with Readme files. According to [29], summarizing code in natural language can facilitate developers' understanding. Some repositories contain Readme files that include concise attack procedures and explanations. We hypothesized that analyzing Readme files directly with an LLM might be simpler and more accurate than analyzing code. To test this hypothesis, we analyze Readme files and PoC source code from GitHub, vendor reports, and Microsoft Security Advisory pages using GPT-4. The information extraction is conducted under the following conditions:

・We target the earliest published source for each vulnerability. Each target URL is listed in Table A-1.
・We automate the acquisition of the sources analyzed by GPT-4, except for GitHub source code, as described in Section 3.1.1.
・Although the acquisition of source code from GitHub should also be automated for practical use, this research first aims to determine its effectiveness as a source. Therefore, we manually review the content and retrieve the PoC files that contain port number information. In this evaluation, the port information for each source was contained within a single file, but GitHub source code can consist of multiple files. In such cases, it is necessary to retrieve all files that contain port number information.

3.3 Prompts for Extracting IT Information

This subsection describes how the prompts are designed. First, we design prompts that solicit the port numbers exploited by a vulnerability from each information source. Next, we add the CVE ID and the Short Description from the Known Exploited Vulnerabilities catalog published by CISA [21]. Finally, we append the content retrieved from each of the following sources. Specific details of the prompts are included in Appendix A.1, where the parts written in bold should be replaced by the corresponding vulnerability information.

・GitHub Readme: We retrieve the content from URLs that provide access to the raw data published on GitHub. The specific URLs are listed in Table A-1.
・GitHub Source Code: We also retrieve content from URLs that provide access to raw data on GitHub, detailed in Table A-1.
・Vendor Report: We extract the main content of the Web page using Beautifulsoup [30] as the HTML parser and PyPDF2 [31] as the PDF parser.
・Microsoft Security Advisory: We extract the main body of the Web page using Beautifulsoup as the HTML parser.
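
To illustrate the kind of text extraction applied to HTML pages, the sketch below collects visible text while skipping scripts, styles, and navigation elements. It deliberately uses Python's standard-library html.parser as a stand-in so that it runs without third-party packages; it is not the Beautifulsoup-based extraction used in this study, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text outside of tags that usually hold no report content."""

    SKIP = {"script", "style", "nav", "header", "footer"}  # assumed tag set

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._pieces.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._pieces)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><nav>Menu</nav><p>The exploit targets TCP port 445.</p></body></html>"
)
print(html_to_text(sample))
```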

The instructions in the prompts are outlined as follows:

・Ask for a clear answer for both TCP and UDP.
・Ask for the port numbers used to trigger the attack.
・Include in the query the associations between protocol names commonly exploited in Windows attacks, such as SMB or Domain Controller protocols, and their corresponding port numbers.
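
The outline above can be turned into a prompt template along the following lines. The wording is a hypothetical illustration, not the prompt given in Appendix A.1, and the protocol-to-port hints are limited to associations mentioned in this paper (SMB, Kerberos, HTTP, HTTPS).

```python
# Protocol-to-port associations included as hints (illustrative subset).
PORT_HINTS = {
    "SMB": "TCP/445",
    "Kerberos": "TCP/88",
    "HTTP": "TCP/80",
    "HTTPS": "TCP/443",
}


def build_prompt(cve_id: str, short_description: str, source_text: str) -> str:
    """Assemble a port-extraction prompt from the vulnerability data and the source text."""
    hints = "\n".join(f"- {name}: {port}" for name, port in PORT_HINTS.items())
    return (
        f"Vulnerability: {cve_id}\n"
        f"Description: {short_description}\n\n"
        "From the source text below, state which TCP port numbers and which UDP "
        "port numbers are used to trigger the attack. Answer for both TCP and UDP, "
        "and write 'None' for a protocol that is not used.\n\n"
        f"Protocol names and their usual port numbers in Windows environments:\n{hints}\n\n"
        f"Source text:\n{source_text}"
    )


# Placeholder values for illustration only.
print(build_prompt("CVE-0000-0000", "placeholder description", "placeholder readme text"))
```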

3.4 Evaluation of Information Extraction for IT Attacks

The criteria for evaluating the extraction of information relevant to IT-related attacks are as follows:

・For attacks exploiting multiple vulnerabilities, we judge the response correct if it identifies the port number of the protocol that must be defended to stop the initial attack.
・When multiple protocols are exploited, we judge the response correct if the key protocol is identified. The key protocol is the protocol that is the fundamental cause of the vulnerability. For example, CVE-2021-42287 relates to a vulnerability in Active Directory, specifically exploiting Kerberos authentication. Therefore, the key protocol is Kerberos, that is, TCP/88.
・If a vendor report or Microsoft Security Advisory explicitly mentions a port number exploited in an attack, or if it can be inferred, that port number is considered the correct response.
・If no PoC code can be found on Internet sources, including GitHub, the case is judged a True Negative, because the potential for exploitation is relatively low when no PoC code is published.
・Cases where PoC code is published but no information is available on GitHub are categorized as False Negatives.
・Text shown in images or videos and audio from videos are excluded from the evaluation.

The results are presented in Table 3, with the highest correct response rate observed in GitHub Readme responses at 60.0%. Only responses that correctly and completely answered the port number were counted as correct, denoted as “o”, while incorrect responses were marked as “x”, with the correct answers noted in parentheses. For CVE-2021-34523, CVE-2021-34473, and CVE-2021-31207, the ports to be defended according to [32] are HTTP (TCP/80) and HTTPS (TCP/443). However, [33] states that only HTTPS is necessary, hence both were considered correct. For CVE-2022-26925, since no PoC code was found online, “None” was considered the correct answer. However, as Microsoft Security Advisory suggests an impact on NTLM (TCP/445) [34], responses from Microsoft Security Advisory were also considered correct.

Our findings show that GitHub Readme files yielded the highest correct response rate. The Readme-based technique, which involved conducting Google searches for all vulnerabilities and extracting port numbers, completed in 2 minutes and 45 seconds. This was significantly faster than manual reading and analysis by humans. The technique for analyzing GitHub source code requires a person to identify the relevant source code files for extracting port numbers. In contrast, a GitHub repository typically contains only one Readme file, which offers the advantage of not needing to search for the appropriate file. Additionally, the correct response rate for vendor reports was 40%, and for Microsoft Security Advisory it was 28%. Since reports and advisories are predominantly written in natural language, we also attempted to apply prompts based on those described in Section 4. However, this approach did not improve the correct response rate.

3.5 Discussion on Information Extraction for IT Attacks

Vendor reports are often published on the Internet shortly after vulnerabilities are disclosed, primarily focusing on vulnerability analysis. Therefore, unlike code released on GitHub that includes PoC (Proof of Concept) designed for evaluation purposes or explanations of attack procedures, these reports rarely mention the port numbers targeted in actual attacks. Vendor reports also often contain product-specific terminology. For instance, CVE-2020-0688 is described as a vulnerability in the Microsoft Exchange Control Panel (ECP) according to [35], with no specific protocols or port numbers mentioned. While it might be possible to infer exploited port numbers from undisclosed information, this would be challenging without deep security expertise. LLMs are designed not to make assumptions as a measure against hallucination. Specifically, as a countermeasure against closed domain hallucinations [36], they are trained not to generate information that is not referenced. Therefore, they may not respond to items that are not explicitly stated.

The distinction between HTTP (TCP/80) and HTTPS (TCP/443) proved challenging. Except for vulnerabilities directly attributable to the protocols' implementations, whether HTTP or HTTPS is affected often depends on how the target product is operated, and descriptions can vary by information source. For example, as mentioned above, [32] suggests both HTTP and HTTPS should be secured, while [33] states only HTTPS is necessary. Vulnerabilities related to OWA (Outlook Web Access) are described as affecting HTTPS, possibly because Microsoft does not recommend running OWA over HTTP [37]. Microsoft's site likewise indicates HTTPS as the protocol exploited in the initial attack [38]. However, we also conducted an evaluation on a test bed and confirmed that attacks can indeed succeed over TCP/80. Hence, even if a source attributes a vulnerability to only one of HTTP or HTTPS, treating both as affected is unlikely to pose a significant issue. Because this evaluation focused on assessing LLMs, we distinguished between HTTP and HTTPS, resulting in a correct response rate of 60.0%. From a more practical perspective, however, if either HTTP or HTTPS is exploited in an attack, it would be safer for an organization to assume that both are affected. If answering either HTTP or HTTPS is accepted as correct, the correct response rate increases to 80.0%. Of course, it is not always acceptable to treat them as equivalent; if the vulnerability lies in the protocol implementation, distinguishing between them is necessary. This remains future work.
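
The difference between the strict judgement and the more practical one can be sketched as a scoring function with an option that treats TCP/80 and TCP/443 as one class. This is an illustrative reading of the two policies, not the evaluation script used in this study; the function names and the port notation are assumptions. If the rate is computed over all 25 vulnerabilities, 60.0% and 80.0% correspond to 15 and 20 correct responses.

```python
WEB_PORTS = {"TCP/80", "TCP/443"}


def _collapse_web(ports):
    """Replace HTTP and HTTPS ports with one shared label."""
    return {"WEB" if port in WEB_PORTS else port for port in ports}


def is_correct(answer, expected, merge_web=False):
    """Strict set equality, optionally treating HTTP and HTTPS as equivalent."""
    answer, expected = set(answer), set(expected)
    if merge_web:
        answer, expected = _collapse_web(answer), _collapse_web(expected)
    return answer == expected


def correct_rate(judgements):
    """Correct response rate in percent."""
    return 100.0 * sum(judgements) / len(judgements)


print(is_correct({"TCP/80"}, {"TCP/443"}))                  # strict judgement
print(is_correct({"TCP/80"}, {"TCP/443"}, merge_web=True))  # practical judgement
print(correct_rate([True] * 15 + [False] * 10))
```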

4. Curation of Information on Cyber Attacks Targeting ICS

In this section, we shift our focus to attacks targeting ICS, building upon the effectiveness of analyzing GitHub Readme files for IT-related attacks. We investigate whether similar methodologies apply to ICS and explore more suitable approaches.

4.1 Evaluation of Information Sources on ICS Attacks

First, we examine whether the Known Exploited Vulnerabilities catalog from CISA, which we used for IT, can also be utilized for ICS. This research focuses on products with a high market share [39]. Among these vendors, only Siemens and Schneider were mentioned once in the CISA Known Exploited Vulnerabilities catalog. Rockwell, Mitsubishi, Omron, B&R, GE, and ABB were not mentioned at all, indicating that the catalog has low comprehensiveness for ICS.

Subsequently, we assessed the effectiveness of GitHub as a resource. The ICS advisories published by CISA provide comprehensive listings of vulnerabilities for ICS products [40]. We extracted the CVE numbers listed in the 2023 ICS advisories and searched for corresponding PoC codes on GitHub via Google, but found no results. This suggests that, compared to IT, the publication of PoC codes for ICS vulnerabilities is less common, rendering GitHub an unsuitable source for extracting port numbers for ICS attacks. Moreover, attacks targeting ICS may also exploit specifications [41]. Additionally, in ICS, it is often not possible to apply patches immediately. To mitigate security risks in such cases, we propose collecting information on potentially targeted ports and using it for monitoring and access control. Because the specifications of ICS products are frequently exploited in addition to vulnerabilities, our approach focuses on identifying the exploited products and ports rather than relying solely on patch information. Accordingly, our approach for attacks on ICS involves curating information from past attack instances, extracting exploited protocols, and identifying targeted products. This method aims to automate the curation process, enabling organizations to assess whether their products can be targeted and which protocols and ports should be monitored.

4.2 Extraction of Information for ICS Attacks

To understand past incidents, we examine the malware and attack campaigns featured in [5], as mentioned in Section 1.2. We evaluate the reports listed as “external_references” in the ATT&CK ICS database [42] for each malware family. However, because the database contains no information about Shamoon, we exclude it from our evaluation. From these reports, we extract the text and concatenate all the information for each malware family. Given their length, these reports are likely to exceed the token limit that GPT-4 can analyze in a single query. Therefore, we use the Retrieval-Augmented Generation (RAG) technique to search for the necessary sentences and add them to the query for analysis. We vary the chunk size and the number of top sentences used for document search with RAG, and adopt the configuration that yields the highest correct response rate.
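
The retrieval step can be illustrated with a minimal sketch: the concatenated report text is cut into fixed-size chunks, each chunk is scored against the query, and the best Top_k chunks are joined to form the context added to the prompt. Two assumptions are made here that the text above does not specify: the chunk size is counted in characters, and relevance is scored by simple term overlap as a stand-in for whatever similarity search the RAG implementation uses.

```python
import re


def split_into_chunks(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def _terms(text):
    """Lower-cased alphanumeric terms of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_chunks(chunks, query, top_k):
    """Return the top_k chunks sharing the most terms with the query, in document order."""
    query_terms = _terms(query)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: len(query_terms & _terms(chunks[i])),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:top_k])]


def build_context(report_text, query, chunk_size=500, top_k=30):
    """Select the chunks to be added to the query."""
    chunks = split_into_chunks(report_text, chunk_size)
    return "\n".join(top_k_chunks(chunks, query, top_k))


# Made-up chunks for illustration.
chunks = [
    "the malware scans a tcp port on the controller",
    "unrelated text about the weather",
    "lateral movement uses smb on port 445",
]
print(top_k_chunks(chunks, "which tcp port", top_k=2))
```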

4.3 Prompts for Extracting Information Related to ICS

Direct internet access to ICS is rare, and breaches often occur via IT networks or through physical intrusion. Consequently, it is more critical to defend against lateral movements than against the initial intrusion paths alone. Many vendor reports focus on breaches of Windows systems, where compromised Windows systems are leveraged for lateral movements or for communicating with C2 servers, ultimately affecting ICS. Some reports also specify the names of ICS products, which can be used to determine whether they are in use within an organization. Therefore, we employ three prompts as queries, summarized below. The detailed prompts are listed in Appendices A.2, A.3, and A.4, respectively.

・Prompt 1: Inquire about the TCP and UDP port numbers and protocol names that should be defended by firewalls.
・Prompt 2: Ask what protocols are used for lateral movements and C2 communications.
・Prompt 3: Inquire about the names of ICS products targeted in the attacks.

Prompt 2 was created because of a limitation observed when using only Prompt 1: in addition to ICS attack information, the responses included port numbers and protocols exploited for lateral movements and C2 communications, making it unclear what should be blocked or monitored on which network. To address this, we add Prompt 2 to clearly identify the protocols exploited for lateral movements and C2 communications.

Moreover, Prompt 1 asks for protocol names in addition to port numbers, and Prompt 2 asks for protocol names, for the following reasons:

・Protocols used in ICS often use ports outside the well-known range (port 1024 and above), so port numbers alone are not intuitive.
・Lateral movements and C2 communications may exploit VPNs, whose port numbers vary by VPN product, hence the necessity to specify the protocol name.

4.4 Evaluation of Information Extraction for ICS Attacks

The evaluation of information extraction for ICS-related attacks considered the following criteria:

  • ・An answer is considered correct if either Prompt 1 or Prompt 2 successfully identifies ports or protocols that were exploited in an attack and need to be defended or monitored.
  • ・Notational variances (e.g., IEC 104 vs. IEC-104) are accepted as correct answers.
  • ・Duplication of protocol names and port numbers (e.g., SMB and TCP 445) is accepted as correct, provided at least one of them is correctly answered.
  • ・Since Prompt 2 aims to supplement Prompt 1, overlaps between “ports and protocols to be defended or monitored by firewalls” and “lateral movement, C2” are accepted as correct (for example, if SMB is answered for protocol name and also answered in lateral movement).
  • ・If an exploited protocol is not explicitly mentioned in the report, its absence in the answer is considered correct (e.g., Communications with onion domains such as “example.onion” are used for C2 servers, but the use of Tor is not explicitly mentioned.)
  • ・Information described or commented on in images or videos is excluded from extraction targets.
  • ・If IoT devices or network devices are explicitly identified as targets in the report, they are considered correct answers. The purpose of this proposal is to clarify the targets of past attacks, making it easier for readers to assess whether their systems could be affected. Therefore, network devices like IoT devices and VPN connections, which tend to be deployed in modern ICS environments, are also considered correct.
  • ・A response was judged correct only if all aspects were identified correctly: port numbers, protocol names, lateral movement, and C2. The correct response rate was calculated on this basis.
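The criteria above can be illustrated with a minimal sketch. It is not the evaluation code used in this study: the alias table `EQUIVALENT`, the helper names, and the decision to check only that every expected item is covered are assumptions made for illustration.

```python
import re

# Hypothetical alias table: a protocol name and its port notation count as the same item.
EQUIVALENT = {"smb": "tcp445", "rdp": "tcp3389", "modbustcp": "tcp502"}


def normalize(item: str) -> str:
    """Lower-case and drop spaces, hyphens, slashes, and underscores, so 'IEC 104' equals 'IEC-104'."""
    key = re.sub(r"[\s\-/_]", "", item.lower())
    return EQUIVALENT.get(key, key)


def is_correct(answer: list[str], expected: list[str]) -> bool:
    """Judge one case: every expected item must be covered by the answer (assumed rule)."""
    answered = {normalize(a) for a in answer}
    return all(normalize(e) in answered for e in expected)


def correct_rate(results: list[bool]) -> float:
    """Percentage of cases judged correct, rounded to one decimal place."""
    return round(100 * sum(results) / len(results), 1)
```

With this normalization, an answer of "TCP 445" covers an expected "SMB", and "IEC-104" covers "IEC 104", mirroring the notational and duplication rules listed above.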

For the evaluation process, we varied the chunk size among 200, 500, and 1000, and the number of top-ranked items used for document search in RAG, denoted as Top_k, among 20, 30, and 40. As shown in Table 2, the best performance, 81.8% for port and protocol extraction and 72.7% for target product extraction, was achieved with a chunk size of 500 and a Top_k of 30.

Table 2 Comparison of varying chunk sizes and Top_k values.

Table 4 shows the breakdown of the information extraction results with a chunk size of 500 and a Top_k of 30.

The total time taken for information extraction on ICS-related attacks was 8 minutes and 9 seconds. The breakdown is as follows:

  • (1)Downloading and parsing the report took 2 minutes and 36 seconds.
  • (2)Segmenting the text-based report into chunks and extracting the top 50 chunks relevant to the prompt required 4 minutes and 12 seconds.
  • (3)Formulating questions for GPT-4 and receiving answers took 1 minute and 21 seconds.

4.5 Discussion on Information Extraction for ICS Attacks

For vendor reports on vulnerabilities (Subsection 3.4) and reports on actual incidents (Subsection 4.4), the method for extracting exploited port numbers achieved accuracy rates of 40.0% and 81.8%, respectively. This discrepancy can be attributed to the fact that a single report was analyzed for IT, whereas multiple reports were consolidated for ICS, providing a richer information set. Additionally, the reports used in the IT evaluation primarily focused on explaining the mechanisms of vulnerabilities, whereas those used in the ICS evaluation analyzed real incidents, emphasizing explanations of malware and threat-actor activities.

If the detected product is embedded in other products or sold under OEM agreements, the proposed method may overlook it. To prevent this, it is necessary to verify with the manufacturer or the supplier whether the products in your organization use the specified product. However, the reports examined in this study mainly describe attacks on products with clearly identified company and product names, rather than attacks targeting vulnerabilities in software PLCs or modules integrated into many products. Therefore, this method is considered effective in such cases.

As mentioned in Section 4.1, PoC exploits are believed to be published less frequently for ICS. A search for ICS-related PoCs, covering the Known Exploited Vulnerabilities catalog, GitHub, and other sources, yielded very few results. Additionally, a notable characteristic of ICS attacks is that they exploit ICS specifications. Therefore, using the publication of a PoC as a predictive indicator of attacks, as is often done in IT, is not applicable to ICS. Attacks on ICS are not as universally applicable as those on IT systems. Moreover, intrusion is not as easy as in IT, and attacks are not observed immediately after information is disclosed. Attackers often spend a long period on reconnaissance and customize their attack tools, and even after infiltrating the organization, they need a significant amount of time to successfully affect ICS [43]. Therefore, identifying whether your organization matches environments targeted in past attack cases, while not an immediate indicator, is expected to be somewhat effective for ICS. Although this approach may not excel at predicting attacks, we adopted it to use reports summarizing past attack cases as lessons learned for ICS. Nevertheless, predicting attacks remains a challenge, and we aim to investigate indicators and information on attacker trends that could predict attacks in ICS as well.

5. Utilization of the Proposed Method

One of the advantages of the proposed method is that it facilitates the comparison of your organization's environment with previous attack cases by extracting port numbers related to past Windows OS vulnerabilities and port numbers or product names exploited in ICS attack cases. Port numbers and product names do not change frequently compared to software versions and patch statuses, making them relatively easy to manage. Therefore, it is expected that many organizations will find it easy to perform these comparisons. Additionally, port numbers are simple and have little variation in notation, making automation easier. Of course, this information alone cannot prevent all attacks. The first step is to use the proposed method to compare your organization with past cases and assess the potential impact. If it is determined that your organization might be affected by an attack, a more detailed investigation should follow. The proposed method contributes to the initial step of security measures by learning from past cases.

In the context of ICS security, it is particularly important to verify that legitimate communications do not include unauthorized write operations, as specifications are more frequently exploited in ICS than in IT environments. For example, an attacker might use legitimate communication channels to alter values. In such cases, it is crucial to monitor the content of the communications, ensuring that there are no write operations to variables or tags that are typically not modified and that the values being written do not deviate significantly from usual levels. Since ICS tasks are often fixed, it is recommended to allow only specific protocols and use only certain accounts [44]. This makes it easier to detect unusual protocols or activities and attempts to access the system using accounts that are not normally in use. The following subsection describes the specific follow-up actions to take when the proposed method identifies products or port numbers corresponding to ICS attack cases.
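The write-operation check described above can be sketched as a simple rule. The sketch assumes that write events have already been decoded from ICS traffic into tag/value pairs; the `WriteEvent` type, the `BASELINE` table, and its tag names and ranges are hypothetical and would be derived from each site's normal operation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WriteEvent:
    tag: str      # variable or tag being written
    value: float  # value carried by the write command


# Hypothetical baseline: tags that are normally written, with their usual value range.
BASELINE = {
    "pump_speed_setpoint": (0.0, 1500.0),
    "valve_position": (0.0, 100.0),
}


def check_write(event: WriteEvent, baseline: dict = BASELINE) -> Optional[str]:
    """Return an alert message for a suspicious write, or None if it looks normal."""
    if event.tag not in baseline:
        return f"write to tag not normally modified: {event.tag}"
    low, high = baseline[event.tag]
    if not low <= event.value <= high:
        return f"value {event.value} outside usual range [{low}, {high}] for {event.tag}"
    return None
```

A fixed range is only one possible notion of "usual levels"; a site could equally compare against recent history of each tag.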

5.1 Usage Example

The information output by the proposed method (Table 3, Table 4) can be used to roughly determine whether the environment is similar to past ICS attack cases and whether ports exploited by existing Windows vulnerabilities are open. Therefore, if there are matches with your organization's environment, further investigation should be pursued. The proposed method automates the first step in this process.

Specific usage examples of the proposed method are as follows:

Table 3 Result of Curation for IT.
Table 4 Result of Curation for ICS.
  • ・It is assumed that the open port numbers and the names of the ICS products being used are known as part of asset management. Additionally, this method should be executed periodically to extract information from new vulnerabilities and reports.
  • ・As indicated by attacks on IT environments, Windows OS is prevalent in IT environments. Using Table 3, check if any ports exploited by Windows OS vulnerabilities are open. If they are open to the internet, there is a risk of initial intrusion. Even if they are not open to the internet, if they are open at the network boundary, it suggests the possibility of lateral movement across the network exploiting those ports.
  • ・Regarding attacks on ICS, use Table 4 to check if any ports used in ICS are open or if any ICS products are being used.
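The comparison in the steps above can be sketched as follows. The inventory and curated entries are illustrative placeholders, not the contents of Table 3 or Table 4, and the function name `find_matches` is hypothetical.

```python
def find_matches(inventory: dict, curated: dict) -> dict:
    """Return open ports and deployed products that also appear in curated attack information."""
    port_hits = sorted(set(inventory["open_ports"]) & set(curated["ports"]))
    # Naive case-insensitive substring match; product-name variations would need more care.
    product_hits = sorted(
        product
        for product in inventory["products"]
        if any(target.lower() in product.lower() for target in curated["targets"])
    )
    return {"ports": port_hits, "products": product_hits}


# Hypothetical asset inventory maintained as part of asset management.
inventory = {
    "open_ports": [("TCP", 445), ("TCP", 502), ("TCP", 8080)],
    "products": ["Siemens S7-1200 CPU 1214C", "Generic HMI panel"],
}
# Hypothetical curated output (ports and target products extracted from reports).
curated = {
    "ports": [("TCP", 445), ("TCP", 3389), ("TCP", 502)],
    "targets": ["Siemens S7-1200"],
}

matches = find_matches(inventory, curated)
print(matches)
```

Any non-empty result is a prompt for the more detailed investigation described in this section, not a finding in itself.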

A more specific usage example is explained using the network depicted in Figure 1, which simplifies the IEC 62443 zone model. If new vulnerabilities exploit RDP (TCP/3389) or SMB (TCP/445), verify whether the deployed Windows computers have any vulnerabilities. Additionally, if new reports indicate the exploitation of Modbus/TCP (TCP/502) or attacks targeting Siemens S7-1200, consider implementing additional measures as described below.

Fig. 1 Network for usage example.

If the relevant ports are open or if the deployed products could be potential targets within your organization, consider the following:

  • ・Evaluate whether it is truly necessary to keep the targeted ports open.
  • ・If it is necessary to keep the targeted ports open, review whether the corresponding protocols have authentication features enabled and whether the authentication credentials are hard to guess. For example, review authentication credentials for SMB or consider replacing OPC-DA with OPC-UA, which has more robust authentication and authorization features. This is important because some types of attacks require authentication credentials.
  • ・If it is necessary to keep the targeted ports open, enhance monitoring from the following perspectives:
    • (1)Monitor the targeted ports themselves. For instance, check for unusual communication during non-typical hours or large volumes of communication, which might indicate abnormal behavior. In the case of ICS, verify that there are no write commands to variables or tags that are usually not written to.
    • (2)Strengthen monitoring of hosts with open targeted ports or those presumed to be affected by vulnerabilities. Ensure that there is no communication with unusual computers or with domains and IP addresses listed as indicators.
  • ・If patches have been released for products mentioned in the reports, consider applying the patches.
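The traffic-oriented monitoring perspectives listed above can be sketched as a rule over flow records. The `Flow` fields, the working-hours window, and the byte threshold are all assumptions for illustration; real values depend on each site's normal traffic.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Flow:
    timestamp: datetime
    src: str
    dst: str
    dst_port: int
    bytes_sent: int


def review_flow(flow, watched_ports, known_peers, indicators,
                work_hours=(8, 18), max_bytes=10_000_000):
    """Return a list of findings for one flow; an empty list means nothing unusual."""
    findings = []
    if flow.dst in indicators:
        findings.append("destination listed as an indicator")
    if flow.dst_port in watched_ports:
        if not work_hours[0] <= flow.timestamp.hour < work_hours[1]:
            findings.append("traffic on a watched port outside usual hours")
        if flow.bytes_sent > max_bytes:
            findings.append("unusually large transfer on a watched port")
        if flow.src not in known_peers:
            findings.append("unfamiliar host talking to a watched port")
    return findings
```

Here `watched_ports` would hold the ports extracted by the proposed method that must remain open, and `indicators` the domains or IP addresses listed in the reports.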

6. Conclusion

For IT systems, analyzing exploit code, such as that found on GitHub, with an LLM allowed for relatively quick analysis after vulnerability disclosure, and additionally analyzing README files enabled the extraction of exploited port and protocol information with a comparatively high accuracy rate. Although the ICS analysis lacked this immediacy, analyzing multiple reports resulted in a higher accuracy rate than for IT. Some vulnerabilities are exploited by having the target connect to the attacker's server after the initial exploit, which makes it challenging to extract the inbound port numbers that should be defended. For instance, CVE-2023-23397 involves sending an initial attack email, and the attack is completed when the recipient sends authentication information to the attacker's server. In this evaluation, only the exploited SMB (TCP/445) could be extracted. Therefore, defense and monitoring might also need to consider the direction of communications, which the proposed method does not yet adequately address. Developing a method to accurately indicate communication direction will be part of future work. The following points will also be addressed as future work:

  • ・The difficulty of automatic detection due to variations in product names.
  • ・The insufficient differentiation between HTTP and HTTPS.
  • ・The establishment of methods for predicting indicators of attacks on ICS.
  • ・The use of LLMs for further steps, such as proposing more specific defense methods and efficiently equipping AI with specialized knowledge.
References

Appendix

A.1 Prompt for IT

Read github README file and answer the following questions. What is the port number should be blocked from cyber attacks that abuse vulnerabilities.

The following are important to read vulnerability description and github README file:

- Carefully read around keywords “Usage” and “Port” because these relate to attack ports to be blocked.

- Blocking direction is very important, so carefully read the inbound and outbound port to protect our system.
  - “Inbound” means direction from thread-actor to victim.
  - “Outbound” means direction from victim to thread-actor.

- If you can not determine the direction to block from given documents, you should answer them as an inbound port to block.

- SMB (Server Message Block) and NTLM (NT LAN Manager) authentication strongly relate to TCP port 445.

- Send request with “http” or “https” strongly relate to TCP port 80 or 443.

- Active Directory server could be described as DC or Domain Controller, relate to TCP port 88.

- RPC (Remote Procedure Call) relate to TCP port 135 and 139. RPC could to be abused. DCOM also use RPC.

- Sometimes multiple CVE ID (ex. CVE-2020-1234) are given. In that case, focus on only designated CVE ID.

- If there is URL and access to image file such as jpg, png or gif, you should ignore them.

- If there are reference or resource section, you should ignore them.

- You should ignore high port from 1025 to 65535 as a ports to be blocked.

- You should ignore port number that is abused as post exploitation.

ONLY answer related port number with JSON format like the following:{”TCP”:”[80, 443, 3389]”, “UDP”:”[]”}. If you can not access to the URL or no information is given, you should leave them blank like the following: {{”TCP”:”[]”, “UDP”:”[]”}}.

The following are CVE ID and vulnerability description.

CVE ID

VULNERABILITY DESCRIPTION

The following are github README contents.

GITHUB README CONTENTS

A.2 Prompt for ICS (Prompt1)

The CONTENTS could contain ICS related TCP port, UDP port number and protocol name. These port and protocol should be blocked from cyber attacks. Carefully tell difference between TCP and UDP and answer them. If no information about TCP or UDP are given, you should answer port number as TCP. Carefully answer protocol name. Do not answer hacking tool name. ONLY answer that port number and protocol name with JSON format like the following: {”Port”:{”TCP”:[“XXX”], “UDP”:[], “Protocol”:[“YYY”]}}. If not relate to cyber attacks, leave them blank like {”Port”:{”TCP”:[], “UDP”:[], “Protocol”:[]}}.

--- CONTENTS ---

CONTENTS

A.3 Prompt for ICS (Prompt2)

If the CONTENTS contain SMB, RDP, RPC, etc. for lateral movement (internal movement), answer the protocol name. If the CONTENTS contain VPN, Tor HTTP, HTTPS, etc. for C2 (Command and Control) access, answer the protocol name. Carefully answer protocol name. Do not answer hacking tool name. ONLY answer protocol name with JSON format like the following:{”lateral_movement”:[“SMB”, “RPC”], “C2”:[“HTTP”,”HTTPS”]}. If not contain the protocol name, answer only {”lateral_movement”:[], “C2”:[]}.

--- CONTENTS ---

CONTENTS

A.4 Prompt for ICS (Prompt3)

Does the CONTENTS contain ICS manufacturer product as the attack target? If yes, which ICS vendor product likely relate to? ONLY answer vendor product name with JSON format like the following:{”Target”:[“XXX”, “YYY”]}. If not relate to ICS vendor product or no specific vendor product name, leave them blank like {”Target”:”[]”}.'

--- CONTENTS ---

CONTENTS
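The prompts above request answers in JSON. A minimal sketch of how such replies might be parsed is given below; it assumes the reply follows the example format in the prompt, and the helper names are hypothetical. It tolerates typographic quotation marks and list values returned as strings (as in the A.1 and A.4 examples).

```python
import json


def _unquote_lists(value):
    """Recursively turn string-encoded lists such as "[80, 443]" into real lists."""
    if isinstance(value, dict):
        return {key: _unquote_lists(item) for key, item in value.items()}
    if isinstance(value, str) and value.startswith("[") and value.endswith("]"):
        return json.loads(value)
    return value


def parse_answer(reply: str) -> dict:
    """Parse a JSON reply after replacing typographic quotes with straight quotes."""
    cleaned = reply.strip().replace("“", '"').replace("”", '"')
    return _unquote_lists(json.loads(cleaned))


# Reply shaped like the example in A.1 (typographic quotes, lists given as strings).
it_answer = parse_answer('{”TCP”:”[80, 443, 3389]”, “UDP”:”[]”}')
# Reply shaped like the example in A.2 (hypothetical values).
ics_answer = parse_answer('{"Port":{"TCP":["502"], "UDP":[], "Protocol":["Modbus"]}}')
print(it_answer, ics_answer)
```

A reply that is not valid JSON raises `json.JSONDecodeError`, which a caller could treat as an empty answer or as a signal to retry.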

A.5 Reference Page for IT

A.6 Reference Page for ICS

Table A.1 Reference page for IT.
 
Table A.2 Reference Page for ICS.
Wataru Matsuda
w.matsuda.506@stn.nitech.ac.jp

Wataru Matsuda joined NTT WEST, Ltd. in 2006. Now, he engages in research and analysis in NTT Social Informatics Laboratories. He is also currently a Ph.D. student at the Department of Architecture, Civil Engineering and Industrial Management Engineering, Nagoya Institute of Technology, Japan.

Mariko Fujimoto
mariko.f@csirt-tc.org

Mariko Fujimoto is a security analyst at NEC Solution Innovators, Ltd., working on security diagnosis, penetration tests, etc. She is a concurrent Project Researcher at Toyo University and is engaged in research on the cyber security of Active Directory and education. She is also a Ph.D. at Nagoya Institute of Technology and is engaged in Industrial Control Systems cyber security.

Takuho Mitsunaga
takuho.mitsunaga@iniad.org

Takuho Mitsunaga is an Associate Professor at Toyo University. He is also a senior fellow at The Tokyo Foundation for Policy Research and a security expert at Information-technology Promotion Agency in Japan. After completing his degree at Graduate School of Informatics, Kyoto University, he worked at the front line of incident handling and penetration testing at security organizations.

Kenji Watanabe
watanabe.kenji@nitech.ac.jp

Kenji Watanabe is a professor at the Nagoya Institute of Technology, specializing in risk management, business continuity (BCM), and critical infrastructure protection (CIP). With nearly 20 years of experience in financial business and risk management, he has worked at Mizuho Bank, PwC, and IBM. He serves on governmental committees, including the Critical Infrastructure Protection Council, and leads Japan's delegation for ISO/TC292. Notably, he led the JICA-sponsored Area-BCM project in Thailand. He holds a PhD from Waseda University and an MBA from Southern Methodist University.

Received: May 13, 2024
Accepted: August 27, 2024
