To follow along with this analysis, you can find the sample on your favorite malware repository by its SHA256 hash: 6ad9458d9b42b010846a23d3defc7d770ad91ce5f720581c27ab96aa95edef4b.
This is a live sample found in the wild. Please be careful and only unzip the above files in an isolated environment.
Background
I came across a highly intriguing incident involving the deliberate targeting of embassies by an Advanced Persistent Threat (APT). APTs are typically considered state-sponsored due to their extensive resources and ability to leverage new techniques. The incident is related to the notorious APT29 group, also known as NOBELIUM or CozyBear. After reading a collection of publicly available blogs, I felt that the existing research on APT29 was missing detail. I took it upon myself to conduct additional research into the individual components of the sample and identify additional indicators of compromise (IOCs). My goal is to provide an in-depth analysis of each component of the attack chain and help identify IOCs for future detection builds. I look forward to bringing you along on this journey!
What is APT29?
NOBELIUM, or APT29, is thought to be linked to the SVR (the Foreign Intelligence Service of the Russian Federation), as investigators have reported. APT29 also has a long reported history of targeting governmental organizations, non-governmental organizations, and many others. Some of its notable operations include the SolarWinds supply chain attack and Operation Ghost. APT29 is known for using new exploits and unique malware for each campaign, which makes the group interesting to study.
What is HTML smuggling?
Microsoft Threat Intelligence Center (MSTIC) has disclosed a new evasive malware delivery technique that leverages HTML5 and JS. This technique is described as HTML smuggling.
HTML smuggling gets its name from the fact that the malware developers smuggle, or obfuscate, malicious JavaScript-based code within an HTML email attachment. The attackers then deliver their malware in a highly targeted spear-phishing attack.
To read more about HTML smuggling and the most commonly reported incidents that use it to kick off a malware infection chain, you can look at the very nice blog post at MalwareBytes, which details the technique and the best security practices to mitigate HTML smuggling attacks within your environment.
Technical In-Depth Analysis of the Infection Chain
The infection chain starts with a weaponized disk image file (covid.iso) that embeds a malicious HTML application (HTA) file. The HTA contains malicious JS code, which we'll look at later. The disk image is typically delivered as an attachment in a spear-phishing campaign.
Initial infection: Malicious HTA
Mounting the attached ISO disk image exposes an HTA file named "Covid.HTA". When this file is executed, the malicious JS code embedded within it runs.
We first notice that the program creates a WSH (Windows Script Host) object called "WScript.Shell", which allows it to run commands and programs and to manipulate the system registry. We can see that the program uses this object to write two entries of type "REG_SZ" under the registry key "HKEY_CURRENT_USER\SOFTWARE\JavaSoft". One of these entries is a simple PowerShell loader, which loads the encoded shellcode written to the other registry entry to set up the next stage in the infection chain.
Moving forward, we see that it gets the contents of a hidden HTML element with the ID "p1". This is just a block of base64-encoded Unicode text, which is written to "HKEY_CURRENT_USER\SOFTWARE\JavaSoft\Ver".
res = document.getElementById("p1").innerHTML;
a.RegWrite("HKEY_CURRENT_USER\\SOFTWARE\\JavaSoft\\Ver", res, "REG_SZ");
This is the raw, still-encoded version of the base64 text described above.
res = document.getElementById("p2").innerHTML;
a.RegWrite("HKEY_CURRENT_USER\\SOFTWARE\\JavaSoft\\Ver2", res, "REG_SZ");
As we continue to analyze the sample, we can see that the same thing happens for the other hidden element with the ID "p2". It gets the contents of the element and then writes it to HKEY_CURRENT_USER\SOFTWARE\JavaSoft\Ver2.
As we'll see later this is just an encoded shellcode program that will prep for the next stage of the malware infection chain.
Lastly, it builds and executes the following command to execute and load the shellcode written to Ver2 through the PowerShell loader written to Ver.
With a little bit of text formatting and spacing, we can clearly see that the decoded PowerShell just allocates virtual memory, writes the decoded shellcode into it, and then passes execution to it.
Another thing to note is that at the very beginning, right after the program decodes the shellcode, it removes the registry entries containing both the shellcode and the PowerShell loader. I believe this is an anti-forensics measure, so that the program leaves no system artifacts behind for investigation.
Before we get into the details of the decoded shellcode and how it sets up the next stage of the infection chain, it is worth noting the technique at play here. Executing malicious code purely in memory (memory-only malware), without it touching disk, by using native system tools (PowerShell in our case) is referred to as using a LOLBin.
Living off the Land binaries (LOLBins) have the advantage of leaving very few system artifacts or indicators of compromise (IOCs) behind for investigation, because the program is executed in memory without touching the disk.
Shellcode Analysis
As I mentioned before, the PowerShell loader written to the registry serves the sole purpose of loading in-memory shellcode that will stage the next phase of the malware infection chain.
With this in mind let's go ahead and pick it up from there and dive into the Shellcode itself.
I used base64 to decode the shellcode written to the registry, the same way we did before using CyberChef, then saved it to a file to later examine in IDA.
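If you would rather script the decoding step than use CyberChef, a minimal sketch of a base64 decoder in C could look like the following. The helper names b64_value and b64_decode are my own, and I am assuming the registry value uses the standard base64 alphabet with "=" padding and no embedded whitespace; adjust it if your extracted data differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not part of
   the standard alphabet (A-Z, a-z, 0-9, '+', '/'). */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Hypothetical helper: decode the NUL-terminated base64 string `in` into
   `out`. Returns the number of bytes written, or -1 if the input is
   malformed or `out_cap` is too small. */
long b64_decode(const char *in, uint8_t *out, size_t out_cap)
{
    size_t len, written = 0;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    if (len % 4 != 0) return -1;

    for (size_t i = 0; i < len; i += 4) {
        uint32_t block = 0;
        size_t pad = 0;

        for (size_t j = 0; j < 4; ++j) {
            char c = in[i + j];
            int v = 0;
            if (c == '=') {
                /* Padding may only fill the last two slots of the final group. */
                if (i + 4 != len || j < 2) return -1;
                ++pad;
            } else {
                if (pad != 0) return -1;      /* data after padding */
                v = b64_value(c);
                if (v < 0) return -1;         /* character outside alphabet */
            }
            block = (block << 6) | (uint32_t)v;
        }

        /* Each 4-character group carries 3 bytes, minus one per '='. */
        if (written + (3 - pad) > out_cap) return -1;
        out[written++] = (uint8_t)(block >> 16);
        if (pad < 2) out[written++] = (uint8_t)(block >> 8);
        if (pad < 1) out[written++] = (uint8_t)block;
    }
    return (long)written;
}
```

Calling b64_decode on the text pulled from the registry value and writing the resulting bytes to a file gives you the raw blob to load into a disassembler.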
When you first open the shellcode file in IDA, you're confronted with a message saying that the file isn't recognized as a valid PE file, or as any of the other executable file formats that IDA supports. That leaves you with the only option of loading the file as a raw binary.
As soon as we let IDA do its magic and disassemble the file, we see a call to an interesting function (sub_31) that takes five arguments.
To demystify what those arguments are and what the function does, let's peek into that function.
At the very start of the function, we see the usual shellcode stack string builds.
In case you don't know what that is and this is your first time seeing it, a stack string is a common shellcode string obfuscation mechanism whereby strings are built on the stack as needed at runtime instead of being hard-coded. Fortunately, IDA does a pretty good job of reconstructing them and representing them nicely in the decompiler output.
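To make the idea concrete, here is a minimal sketch of what a stack string looks like at the source level. The function name build_stack_string and the choice of "ntdll.dll" as the string are purely illustrative and are not taken from the sample; the volatile qualifier is only there to discourage the compiler from folding the stores back into a single string constant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: assemble "ntdll.dll" one character at a time
   in a local (stack) buffer, so the string never appears as a contiguous
   constant in the binary's data section. Returns the string length, or 0
   if `out` is too small. */
size_t build_stack_string(char *out, size_t out_cap)
{
    volatile char s[10];
    size_t i;

    assert(out != NULL);

    s[0] = 'n'; s[1] = 't'; s[2] = 'd'; s[3] = 'l'; s[4] = 'l';
    s[5] = '.'; s[6] = 'd'; s[7] = 'l'; s[8] = 'l'; s[9] = '\0';

    if (out_cap < sizeof s) return 0;

    /* Copy the assembled string, including its terminator, to the caller. */
    for (i = 0; i < sizeof s; ++i)
        out[i] = s[i];
    return sizeof s - 1;
}
```

In a disassembly this shows up as a run of single-byte (or packed multi-byte) mov instructions into consecutive stack offsets, which is the pattern the decompiler collapses back into a readable string.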
As we analyze the decompiled code, we notice a function named sub_B11 that takes a large hex constant.
This is a function that dynamically resolves function addresses by hash.
Dynamic API resolution is yet another common malware obfuscation technique that tends to complicate the analysis process. Functions are resolved at runtime rather than at load time, which keeps identifiable function names out of the Import Address Table (IAT).
Now let's take a look at that function. At the very start, we see an access to FS:[0x30]. FS is not a dedicated-purpose register but one of the segment registers, and it gets its purpose from the OS. In our case it points to a per-thread data structure, the Thread Information Block (TIB).
I won't go too deep into the details, as there are plenty of other resources that already explain very well how the FS register is leveraged to access key system structures such as the TIB and the PEB. These structures store information about the current process or thread, which malware needs for a variety of reasons, including detecting that it is being debugged, walking the list of loaded modules, and others.
The main thing to understand here is that the code is getting the address of another data structure, the Process Environment Block (PEB), which holds user-mode accessible information about the currently running process, including the list of loaded modules, process startup arguments, and more.
Having obtained the PEB pointer, the program walks the list of loaded modules, hashes their names, and breaks on the first module with a non-zero number of exports. It turns out that NTDLL.DLL is the first DLL to be loaded right after the executable image, so the routine ends up with the base address of NTDLL.DLL.
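The selection logic can be sketched with a simplified model of the module list. To be clear, the mock_module structure and the first_module_with_exports function below are hypothetical stand-ins of my own; the real walk goes through the loader data reachable from the PEB and reads the export count from each module's PE headers, and it also hashes each name, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a loader list entry.
   The list is circular: the last entry links back to the list head. */
struct mock_module {
    struct mock_module *next;   /* next module in load order */
    const char *name;           /* base name, e.g. "ntdll.dll" */
    uint32_t export_count;      /* number of exported functions */
};

/* Walk the list in load order and return the first module that exports
   at least one function, or NULL if no module does. */
const struct mock_module *first_module_with_exports(const struct mock_module *head)
{
    assert(head != NULL);

    for (const struct mock_module *m = head->next; m != head; m = m->next) {
        if (m->export_count != 0)
            return m;   /* an executable image normally has no exports */
    }
    return NULL;
}
```

With a list ordered as executable image, ntdll.dll, kernel32.dll, the executable is skipped because it exports nothing, and the walk stops at ntdll.dll.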
You might be asking, "Why is the program checking for the number of exported entries?"
My thinking is it's looking for the hash of the first DLL loaded, as only DLLs are supposed to expose functions to be imported by executable images.
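The hashing algorithm used to generate hashes for module names is the common ROR13 hash: the running hash is rotated right by 13 bits before each byte of the name is added, with lowercase letters converted to uppercase. Note that this is a bit rotation (ROR13), not the ROT13 letter-substitution cipher.
Below is a PoC for the hashing algorithm that calculates the hash for ntdll.dll.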
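#include <stdio.h>
#include <wchar.h>
#include <stdint.h>

/* Rotate a 32-bit value right by `bits` (0 < bits < 32). */
#define ROR32(value, bits) (((value) >> (bits)) | ((value) << (32 - (bits))))

/* ROR13 hash over the raw bytes of a wide-character module name.
   Assumes a 2-byte wchar_t (UTF-16LE), as on Windows. */
uint32_t calc_dll_hash(const wchar_t* name)
{
    uint32_t running_hash = 0;
    /* Length in bytes, including the terminating null character. */
    size_t length = (wcslen(name) + 1) * 2;
    const char* bytes = (const char*)name;

    for (size_t i = 0; i < length; ++i)
    {
        uint32_t rot = ROR32(running_hash, 13); /* ROR13: rotate right by 13 bits */
        char c = bytes[i];
        if (c < 0x61)
            running_hash = c + rot;         /* below 'a': add the byte as-is */
        else
            running_hash = c - 0x20 + rot;  /* fold lowercase to uppercase */
    }
    return running_hash;
}

int main(void)
{
    uint32_t hash = calc_dll_hash(L"ntdll.dll");
    printf("0x%x\n", (unsigned int)hash);
    return 0;
}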
The same hashing algorithm (ROR13) is used to calculate a hash value over the exported function name.
Having calculated the hash for NTDLL.DLL as well as for the exported function name, the program adds the two hash values together and compares the sum to the passed-in hash value.
In other words, the target function whose address is to be resolved is represented as the sum of two hash values: that of the target DLL and that of the function of interest.
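The combined comparison can be sketched as follows. All of the function names here are my own, and I am making a few assumptions that you should verify against the disassembly: the module name is hashed as UTF-16LE bytes including the terminator (as in the PoC above), the function name is hashed as ASCII bytes including the terminator, and whether function names are case-folded is left as a parameter because I have not confirmed it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One ROR13 step: rotate the running hash right by 13 bits, then add the
   next byte. With fold_case set, bytes 'a'..'z' are folded to uppercase. */
static uint32_t ror13_step(uint32_t h, uint8_t b, int fold_case)
{
    if (fold_case && b >= 'a' && b <= 'z')
        b = (uint8_t)(b - 0x20);
    return ((h >> 13) | (h << 19)) + b;
}

/* Hash an ASCII module name as if it were stored as UTF-16LE: each
   character contributes its low byte followed by a zero high byte, and
   the terminating null character is included. */
uint32_t hash_module_name(const char *ascii_name)
{
    uint32_t h = 0;
    size_t n;

    assert(ascii_name != NULL);
    n = strlen(ascii_name) + 1;
    for (size_t i = 0; i < n; ++i) {
        h = ror13_step(h, (uint8_t)ascii_name[i], 1);
        h = ror13_step(h, 0, 1);
    }
    return h;
}

/* Hash an ASCII export name, terminator included. */
uint32_t hash_function_name(const char *name, int fold_case)
{
    uint32_t h = 0;
    size_t n;

    assert(name != NULL);
    n = strlen(name) + 1;
    for (size_t i = 0; i < n; ++i)
        h = ror13_step(h, (uint8_t)name[i], fold_case);
    return h;
}

/* Non-zero when module hash + function hash (mod 2^32) equals `target`. */
int matches_target(const char *module, const char *function,
                   int fold_case, uint32_t target)
{
    uint32_t sum = (uint32_t)(hash_module_name(module) +
                              hash_function_name(function, fold_case));
    return sum == target;
}
```

Running matches_target over every export name of a candidate DLL is a quick way to test which function a given constant in the shellcode refers to.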
Discussion
What stood out for me is that the function accepts a single hash value which in my experience is a little odd, as mostly we see two hash values passed in, one for the DLL and the other for the function within the DLL.
However, in our case, the developers chose to represent the target function to lookup by hash with a single hash value.
That wraps up getAddrByHash (my name for sub_B11), which returns the address of a target function given its hash. This is very similar to the Windows API function GetProcAddress, which resolves a function address at runtime given the name of the function.
It turns out that the function whose hash value is 0xBDBF9C13 is ntdll!LdrLoadDll, the NTDLL equivalent of kernel32!LoadLibrary, which is used for dynamic DLL loading.
The other hash value resolves to ntdll!LdrGetProcedureAddress, the NTDLL equivalent of GetProcAddress.
This is a pretty standard shellcode dynamic function address resolution sequence.
One thing worth noting here is that the authors decided to go with the NTDLL equivalents of the LoadLibrary and GetProcAddress functions. They are probably trying to avoid common functions that are likely to be hooked by sandbox solutions.
Early on in our discussion, we saw that the main shellcode function took a few arguments, and at the time we didn't have any clue what they were. However, we recognized from the way the first argument is used, and from the offsets added to it, that it is the base address of some module.
Mainly, the rest of the function involves using dynamically resolved functions like VirtualAlloc and VirtualProtect to properly map that module in memory. Then, the execution is transferred to an export within that module.
You may be wondering, "How do you know that the module referred to is a DLL?"
As I read through the rest of the function, I noticed that it obtains a pointer to the export directory of the module and sets up a loop to iterate through the exported functions. The function then hashes each name using the same ROR13 algorithm and compares the result against the weird hex value that was passed into the function as the second argument.
It turns out that the hex value is a hash of one of the functions exported by the DLL.
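A minimal sketch of that export lookup is shown below. It works on a plain array of names rather than a real PE export directory, and the function names ror13_name_hash and find_export_by_hash are hypothetical. I am also assuming that export names are hashed as ASCII bytes including the terminator and without case folding, which you should confirm against the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ROR13 hash over an ASCII name, terminator included, no case folding. */
static uint32_t ror13_name_hash(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;
    uint32_t h = 0;

    do {
        h = ((h >> 13) | (h << 19)) + *p;
    } while (*p++ != 0);
    return h;
}

/* Scan `count` export names and return the index of the first one whose
   hash equals `target`, or -1 if none matches. In a real export directory
   this index would then be mapped through the ordinal table to the
   function's address. */
int find_export_by_hash(const char *const *names, size_t count, uint32_t target)
{
    assert(names != NULL);

    for (size_t i = 0; i < count; ++i) {
        if (ror13_name_hash(names[i]) == target)
            return (int)i;
    }
    return -1;
}
```

Feeding the export names of the dumped DLL into a loop like this, together with the constant passed as the second argument, should point to the export that the shellcode transfers execution to.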
In the next part of this series, we will focus on dynamically analyzing the extracted shellcode to dump the DLL to the disk. Then we will perform a static analysis of the DLL to examine the target export function of interest and see how it is used to set up for the next stage of the infection chain.
Thank you very much for joining me on this detailed analysis of a sample attributed to APT29 - NOBELIUM.
My name is Mohamed Talaat and I go by the code names DTM and Blu3Eye on my socials. I am a Computer Engineer with a Bachelor's degree in Computer Engineering from Suez Canal University (Ismailia, Egypt). Even though I don't come from a strong cybersecurity background, I took it upon myself to build a career in cybersecurity. I started as a cybersecurity generalist: I did a little bit of pen-testing and gained experience with tools such as Nmap, Metasploit, and Burp. After much thought, I found myself a better fit for blue teaming and malware analysis. I do malware analysis and development of TTPs, and I write detection rules as part of my daily routine. You can find me on LinkedIn and my website is below.