Searching the leaked correspondence of Russian officials. Basic tutorial
After Russia launched its full-scale invasion of Ukraine, various hackers started breaking into the mail servers of Russian organizations and releasing their contents to the public. With mounds of data to sift through, how can anyone find meaningful information in there?
For this purpose, we modified Aleph, a software environment for document indexing, and, running it in test mode, uploaded almost two dozen such "leaks" from Russian companies and governmental bodies. This text explains how to operate the system. You can find a link to the tool itself at the end of the article.
Translated by Dmitry Lytov, Mike Lytov
Why search someone else's correspondence?
There are no secrets in the digital world. Most users do not protect their mailboxes well enough to keep them from falling prey to hacker groups. At the same time, correspondence can contain a lot of important data that is not available to the public: details of personal life, commercial projects, tax evasion, shadowy schemes, passport data, flights, real hotel expenses, entertainment, and control over businesses. Hackers periodically gain access to such correspondence.
Interference with someone else's correspondence is illegal. However, if the information is already in the public domain and is of public interest, the skill of processing such data can be useful for journalists and activists.
Hackers periodically release, into open or semi-open access, databases of emails from hacked mailboxes and servers of political figures, corporations, and government organizations. However, this data is often too large to fit on an ordinary user's computer and can easily disappear from the network. In addition, it is almost impossible to work through thousands of files manually.
One way to facilitate the analysis of such "leaks" is to use Aleph, software developed by the Organized Crime and Corruption Reporting Project (OCCRP) team. It stores the data, recognizes important details in the text and lets users quickly search across different files and datasets. The program was built on the “follow the money” principle to help investigative journalists look for matches in business registers, real estate records, "leaks" of financial documents, etc. Aleph can recognize information related to financial instruments: account numbers, telephone numbers, e-mails, company names, etc. We decided that large datasets from mail servers can also be indexed in this way.
Some of the "leaks" were already processed by the media, and some were only mentioned as a single line in news reports. Therefore, in our opinion, the community of researchers and investigators may be interested in processing such documents in this format.
Datasets we make available through Aleph:
Neocom Geoservice — letters from a Russian engineering drilling company;
Transneft— correspondence of the R&D department of the Russian state oil-producing company;
MashOil — correspondence of a Russian company engaged in the creation and maintenance of drilling equipment;
Aerogas — correspondence of an oil and gas production company;
Gazregion — correspondence of a company engaged in oil and gas pipelines in the Russian Federation;
VGTRK (the All-Russian TV and Radio Company) — almost 1 million letters from the Russian state broadcaster;
Ministry of Culture of the Russian Federation — 230,000 e-mails;
Rosatom — 15 GB of files belonging to the Russian state mega-company that controls nuclear activities;
Diaconia — almost 60,000 letters from the charity department of the Russian Orthodox Church, which is responsible, among other things, for work with refugees (Investigation material);
German Chambers of Commerce — documents of Alexander Markus and his wife, Russian citizens and FSB agents, regarding business cooperation in Russia and Germany;
Khava-d — correspondence of Dmytro Khavchenko, a militant of the "DPR" with the nickname "Sailor", who was probably related to the crypto business and the laundering of money through it by the special services of the Russian Federation (part of this story, for example, here);
NZF_DNR_feb2017.tgz — files related to "DNR" connections;
ROSOBORONEXPORT — documents of Russia's leading arms supplier (discussed in the Russian media);
Sberbank of Russia — files of the translation bureau of the Russian Savings Bank;
Mosekspertiza — more than 150,000 letters and 8,200 files of a state-owned company that provides business consulting and evaluation services;
Continent Express — files and databases of a Russian travel company serving business and government in Russia;
Worldwide Invest — investment company correspondence;
Sawatzky — correspondence of a company engaged in real estate management in Russia;
Accent Capital — correspondence of a company that invests in real estate in Russia;
Tendertech — letters from a company engaged in the processing of financial documents in Russia;
Capital Legal Services — files of a Russian legal company;
Marathon Group — correspondence of the company of Alexander Vinokurov, son-in-law of Russian Foreign Minister Lavrov;
Gazprom Linde Engineering (in progress) — correspondence of a joint Russian-German enterprise engaged in engineering solutions for the oil and gas industry;
Technotec (in progress) - correspondence of the company that provides technical services to Rosneft and Gazpromneft in Russia.
What is important to know about the leaks before starting to work with them?
First, unless you have access to the inbox of a top official or a person / group of people who actually make decisions, the chance of getting a sensational data set right away is very small.
Most likely, a "dirty" dump will contain a lot of advertising spam and bills for storage, and if the dump belongs to a large company, it will also contain endless "redirections", greetings, monitoring, reviews, etc. Of course, these can also be the subject of research, but you should be ready to look for a needle in a hundred haystacks. It is better to start by roughly outlining a list of keywords or pieces of information that you plan to search for. Then build hypotheses, even if most of them turn out to be wrong in the end. Here's an example of what "keywords" might look like to find stories:
— Ukraine / Ukrainian, DPR / LPR / Crimea, Kyiv (and all the names of Ukrainian regional centers in variations), surnames of key persons, government officials, deputies, high-ranking collaborators, etc.
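A keyword list like this can be kept as a small script so that spelling and language variants are not forgotten. The sketch below is only an illustration: the topic names and the Cyrillic variants are assumptions, not the authors' list, and it assumes the search bar accepts OR between terms, as Elasticsearch-style query strings do.

```python
# Hypothetical seed terms; replace them with your own hypotheses.
TOPICS = {
    "ukraine": ["Ukraine", "Ukrainian", "Украина", "Україна"],
    "occupied": ["DPR", "DNR", "LPR", "LNR", "Crimea", "Крым"],
    "kyiv": ["Kyiv", "Kiev", "Киев", "Київ"],
}


def build_queries(topics):
    """Return one OR-joined query string per topic.

    Duplicate terms are dropped while keeping their original order, and
    multi-word terms are wrapped in quotes so they are searched as phrases.
    """
    queries = {}
    for name, terms in topics.items():
        unique = list(dict.fromkeys(terms))
        queries[name] = " OR ".join(
            f'"{term}"' if " " in term else term for term in unique
        )
    return queries


if __name__ == "__main__":
    for topic, query in build_queries(TOPICS).items():
        print(f"{topic}: {query}")
```

Each printed line can be pasted into the search bar as a starting point and then narrowed down by hand.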
Secondly, even if a person or organization is reckless enough to discuss espionage schemes, criminal plans, financial machinations, deception of clients or contractors, etc. by e-mail, the inbox that was leaked is not necessarily the main one used for such discussions. There is no single rule: "regular business" can be discussed from a private inbox to avoid being monitored by the corporate email security service. Or the opposite may be true: the topic will only circulate on corporate servers, so that it does not leave the organization. Which of these inboxes, main or secondary, ended up in the "dump" is entirely up to chance. It is worth adding that even in the case of such recklessness, the subject of the "scheme" can be disguised in code words. For example, "ship me 5 kg" can mean either a real 5 kg of something or 5 thousand / million in money.
Thirdly, email attachments can contain both useful documents (presentations, receipts, tickets, etc.) and spam or viruses.
The riskiest are PDF, DOC and Excel files. Therefore, you should be very careful when opening attachments from other people's mailboxes. Or better yet, don't open them at all: Aleph allows you to read the contents of files without opening them on your computer. This does not guarantee 100% security, but it eliminates most of the potential routes for a cyber attack.
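If you only have a raw dump at hand, you can at least take an inventory of attachments without opening any of them. The following is a minimal sketch using Python's standard email library; the function name is hypothetical, and it assumes the messages are stored as ordinary RFC 5322 files (for example .eml).

```python
from email import message_from_bytes, policy


def list_attachments(raw_message: bytes):
    """Return (filename, content type, size in bytes) for each attachment.

    The payload is decoded only to measure its size. Nothing is written to
    disk and nothing is handed to an external viewer.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""
        found.append((part.get_filename(), part.get_content_type(), len(payload)))
    return found
```

Keep in mind that the declared content type and file name come from the sender and can be misleading, so this inventory is a triage aid, not a safety check.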
Fourthly, it should be noted that the downloaded correspondence will refer to a certain time period and is not necessarily the most recent. And, of course, it is not updated, unless the hackers tweaked the victim’s mailbox to forward all messages to an external server. This means that the story you find in the correspondence can start in the middle or end “halfway through a word”.
How to sort letters from strangers?
One of the ways to sort letters from those leaks is to use the Aleph software environment. OCCRP investigators work in it with logs and leaks of documents like the Panama Papers, looking for connections between tens of millions of figures around the world. Based on this we assumed that the software would be able to help us sort through such a disorganized mass of data as the leaked correspondence of a number of Russian people and companies.
Here is what came out of it
We ran our own version of Aleph on our server in an experimental mode, and we uploaded the leaks of mail from Russian officials and businesses. Some of them have already received detailed analysis in the media, and some were only briefly mentioned as proof of the hacking of the mail of a certain organization.
Among the interesting datasets are the correspondence of employees of VGTRK, the main propaganda television and radio company of Russia, several investment and legal firms, real estate management companies, including Marathon Group, a company owned by Aleksandr Vinokurov, the son-in-law of the Minister of Foreign Affairs of Russia, Sergey Lavrov. Our experiment also included some correspondence of the synodal charity department of the Russian Orthodox Church, the Ministry of Culture of Russia, etc.
At this stage, we are not searching for specific stories but showing what exactly the program "can do".
Its main advantage is that it independently identifies and structures information found within a large array of documents (including document scans).
Among the types of data that the program selects for analysis:
emails
documents (doc files)
presentations
images
tables
people, web pages
video files
audio files
text files
planned events (Zoom meetings, Google calendar events), etc.
Entities that "Aleph" "recognizes" automatically:
phone numbers
names, surnames
bank accounts
email addresses
postal addresses
Such data is available as soon as the program has made its first pass over the database. What's more, Aleph sorts mentions of certain information and lets you see how often a certain email address, phone number, etc. was used. Of course, the frequency of mentions does not necessarily indicate the importance of a particular person.
For example, the phone number of the office can simply be recorded in the auto-signature of all its employees, or all incoming documentation goes through the secretary's e-mail. Similarly, a large number of service requests will most likely pass through the mailboxes of the IT department.
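The idea of counting mentions can be illustrated outside Aleph as well. The sketch below is not how Aleph works internally: it uses deliberately rough regular expressions (an assumption made for illustration) and counts each address or number at most once per message, which reduces, but does not remove, the inflation caused by signatures and forwarded threads.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Very rough: a "+" followed by 10 to 14 digits that may be separated
# by spaces, dashes or brackets.
PHONE_RE = re.compile(r"\+\d(?:[\s()-]*\d){9,13}")


def count_mentions(messages):
    """Count in how many messages each email address and phone number appears."""
    emails, phones = Counter(), Counter()
    for text in messages:
        # Sets ensure one count per message, however often the item repeats.
        emails.update({match.lower() for match in EMAIL_RE.findall(text)})
        phones.update({re.sub(r"\D", "", match) for match in PHONE_RE.findall(text)})
    return emails, phones
```

Calling `emails.most_common(20)` on the result gives a ranking comparable in spirit to the lists Aleph shows, with the same caveat: a high count may simply point to a secretary's mailbox or an office signature.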
However, even the first breakdown of the data can give certain clues. For example, in a dump of letters from the investment company "Accent Capital", two Cypriot numbers catch the eye on the list of the most frequently mentioned phone numbers. It can be assumed that this indicates relatively intensive correspondence with a certain Cypriot client or contractor (further analysis confirms this). In this case, this is nothing unusual for an investment business. However, other, more exotic phone codes could lead to a more interesting story.
The first breakdown also shows the most intensively used e-mail addresses and the names mentioned in the correspondence.
Behind each of these entries is the full list of mentions of that entity in our dataset. It is important to note that Aleph is quite sensitive to variations of names and surnames. It will therefore treat "Vladimir Zelenskyi", "Vova Zelenskyi", "V. Zelenskiy" and "Volodymyr Zelenskiy" as different entities.
Keep this in mind when performing a basic search, especially for foreign names whose Cyrillic spelling is not obvious (or, vice versa, for Russian and Ukrainian names written in Latin characters), since they can be recorded in multiple variations.
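One way to cope with spelling variants after exporting a list of recognized names is to group near-identical strings. The sketch below uses Python's standard difflib; the 0.8 threshold and the function names are assumptions chosen for illustration, not part of Aleph.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name, drop dots and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the normalized names are close enough to be the same spelling."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def group_variants(names, threshold: float = 0.8):
    """Greedily put each name into the first group whose first member it resembles."""
    groups = []
    for name in names:
        for group in groups:
            if similar(name, group[0], threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

String similarity catches transliteration differences such as "Zelenskiy" and "Zelenskyi", but not diminutives or different language forms of a first name ("Vova", "Vladimir", "Volodymyr"); those still need a manual list of aliases.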
Basic search
A basic database search is performed using the search bar. If at the initial stage you do not have a clear understanding of what information you are looking for, you can try to work with the simplest hypotheses and enter words related to them. However, remember that a search query for "Ukraine" will return not only mentions of Ukraine in correspondence, but also reservations at hotels with that name. And the surname "Zelensky" can refer not only to the Ukrainian president, but also to other people who are not related to him in any way. Therefore, the more specific the query, the better.
Advanced search settings allow you to search for part of a word, exclude certain results from the search, or search with a tolerance of several letters (useful when looking for men and women with the same surname, where 2-3 letters change in the ending).
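These three options map onto query-string operators. The helper functions below are hypothetical and only build the text of a query; they assume the Elasticsearch-style syntax that Aleph's search bar is generally understood to accept (`*` for word parts, a leading `-` for exclusion, `~N` for a spelling tolerance), so check the behavior on your own instance.

```python
def phrase(text: str) -> str:
    """Exact phrase: wrap the text in double quotes."""
    return f'"{text}"'


def prefix(stem: str) -> str:
    """Part of a word: match anything that starts with the stem."""
    return f"{stem}*"


def fuzzy(term: str, edits: int = 2) -> str:
    """Spelling tolerance: allow up to `edits` changed letters (0 to 2)."""
    if edits not in (0, 1, 2):
        raise ValueError("Elasticsearch-style fuzziness is limited to 0, 1 or 2 edits")
    return f"{term}~{edits}"


def exclude(query: str, *unwanted: str) -> str:
    """Exclusion: append each unwanted term with a leading minus sign."""
    return " ".join([query, *(f"-{term}" for term in unwanted)])


if __name__ == "__main__":
    print(prefix("Vinokurov"))        # surname with any ending
    print(fuzzy("Zelenskiy"))         # tolerate transliteration differences
    print(exclude("Ukraine", "hotel"))  # drop hotel reservations
```

Because the tolerance operator stops at two changed letters in this syntax, a prefix search is usually the safer choice for gendered surname endings.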
For example, let us search for the name of Aleksandr Vinokurov, owner of Marathon Group and son-in-law of Russian Foreign Minister Sergey Lavrov. The results reveal several of his e-mail addresses (corporate and personal) and let us look at letters where he is the sender or addressee. We also find several mentions of his first name and surname in the mail of one of his assistants, who receives monitoring messages from a search engine about mentions of the boss in the media.
Another example. Using the keyword "DNR" in the "Diaconia" dataset, you can find references to documents on the distribution of refugees in the regions of Russia (this activity was coordinated by the Russian Orthodox Church with the Ministry of Emergency Situations of Russia, which was reported in April by journalists from Slidstvo based on the analysis of this dataset).
Comparison with other datasets
Aleph includes the so-called "investigation mode": simply put, it compares multiple datasets and finds matches between them. The datasets used for comparison do not necessarily have to be other mail dumps.
As a simple experiment, we uploaded into the system our database of pseudo-sociologists and a list from the NAZK website of sanctioned persons and companies, as well as those only being considered for future sanctions. One example of an effective match is the list of VGTRK journalists under sanctions, part of whose correspondence can be viewed in the respective "dump".
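The principle behind such a comparison can be shown with a toy example. This is not Aleph's matching algorithm, which is considerably more elaborate; it is a minimal sketch that assumes both lists are plain names and treats "Surname, Name" and "Name Surname" as the same person.

```python
def normalize(name: str) -> str:
    """Lowercase, drop commas and sort the words so that word order is ignored."""
    return " ".join(sorted(name.lower().replace(",", " ").split()))


def cross_reference(dump_names, watchlist):
    """Return sorted (name in dump, name on watchlist) pairs that normalize alike."""
    index = {}
    for listed in watchlist:
        index.setdefault(normalize(listed), listed)
    matches = {
        (name, index[normalize(name)])
        for name in dump_names
        if normalize(name) in index
    }
    return sorted(matches)
```

Exact matching after normalization produces few false positives but misses transliteration variants, so every match, and every surprising absence, should still be checked by hand.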
Additional features
The system allows you to set up notifications for when new datasets added to the database mention a surname or other information you are interested in (address, phone number, etc.). In addition, the built-in graphical editor allows you to visualize the relationships between the persons and entities in the datasets yourself. Also, users with programming skills can download network information from Aleph and work with it in other, more convenient environments. Read more about it here.
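For readers with programming skills, a request to the search API might be prepared as follows. This is a sketch under stated assumptions: the `/api/2/entities` path, the `filter:collection_id` parameter and the `ApiKey` authorization scheme follow the public Aleph API as commonly documented, and whether this modified instance exposes them in the same way is not confirmed by the article. The host name and key in the usage note are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(host, api_key, query, collection_id=None, limit=50):
    """Prepare (but do not send) a search request against an Aleph instance."""
    params = {"q": query, "limit": limit}
    if collection_id is not None:
        # Restrict the search to a single dataset (assumed parameter name).
        params["filter:collection_id"] = collection_id
    url = f"{host.rstrip('/')}/api/2/entities?{urlencode(params)}"
    headers = {
        "Authorization": f"ApiKey {api_key}",
        "Accept": "application/json",
    }
    return Request(url, headers=headers)
```

A request built with `build_search_request("https://aleph.example.org", "YOUR_KEY", "Vinokurov")` can then be sent with `urllib.request.urlopen` and the JSON response parsed with the `json` module.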
Get on Aleph and start working with it
We want to provide access to the "leaks" primarily to Ukrainian researchers and journalists. Therefore, we ask you to fill out a short form for verification. We will provide you with access to the leaks via the e-mail address you specify.
The project was implemented with the financial support of the International Renaissance Foundation