Cybersecurity Data Sources: A Comprehensive Guide to Open-Source Intelligence (OSINT)
DataScience CyberSecurity - #0002
Introduction
Open-Source Intelligence (OSINT) is the practice of collecting and analyzing publicly available data from a variety of sources, including social media, news outlets, blogs, websites, broadcast TV, and radio. This encompasses data mining, crawling techniques, data extraction, and data analysis. OSINT involves gathering data in various forms such as text, video, images, or audio. With the incorporation of advanced technologies like machine learning and neural networks, OSINT tools can recognize trends and patterns, identifying crucial elements such as individuals or topics. OSINT has become an essential tool in the cybersecurity toolkit, providing valuable information that can help identify threats, vulnerabilities, and potential attacks.
OSINT: Applications in Data Science
“Data is a precious thing and will last longer than the systems themselves.” - Tim Berners-Lee
Data is Power
In today’s digital age, we are bombarded with a staggering amount of information from online sources. Every day, hundreds of thousands of hours of video are uploaded, millions of images are shared, and vast amounts of text are published, more than search engines can index. This deluge does not even include data behind restricted access, such as private databases and paid services.
The internet is the world’s largest database of information, and it’s growing exponentially. To leverage this vast pool of data effectively, we need to employ data mining and AI/ML techniques to acquire, process, analyze, and identify threats and risks. These technologies enhance decision-making across organizations by providing actionable intelligence.
Open Source Data
Data is generally considered open source when it meets any of the following criteria:
Published or broadcast for public consumption: Information readily available in media such as newspapers, television, or radio.
Available on request to the public: Data that anyone can obtain upon request, such as public records or documents released under Freedom of Information Act (FOIA) requests.
Accessible online or otherwise to the public: Information available on websites or in public libraries.
Available to the public by subscription or purchase: Data that requires a subscription or one-time payment to access, such as industry reports or journals.
Could be seen or heard by any casual observer: Observational data from public places or events.
Made available at a meeting open to the public: Information shared during public meetings or conferences.
Obtained by visiting any place or attending any event that is open to the public: Data gathered from public places or events, like trade shows or public demonstrations.
Categories of OSINT Data Sources
Social Media Platforms:
Real-time insights and trends from platforms like Twitter, Facebook, LinkedIn, and Reddit.
Public Databases and Government Publications:
Official data and security advisories from sources such as the National Vulnerability Database (NVD). The Open Source Vulnerability Database (OSVDB) played a similar role until it shut down in 2016.
Dark Web and Deep Web:
Content that standard search engines do not index (the deep web) and hidden services where threat actors often operate (the dark web). The dark web is reachable only through specialized tools such as Tor.
Online Forums and Communities:
Discussions and information sharing among cybersecurity professionals in forums like Stack Exchange and specialized cybersecurity communities.
WHOIS and DNS Data:
Domain registration details and historical DNS information for tracking and attribution purposes.
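As a rough illustration of how WHOIS data becomes usable for tracking and attribution, the sketch below parses the "Key: value" lines of a raw WHOIS response into a dictionary. The function name parse_whois and the sample response are made up for this example, and real WHOIS output varies considerably between registries, so treat this as a starting point rather than a general parser.

```python
def parse_whois(raw: str) -> dict[str, list[str]]:
    """Parse 'Key: value' lines from a WHOIS response into a dict of lists."""
    record: dict[str, list[str]] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and notices that lack a key/value shape.
        if not line or line.startswith(("%", "#", ">>>")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            # Keys can repeat (e.g. several name servers), so collect lists.
            record.setdefault(key.strip().lower(), []).append(value)
    return record


# Made-up sample response for the reserved domain example.com.
SAMPLE = """\
Domain Name: EXAMPLE.COM
Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
>>> Last update of whois database: 2024-01-01T00:00:00Z <<<
"""

record = parse_whois(SAMPLE)
print(record["name server"])
```

Storing every field as a list keeps repeated keys such as name servers intact, which matters when comparing records across domains.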
OSINT Techniques
Web Scraping:
Extracting data from websites using specialized algorithms and tools.
Social Media Analytics:
Analyzing social media data to understand trends, sentiment, and behavior.
Geospatial Intelligence:
Analyzing geospatial data to understand location-based trends and patterns.
Network Scanning:
Scanning networks to identify vulnerabilities and open ports.
Smart Searching with Google Dorking:
Using advanced search operators to discover information that ordinary queries miss. Similar operators are available in Bing, DuckDuckGo, and other search engines.
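To make the last technique concrete, here is a minimal sketch of assembling a search query from operators such as site:, filetype:, and intitle:. The helper build_dork is hypothetical, and operator support differs between search engines, so the resulting string may need adjusting for the engine in use.

```python
from urllib.parse import quote_plus


def build_dork(terms: str = "", **operators: str) -> str:
    """Combine operator:value pairs and free-text terms into one query string."""
    parts = []
    for name, value in operators.items():
        # Quote values containing spaces so the operator applies to the phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    if terms:
        parts.append(terms)
    return " ".join(parts)


query = build_dork("annual report", site="example.com", filetype="pdf")
print(query)  # site:example.com filetype:pdf annual report

# URL-encoded form, ready to append to a search engine's query parameter.
print("q=" + quote_plus(query))
```

Keeping the operators as keyword arguments makes it easy to generate many variations of a query for the same target.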
Tools for OSINT
Recon-ng:
A powerful Python tool that automates time-consuming OSINT activities such as data gathering.
Maltego & Maltego CE:
Uncovers relationships between people, companies, domains, and publicly accessible data. It supports interactive data mining, with graph visualizations that show how the collected entities relate to one another.
theHarvester:
A simple tool designed to capture public data that exists outside an organization’s own network, such as email addresses, subdomains, and hostnames.
Shodan:
A search engine for internet-connected devices, with a focus on the Internet of Things (IoT). It indexes exposed devices and services, including webcams, traffic lights, routers, smart appliances, and other systems reachable from the internet.
Babel X:
An AI-enabled data aggregation and analysis tool.
RSOE EDIS:
A geospatial tool for emergency and disaster incident reporting. It monitors, aggregates, and analyzes incident information and sends notifications.
Advantages of OSINT Tools
Data Aggregation: Converting unstructured data into a structured, queryable, filterable, sortable, and digestible format.
Analysis and Enrichment: Augmenting the data with additional metadata and third-party information to help analyze, validate, label, group, and deduplicate.
Visualization: Creating mind maps, relationship graphs, geographical views, and time-lapse views of the data.
Automated Alerting and Reporting: Continuous monitoring and automation allow for timely alerts and reporting.
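The aggregation and deduplication steps above can be sketched in a few lines. The example below pulls email addresses, IPv4 addresses, and CVE identifiers out of free text and records which source each one came from. The patterns, the function name extract_indicators, and the sample snippets are illustrative assumptions, and the regular expressions are deliberately simple (the IPv4 pattern, for instance, does not check octet ranges).

```python
import re
from collections import defaultdict

# Simplified patterns for illustration; production tools use stricter ones.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def extract_indicators(documents):
    """Return {kind: {value: sorted source ids}} with values deduplicated."""
    found = defaultdict(lambda: defaultdict(set))
    for source_id, text in documents.items():
        for kind, pattern in PATTERNS.items():
            for match in pattern.findall(text):
                # Normalize case so the same indicator is counted only once.
                value = match.upper() if kind == "cve" else match.lower()
                found[kind][value].add(source_id)
    return {
        kind: {value: sorted(sources) for value, sources in values.items()}
        for kind, values in found.items()
    }


# Made-up snippets standing in for collected posts or pages.
DOCS = {
    "post-1": "Contact admin@example.com about CVE-2021-44228 on 192.0.2.10.",
    "post-2": "Scans from 192.0.2.10 continue; write to Admin@Example.com.",
}

indicators = extract_indicators(DOCS)
for kind, values in indicators.items():
    print(kind, values)
```

Recording the source identifiers alongside each value is what later makes the data filterable and lets an analyst trace a finding back to where it was observed.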
The Synergy: AI and OSINT
The combination of AI and OSINT is more than the sum of its parts. AI technologies enable computers to mimic human cognitive functions, such as learning, reasoning, problem-solving, and pattern recognition. This synergy has the potential to revolutionize the way information is gathered and processed, offering valuable insights for a wide range of applications, including cybersecurity, business intelligence, threat analysis, and decision-making.
Benefits of AI in OSINT
Efficient Data Processing:
AI can process vast amounts of data quickly and accurately, reducing the time and effort required for manual analysis.
Pattern Recognition:
AI algorithms can identify patterns and anomalies in data, enabling analysts to uncover valuable insights that may have gone unnoticed.
Improved Accuracy:
AI-powered OSINT can reduce the risk of human error in repetitive tasks, yielding more consistent results, although its output still needs to be validated by an analyst.
Scalability:
AI can handle large volumes of data, making it an ideal solution for big data analytics.
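As a small illustration of the pattern-recognition idea, the sketch below flags unusual spikes in a series of daily keyword mentions. It uses a plain z-score rule, a basic statistical technique standing in for the machine learning models an actual OSINT platform would apply. The function name, the threshold, and the counts are made up for this example.

```python
from statistics import mean, pstdev


def flag_anomalies(counts, threshold=2.5):
    """Return the indices whose z-score magnitude exceeds the threshold."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Made-up daily mention counts for a keyword; index 7 contains a spike.
DAILY_MENTIONS = [12, 15, 11, 14, 13, 12, 16, 58, 14, 13]
print(flag_anomalies(DAILY_MENTIONS))  # [7]
```

A flagged day is only a prompt for an analyst to look closer; the rule says nothing about why the spike occurred.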
Example: theHarvester
The following examples show how theHarvester can be used to collect OSINT data.
The first command uses theHarvester to search for up to 200 results related to the domain microsoft.com using the Bing search engine. This can help gather intelligence on a target organization, for example by revealing hosts and email addresses that make up its public footprint.
The second command targets the domain facebook.com and fetches up to 200 results from DNSdumpster, a DNS reconnaissance service. This aids in uncovering subdomains and other DNS information about the target organization.
The third command targets the domain ubuntu.com and queries all available sources for up to 200 results. This comprehensive search helps to collect diverse and extensive information about the target organization.
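The three searches described above can be sketched as argument lists built in Python. This assumes theHarvester's -d (target domain), -l (result limit), and -b (data source) options; the source names bing, dnsdumpster, and all are assumptions that depend on the installed version, and the helper harvester_command is hypothetical. The script only prints the commands and does not run them.

```python
import shlex


def harvester_command(domain: str, source: str, limit: int = 200) -> list[str]:
    """Build an argument list: -d target domain, -l result limit, -b source."""
    return ["theHarvester", "-d", domain, "-l", str(limit), "-b", source]


# The three searches described in the text; source names are assumptions.
COMMANDS = [
    harvester_command("microsoft.com", "bing"),
    harvester_command("facebook.com", "dnsdumpster"),
    harvester_command("ubuntu.com", "all"),
]

for argv in COMMANDS:
    # Print the shell-safe command line instead of executing it.
    print(shlex.join(argv))
    # To run it for real (theHarvester installed, target authorized):
    # import subprocess; subprocess.run(argv, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a domain or source name ever contains unexpected characters.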
Example: Maltego
Company: Kaspersky
In this Maltego analysis, the "Search Domains by Company" transform is applied to the company Kaspersky. The transform uncovers domains associated with the organization and displays a network of linked entities such as Kaspersky Labs GmbH, Kaspersky Labs Limited, and key individuals including Eugene Kaspersky. The resulting graph shows how Kaspersky's presence spans multiple countries and registries, giving an OSINT investigation a broad starting map.
Person: Mark Zuckerberg
In this Maltego analysis, the "Google Social Network" transform is applied to the individual Mark Zuckerberg. This transform searches for social media profiles associated with the person, revealing connections across various platforms such as Facebook, Instagram, LinkedIn, Quora, Reddit, and YouTube. The resulting visualization presents a comprehensive network of Mark Zuckerberg's online presence, illustrating how his identity and activities are represented across different social media channels.
Key Takeaways
OSINT is based exclusively on publicly available data, such as the contents of the open web.
AI is key to expanding and improving OSINT. In particular, it enables human analysts to collect, enrich, analyze, and disseminate information in a timely and decisive manner.
Human interaction and input are vital to the process and cannot be entirely replaced by AI.