Skip to content

Latest commit

 

History

History
63 lines (40 loc) · 3.67 KB

File metadata and controls

63 lines (40 loc) · 3.67 KB

D3.1 Data Collection

Data collection is a fundamental step in the Business Intelligence (BI) process, serving as the foundation upon which all analysis, reporting, and decision-making is built. This module explores the various data sources you might encounter, the challenges associated with data collection, and the importance of proper attribute alignment.

Understanding Data Sources

CSV Files

CSV (Comma-Separated Values) files are a common data format for exporting and importing tabular data. They are easy to generate from various applications and straightforward to process with BI tools.

Use Case: Ideal for relatively simple datasets without hierarchical structures, commonly used for initial data loading and quick exports from databases and spreadsheets.

APIs

Application Programming Interfaces (APIs) allow for automated data retrieval from external systems, providing live access to updated data without manual intervention.

Use Case: APIs are crucial for real-time data feeds, such as financial markets, weather data, or integrating third-party services directly into your BI processes.

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate.

Use Case: JSON is extensively used in web applications for exchanging data between clients and servers. It's particularly useful for unstructured data that doesn't fit neatly into tables.

Other Data Sources for Collection

  • Databases: Directly connecting to SQL or NoSQL databases to pull existing data.
  • Web Scraping: Extracting data from web pages, useful for sources where no direct database access or API is available.
  • IoT Devices: Gathering data from sensors and smart devices, which often produce large volumes of real-time data.

Challenges in Data Collection

Data Quality

  • Accuracy: Ensuring the data accurately reflects real-world conditions.
  • Consistency: Addressing discrepancies in how data is formatted and stored across sources.
  • Completeness: Handling missing or incomplete data, which can lead to skewed analytics.

Scalability

  • Volume: Managing large volumes of data can be challenging, requiring robust infrastructure.
  • Velocity: High-velocity data like streaming data demands real-time processing capabilities.

Security

  • Data Sensitivity: Ensuring sensitive data is securely handled and compliant with regulations like GDPR or HIPAA.
  • Access Control: Controlling who has access to what data, preventing unauthorized data breaches.

Attribute Alignment

Attribute alignment involves ensuring that the data collected from various sources maintains a consistent structure and naming convention, which is critical for accurate analysis. Misalignments can occur due to:

  • Different Source Formats: Variations in how data is structured or named across sources.
  • Schema Changes: Updates in the data source that are not mirrored in the BI system.

Correctly aligning attributes involves:

  • Standardization: Establishing common standards for data formats, naming conventions, and schemas.
  • Data Transformation: Applying transformations during the ETL process to align disparate data into a unified format.

Key BI Topics in Data Collection

  • Automated Data Collection: Implementing automated systems to collect data with minimal human intervention.
  • Data Integration: Combining data from different sources to provide a unified view.
  • Data Governance: Establishing policies and procedures to manage data availability, usability, integrity, and security.

Effective data collection is the first step in data-driven decision-making.