GenAI-Superstream

Agentic Engineering for Data Analysis

Introduction

GenAI-Superstream is a project demonstrating Agentic Engineering and "Vibe Coding" principles for Data Science applications. Created by Reuven Cohen (rUv), this project showcases the integration of advanced AI-powered data analysis techniques with a focus on analyzing the impact of AI on technical jobs in the UK job market.

This implementation demonstrates how to use DuckDB (a high-performance analytical database) in combination with scikit-learn to collect, analyze, and visualize jobs data, with a particular focus on understanding how AI is transforming technical roles in the UK job market.

Core Concepts

Agentics

Agentics (pronounced /əˈdʒentɪks/) is the field of study and engineering practice focused on creating intelligent systems that are:

  • Proactive: Anticipating and initiating changes
  • Autonomous: Operating independently
  • Collaborative: Working effectively with other systems or agents
  • Targeted: Pursuing defined objectives

This PACT framework guides the development of AI systems that can work alongside humans to solve complex problems.

Vibe Coding

Vibe Coding is an approach to software development that emphasizes:

  • Crafting code by feel, balancing functionality with readability, style and project mood
  • Rapid iteration guided by intuitive feedback
  • Emphasis on clean, expressive naming and structure
  • Aligning code style with team culture and project "vibe"

Benefits include:

  • Faster prototyping with fewer mental blocks
  • Code that feels approachable and motivates collaboration
  • Consistency through shared aesthetic standards

Vibe Coding vs Agentic Engineering

Vibe Coding                    Agentic Engineering
Flow                           Structured
Fluid and Intuitive            Process Driven
Little Process                 Deep Planning / Test Driven
Human is the feedback loop     Machine is the feedback loop
Ideation and Discovery         Iteration and refinement

SPARC Methodology

SPARC is a comprehensive methodology designed to guide the development of robust and scalable applications. SPARC stands for:

  • Specification: Define clear objectives, detailed requirements, user scenarios, and UI/UX standards
  • Pseudocode: Map out logical implementation pathways before coding
  • Architecture: Design modular, maintainable system components using appropriate technology stacks
  • Refinement: Iteratively optimize code using autonomous feedback loops and stakeholder inputs
  • Completion: Conduct rigorous testing, finalize comprehensive documentation, and deploy structured monitoring strategies

Each step ensures thorough planning, execution, and reflection throughout the project lifecycle.

Model Context Protocol (MCP)

MCP (Model Context Protocol) is the new standard for LLM-tool integration:

  • Simple, composable, and totally abstracted
  • Turns any tool into a native function any model can call
  • Secure, two-way connections between models and external tools
  • Plug-and-play "USB-C for AI"
  • Zero-friction developer experience
  • Unix mindset applied to AI
  • Built-in guardrails

Project Implementation

DuckDB Overview

DuckDB is a high-performance, in-process analytical database management system optimized for complex, large-scale analytical SQL queries. It is often described as "SQLite for analytics" thanks to its light footprint, easy integration, and ability to run within the same process as the application, with no separate database server required.

Key features:

  • Columnar-vectorized query execution engine: Processes large batches of data at once
  • Full SQL support: Including complex queries and window functions
  • Support for popular data formats: CSV, Parquet, etc.
  • ACID transactional guarantees: Ensures data integrity
  • Seamless Python integration: Works with pandas DataFrames
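
As a quick illustration of this in-process model, the snippet below (a minimal sketch, not part of the repository) queries a pandas DataFrame directly with SQL; duckdb and pandas are the only requirements:

import duckdb
import pandas as pd

# An in-memory DataFrame; DuckDB can query it by variable name, no server needed
jobs = pd.DataFrame({
    "title": ["ML Engineer", "Data Analyst", "DevOps Engineer"],
    "ai_impact": [0.9, 0.6, 0.3],
})

# Run vectorized SQL directly over the DataFrame and get a DataFrame back
result = duckdb.query(
    "SELECT title, ai_impact FROM jobs WHERE ai_impact > 0.5 ORDER BY ai_impact DESC"
).to_df()
print(result)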

UK AI Jobs Data Analysis

This project uses DuckDB to analyze AI's impact on technical jobs in the UK. Key components:

  1. Data Collection: Gathering representative job postings from reliable sources
  2. Data Storage: Structured database schema optimized for analytics
  3. Data Analysis: SQL queries and scikit-learn integration for insights
  4. Visualization: Visual representation of trends and patterns

The job data includes information such as:

  • Job titles and companies
  • Locations and salary ranges
  • Job descriptions
  • Metrics quantifying AI's impact on each role
  • Posting dates and sources
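
These are the same columns the insertion snippet later in this README writes, so a schema along the following lines would support them. This is a sketch inferred from that column list, not necessarily the exact DDL in db/init_duckdb.py:

import duckdb

con = duckdb.connect('db/uk_jobs.duckdb')
con.execute('''
    CREATE TABLE IF NOT EXISTS jobs (
        job_id      VARCHAR PRIMARY KEY,  -- stable identifier from the source
        title       VARCHAR,
        company     VARCHAR,
        location    VARCHAR,
        salary      VARCHAR,              -- left as text to allow ranges
        description VARCHAR,
        ai_impact   DOUBLE,               -- metric quantifying AI's impact on the role
        date_posted DATE,
        source      VARCHAR
    )
''')
con.close()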

Scikit-learn Integration

The project demonstrates how to integrate DuckDB with scikit-learn for advanced analytics:

  1. Feature Engineering: Extracting relevant features from job data
  2. Clustering: Identifying patterns and grouping similar jobs
  3. Trend Analysis: Tracking changes in AI impact over time
  4. Visualization: Creating insightful visualizations of the data
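
Much of the trend-analysis step can be pushed into DuckDB itself before scikit-learn sees the data. The sketch below (assuming the jobs table used throughout this README) computes the average AI-impact score per month:

import duckdb

con = duckdb.connect('db/uk_jobs.duckdb')
# Aggregate inside DuckDB: average AI impact and posting volume per month
trend = con.execute('''
    SELECT date_trunc('month', date_posted) AS month,
           AVG(ai_impact)                   AS avg_ai_impact,
           COUNT(*)                         AS postings
    FROM jobs
    GROUP BY month
    ORDER BY month
''').fetch_df()
con.close()
print(trend)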

Getting Started

Prerequisites

  • Python 3.8+ (3.6+ for basic implementation)
  • pip (Python package manager)

Dependencies

The project uses different dependency sets depending on the implementation:

Basic Implementation (db/ directory):

  • duckdb - High-performance analytical database
  • pandas - Data manipulation and analysis
  • numpy - Numerical computing
  • scikit-learn - Machine learning library
  • matplotlib - Plotting and visualization
  • tabulate - Pretty-print tabular data

Advanced Implementation (advanced/ directory):

  • All basic dependencies plus:
  • seaborn - Statistical data visualization
  • python-dotenv - Environment variable management

MCP Server (genai-mcp/ directory):

  • mcp - Model Context Protocol framework
  • gradio - Web-based UI framework
  • pytest - Testing framework
  • pyyaml - YAML configuration support

Installation

  1. Clone the repository:

    git clone https://github.com/ruvnet/GenAI-Superstream.git
    cd GenAI-Superstream
  2. Install the required dependencies:

    Basic Implementation:

    pip install duckdb pandas numpy scikit-learn matplotlib tabulate

    Advanced Implementation (recommended):

    pip install duckdb pandas numpy scikit-learn matplotlib tabulate seaborn python-dotenv
  3. Set up the MCP server (optional):

    cd genai-mcp
    make setup
  4. Optional: Install the Roo Code extension for VS Code:

    • Open VS Code
    • Go to Extensions view
    • Search for Roo Code
    • Click Install

Usage

The project consists of several Python scripts that demonstrate different aspects of the system:

  • db/init_duckdb.py: Core database initialization and utility functions
  • db/jobs_analytics_example.py: Example analytics and visualization with scikit-learn
  • db/insert_ai_jobs_duckdb.py: Script to insert representative UK AI jobs data
  • scripts/review_uk_jobs.py: CLI tool to review and filter job data

Tutorials

Setting Up the DuckDB Database

To initialize the DuckDB database with the proper schema:

from db.init_duckdb import JobsDatabase

# Create a new database instance
db = JobsDatabase()

# Initialize the schema
db.initialize_schema()

print("Database initialized successfully!")
db.close()

Populating with Jobs Data

You can populate the database with UK AI jobs data using the PerplexityAI MCP integration in Roo Code. This approach leverages AI to gather the latest jobs data:

Using PerplexityAI MCP

  1. First, configure the Composio PerplexityAI MCP in your project by creating an mcp.json file:
{
  "servers": {
    "perplexityai": {
      "url": "mcp.composio/your-key-url",
      "tools": [
        "PERPLEXITYAI_PERPLEXITY_AI_SEARCH"
      ]
    }
  }
}
  2. In Roo Code, interact with the PerplexityAI service to research UK AI jobs data. Ask for structured information about how AI is affecting technical jobs in the UK, specifying that you need job titles, companies, locations, salaries, descriptions, AI impact metrics, posting dates, and sources. Request the data in a tabular format suitable for database ingestion.

When crafting your query, use system instructions that request concise, technical responses with structured data formats. For the user content, specifically ask about the latest trends and data on AI's impact on technical jobs in the UK, emphasizing that you need complete job posting information with all required fields for your database.

The PerplexityAI service will return comprehensive, structured information about current AI-related technical jobs in the UK job market, which you can then parse and insert into your DuckDB database.
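
For example, the arguments you pass to the search tool might look roughly like the sketch below. The field names here are illustrative assumptions, not the Composio tool's exact parameter schema:

# Illustrative sketch only: parameter names may differ from the real tool schema.
query = {
    "systemContent": (
        "Be concise and technical. Return structured, tabular data "
        "suitable for database ingestion."
    ),
    "userContent": (
        "Latest trends and data on AI's impact on technical jobs in the UK. "
        "For each posting include: job_id, title, company, location, salary, "
        "description, ai_impact (0-1), date_posted, source."
    ),
}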

  3. Parse the returned data and insert it into the DuckDB database:
import duckdb
import json

# `perplexity_result` holds the raw JSON string returned by the MCP tool;
# `extract_jobs_from_response` is a parsing helper you define to match the
# shape of that response.
perplexity_response = json.loads(perplexity_result)
jobs_data = extract_jobs_from_response(perplexity_response)

# Connect to DuckDB and upsert each job record
con = duckdb.connect('db/uk_jobs.duckdb')
for job in jobs_data:
    con.execute('''
        INSERT OR REPLACE INTO jobs
        (job_id, title, company, location, salary, description, ai_impact, date_posted, source)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', (
        job["job_id"],
        job["title"],
        job["company"],
        job["location"],
        job["salary"],
        job["description"],
        job["ai_impact"],
        job["date_posted"],
        job["source"]
    ))
con.close()
  4. Alternatively, use the provided script for sample data:
python db/insert_ai_jobs_duckdb.py

This approach demonstrates how to use Roo Code's MCP capabilities to collect real-time data about AI's impact on the UK job market, directly feeding it into your analytics pipeline.

Running Analytics

To run basic analytics and visualizations on the jobs data:

python db/jobs_analytics_example.py

This script will:

  1. Load data from the DuckDB database
  2. Preprocess the data for machine learning
  3. Run simple clustering and trend analysis on AI impact
  4. Generate visualizations showing the results

The visualization will be saved as db/job_analytics_results.png.

Using the CLI Tool

The project includes a command-line tool for quickly reviewing and filtering job data:

# View all jobs
python scripts/review_uk_jobs.py

# Filter by job title
python scripts/review_uk_jobs.py --title Engineer

# Filter by company
python scripts/review_uk_jobs.py --company Google

Advanced Usage

Custom Analytics

You can create custom analytics by combining DuckDB's SQL capabilities with scikit-learn:

from db.init_duckdb import JobsDatabase
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Connect to the database
db = JobsDatabase()

# Query specific data
df = db.to_dataframe("SELECT * FROM job_postings WHERE ai_impact > 0.7")

# Create text features from job descriptions
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
features = tfidf.fit_transform(df['description'])

# Run clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(features)

# Analyze results
cluster_stats = df.groupby('cluster').agg({
    'ai_impact': ['mean', 'count'],
    'title': lambda x: ', '.join(set(x))[:100]
})

print(cluster_stats)
db.close()

Extending the Database

You can extend the database schema to include additional information:

# Connect to DuckDB
import duckdb
con = duckdb.connect('db/uk_jobs.duckdb')

# Add a new column
con.execute('''
    ALTER TABLE jobs ADD COLUMN remote_work BOOLEAN;
''')

# Update existing records
con.execute('''
    UPDATE jobs 
    SET remote_work = (location LIKE '%Remote%')
''')

con.close()

Integration with Other Systems

The DuckDB database can be integrated with other systems:

# Export to CSV
import duckdb
con = duckdb.connect('db/uk_jobs.duckdb')
con.execute('''
    COPY (SELECT * FROM jobs) TO 'exported_jobs.csv' (HEADER, DELIMITER ',');
''')
con.close()

# Export to a different database system
import duckdb
import pandas as pd
import sqlite3

# Extract from DuckDB
con_duck = duckdb.connect('db/uk_jobs.duckdb')
df = con_duck.execute("SELECT * FROM jobs").fetch_df()
con_duck.close()

# Load into SQLite
con_sqlite = sqlite3.connect('other_system.db')
df.to_sql('jobs', con_sqlite, if_exists='replace', index=False)
con_sqlite.close()

Advanced Implementation

The advanced/ directory contains an enhanced implementation of the GenAI-Superstream project with advanced features including PerplexityAI MCP integration, enhanced database schemas, and sophisticated analytics capabilities.

PerplexityAI MCP Integration

The advanced implementation includes integration with PerplexityAI through the Model Context Protocol (MCP), enabling real-time data gathering from AI-powered search:

# Initialize the advanced database
python advanced/main.py --init

# Gather data from PerplexityAI (prepares MCP query)
python advanced/main.py --gather

# Process PerplexityAI response
python advanced/main.py --response-file=response.json

The PerplexityAI integration allows you to:

  • Query for the latest UK AI job market trends
  • Extract structured job data from search results
  • Automatically parse and store job information
  • Track data sources and maintain quality metrics

Enhanced Schema and Analytics

The advanced implementation features:

  • Comprehensive Job Schema: Enhanced job postings table with detailed fields for salary ranges, remote work options, AI impact metrics, and skills tracking
  • Skills Analysis: Separate skills table for tracking required vs. preferred skills with categories
  • Company Tracking: Company information with AI focus levels
  • Historical Data: Job history tracking for trend analysis
  • Advanced Analytics: Clustering, skills importance analysis, salary trends, and visualization capabilities
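
As one possible shape for the skills tracking described above, a separate skills table might look like the sketch below (the real DDL lives in advanced/models/; the table and column names here are assumptions):

import duckdb

con = duckdb.connect('db/uk_jobs.duckdb')  # database path assumed
# One row per (job, skill) pair; `required` separates required from preferred skills
con.execute('''
    CREATE TABLE IF NOT EXISTS skills (
        job_id   VARCHAR,   -- references jobs.job_id
        skill    VARCHAR,   -- e.g. "PyTorch", "SQL"
        category VARCHAR,   -- e.g. "ML framework", "language"
        required BOOLEAN    -- TRUE = required, FALSE = preferred
    )
''')
con.close()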

Example analytics usage:

from advanced.analytics.metrics import calculate_ai_impact_distribution, perform_cluster_analysis
from advanced.db.queries import JobsDatabase

db = JobsDatabase()

# Get AI impact distribution
impact_df = calculate_ai_impact_distribution(db)

# Perform cluster analysis
df_with_clusters, cluster_stats = perform_cluster_analysis(db, n_clusters=4)

Command-Line Tools

The advanced implementation provides comprehensive CLI tools:

# Data gathering with flexible options
python advanced/data_gatherer.py --gather --role "Machine Learning Engineer" --location "London"

# View database statistics
python advanced/data_gatherer.py --stats

# Dry run mode to preview operations
python advanced/data_gatherer.py --gather --dry-run

See advanced/README.md for detailed documentation of the advanced features.

MCP Server

The project includes a complete MCP (Model Context Protocol) server implementation in the genai-mcp/ directory:

  • Tools: Functions for data analysis and job market queries
  • Resources: Access to job data and analytics results
  • Prompts: Templates for AI-powered job market analysis
  • Server-Sent Events: Real-time updates and monitoring
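
For orientation, a tool can be registered with the Python MCP SDK roughly as follows. This is a minimal sketch using the SDK's FastMCP interface, not the actual server code in genai-mcp/; the database path and table name are assumptions carried over from the basic implementation:

import duckdb
from mcp.server.fastmcp import FastMCP

# Name the server; MCP clients see this identifier
mcp = FastMCP("genai-jobs-demo")

@mcp.tool()
def count_high_impact_jobs(threshold: float = 0.7) -> int:
    """Count jobs whose AI-impact score exceeds the given threshold."""
    con = duckdb.connect('db/uk_jobs.duckdb')
    (count,) = con.execute(
        "SELECT COUNT(*) FROM jobs WHERE ai_impact > ?", [threshold]
    ).fetchone()
    con.close()
    return count

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default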

To run the MCP server:

cd genai-mcp
make setup
make dev

The MCP server enables seamless integration with AI assistants and other MCP-compatible applications. See genai-mcp/README.md for complete setup and usage instructions.

Project Structure

GenAI-Superstream/
├── advanced/                    # Advanced implementation with MCP integration
│   ├── analytics/              # Advanced analytics and visualizations
│   ├── db/                     # Database operations and queries
│   ├── models/                 # Data models and database schemas
│   ├── perplexity/             # PerplexityAI MCP integration
│   ├── tests/                  # Comprehensive test suite
│   ├── utils/                  # Utility functions and logging
│   ├── data_gatherer.py        # CLI tool for data gathering
│   ├── main.py                 # Main entry point
│   └── README.md               # Advanced implementation documentation
├── db/                         # Basic DuckDB implementation
│   ├── init_duckdb.py          # Core database class and utilities
│   ├── jobs_analytics_example.py # Example analytics and visualization
│   ├── insert_ai_jobs_duckdb.py # Script to insert representative data
│   ├── README.md               # Database documentation
│   └── uk_jobs.duckdb          # The DuckDB database file
├── genai-mcp/                  # MCP server implementation
│   ├── genai_mcp/              # Server source code
│   ├── sse_server.py           # Server-sent events server
│   └── README.md               # MCP server documentation
├── docs/                       # Project documentation
├── presentation/               # Presentation materials
├── scripts/                    # Utility scripts
└── README.md                   # This file

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Created by Reuven Cohen (rUv) - https://github.com/ruvnet/GenAI-Superstream
