Professional Python project: exploratory data analysis with Jupyter notebooks.
Two languages:
- This file is written in Markdown, a simple markup language for presenting text.
- Our analytics logic is written in Python, a scripting language for implementing logic.
When we first encounter a new and unknown data set, we want to explore: run some quick checks, view the distributions, see if the data is clean (or if there are many missing values or outliers).
This task is commonly called Exploratory Data Analysis (EDA). For EDA, it is useful to combine presentation and code. For this, we use Jupyter notebooks.
Notebooks combine Markdown cells for section headings and narrative with Python Code cells for calculations and charts.
After running the example script and notebook files, you'll create a similar notebook to explore a different tabular data file (a dataset with rows and columns).
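The quick checks mentioned above can be sketched in a few lines of pandas. The tiny DataFrame below is a made-up stand-in for whatever dataset you actually load; the column names are invented for illustration:

```python
import pandas as pd

# Hypothetical mini dataset, standing in for the tabular file you explore.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", None],
    "petal_length": [1.4, 1.3, 5.1, 5.9],
})

print(df.shape)                      # (rows, columns)
print(df.dtypes)                     # column data types
print(df.isna().sum())               # missing values per column
print(df.describe())                 # summary statistics for numeric columns
print(df["species"].value_counts())  # distribution of a categorical column
```

These few calls answer the first EDA questions (How big is it? What types are the columns? Is anything missing?) before you invest in charts.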
Follow the detailed instructions at: 01. Set Up Your Machine
- Get Repository: Sign in to GitHub, open this repository in your browser, and click Copy this template to get a copy in YOURACCOUNT.
- Configure Repository Settings:
  - Select your repository Settings (the gear icon on the far right).
  - Go to Pages tab / Enable GitHub Pages / Build and deployment / set Source to GitHub Actions.
  - Go to Advanced Security tab / Dependabot / Dependabot security updates / Enable.
  - Go to Advanced Security tab / Dependabot / Grouped security updates / Enable.
- Clone to local: Open a terminal in your Repos folder and clone your new repo:

  ```shell
  git clone https://github.com/YOURACCOUNT/datafun-04-notebooks
  ```

- Open project in VS Code: Change directory into the repo and open the project in VS Code by running `code .` ("code dot"):

  ```shell
  cd datafun-04-notebooks
  code .
  ```

- Install recommended extensions: When VS Code opens, accept the Extension Recommendations (click "Install All" or similar when asked).
- Set up a project Python environment (managed by `uv`) and align VS Code with it:
  - Use VS Code menu option Terminal / New Terminal to open a VS Code terminal in the root project folder.
  - Run the following commands, one at a time, hitting ENTER after each:

    ```shell
    uv self update
    uv python pin 3.14
    uv sync --extra dev --extra docs --upgrade
    ```

  - If asked: "We noticed a new environment has been created. Do you want to select it for the workspace folder?" Click "Yes".
  - If successful, you'll see a new .venv folder appear in the root project folder.
Optional (recommended): install and run pre-commit checks (repeat the git add and commit twice if needed):

```shell
uvx pre-commit install
git add -A
uvx pre-commit run --all-files
git add -A
uvx pre-commit run --all-files
```

For more detailed instructions and troubleshooting, see the pro guide at: 02. Set Up Your Project
🛑 Do not continue until all REQUIRED steps are complete and verified.
Follow the detailed instructions at: 03. Daily Workflow
Commands are provided below to:
- Git pull
- Run and check the Python files
- Build and serve docs
- Save progress with Git add-commit-push
- Update project files
VS Code should have only this project (datafun-04-notebooks) open.
Use VS Code menu option Terminal / New Terminal and run the following commands:
```shell
git pull
```

In this project, notebooks are the primary analysis artifact, but scripts can be used to mirror the core logic.
In the same VS Code terminal, run the example Python source files as modules (preferred):

```shell
uv run python -m datafun_04_notebooks.app_case
```

If a command fails, verify:

- Only this project is open in VS Code.
- The terminal is open in the project root folder.
- The `uv sync --extra dev --extra docs --upgrade` command completed successfully.
Run Python checks and tests (as available):

```shell
uv run ruff format .
uv run ruff check . --fix
uv run pytest --cov=src --cov-report=term-missing
```

Build and serve docs (hit CTRL+C in the VS Code terminal to quit serving):

```shell
uv run mkdocs build --strict
uv run mkdocs serve
```

While editing project code and docs, repeat the commands above to run files, check them, and rebuild docs as needed.
Save progress frequently (some tools may make changes; you may need to re-run git add and commit to ensure everything gets committed before pushing):
```shell
git add -A
git commit -m "update"
git push -u origin main
```

Additional details and troubleshooting are available in the Pro-Analytics-02 Documentation.
Open mkdocs.yaml in VS Code.
This file configures the project documentation website (powered by MkDocs).
Use CTRL+F to find each occurrence of the source GitHub account (e.g. denisecase).
Change each occurrence to point to your GitHub account instead (spacing and capitalization MUST match the URL of your GitHub account exactly).
- Read the code file in src/.
- Run the code file in src/ following the instructions in this README.
- Confirm that a project.log was generated in the root project folder.
- Git add, commit, push to GitHub.
- Verify your project.log file is visible in GitHub.
In VS Code, with this project open, navigate to the notebooks/ folder.
Open eda_case.ipynb.
Follow the instructions to:
- Select the notebook kernel.
- Run All.
- Git add, commit, push to GitHub.
- Verify the executed notebook is visible in GitHub.
If there are any errors, try to figure out how to address them. After getting a good example notebook, git add-commit-push to GitHub. Verify the example notebook is presented as you like.
Now apply what you learned. Create a new notebook and perform EDA on a different dataset.
Recommended Option 1: Use a Seaborn Built-in Dataset
Seaborn includes several datasets. To see the list:
```python
import seaborn as sns

print(sns.get_dataset_names())
```

Good choices for practice:

- `iris` - flower measurements (150 rows, 5 columns)
- `tips` - restaurant tipping data (244 rows, 7 columns)
- `diamonds` - diamond prices and attributes (53940 rows, 10 columns)
- `mpg` - car fuel efficiency (398 rows, 9 columns)
- `titanic` - passenger survival data (891 rows, 15 columns)

Load any of these with: `df = sns.load_dataset('dataset_name')`
Alternatively, Option 2: Choose Your Own Tabular Dataset
Put your dataset in data/raw/ as a CSV file. Use pathlib Path objects to build the path to your CSV file.
Load a CSV file with: `df = pd.read_csv('path_to_your_file.csv')`
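A minimal sketch of the pathlib-plus-pandas pattern: the temporary file and its contents are invented here so the example is self-contained; in your project you would build a path into data/raw/ instead:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Self-contained sketch: create a tiny CSV first so the example runs anywhere.
# In the project, build the path like: Path("data") / "raw" / "yourfile.csv"
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "sample.csv"
csv_path.write_text("name,score\nAda,95\nGrace,90\n")

df = pd.read_csv(csv_path)  # read_csv accepts Path objects as well as strings
print(df.head())
```

Building paths with `/` on a `Path` object keeps the code portable across Windows, macOS, and Linux separators.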
Follow the example, and ensure you have:
- Title and header (author, purpose, date, dataset info with source/citation)
- Numbered sections that match the example.
- Good narrative showing your observations and insights as you work through the process.
- You do not need to add to or modify tests/. They are provided for example only.
- You do not need to view or modify any of the supporting config files.
- Many of the repo files are silent helpers. Explore as you like, but nothing is required.
- You do NOT need to understand everything. Understanding builds naturally over time.
- Use the UP ARROW and DOWN ARROW in the terminal to scroll through past commands.
- Use CTRL+F to find (and replace) text within a file.
If you see something like this in your terminal: >>> or ...
You accidentally started Python interactive mode.
It happens.
Type exit() and press Enter, or press Ctrl+D (on Windows: Ctrl+Z then Enter), to return to the normal terminal prompt.
- Pro-Analytics-02 - guide to professional Python
- ANNOTATIONS.md - REQ/WHY/OBS annotations used
- INSTRUCTORS.md - guidance and notes for instructors and maintainers
- POLICIES.md - project rules and expectations that apply to all contributors
- SE_MANIFEST.toml - project intent, scope, and role
- CITATION.cff - TODO: update author and repository fields to reflect your creative work