Jupyter Notebook Best Practices
Best practices for creating reproducible, maintainable research notebooks. Follow these guidelines to make your notebooks easier to understand, share, and reproduce.
Use Case
Use these practices when you need to:
- Create reproducible research
- Share notebooks with collaborators
- Document experiments
- Present results
Docker Setup
Docker Run
```shell
# Run Jupyter Lab
docker run -d \
  --name jupyter \
  -p 8888:8888 \
  -v "$(pwd)":/home/jovyan/work \
  -e JUPYTER_ENABLE_LAB=yes \
  jupyter/scipy-notebook

# Get the login token from the container logs
docker logs jupyter
```
Docker Compose
```yaml
version: '3.8'

services:
  jupyter:
    image: jupyter/scipy-notebook:latest
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    environment:
      JUPYTER_TOKEN: "mytoken"
    restart: unless-stopped
```
Best Practices
1. Structure Your Notebook
```python
# Cell 1: Title and description
"""
# Experiment: Algorithm Performance Analysis
**Date:** 2024-12-12
**Author:** Your Name
**Goal:** Compare performance of algorithms A, B, and C
"""

# Cell 2: Imports (all at the top)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Cell 3: Configuration and constants
RANDOM_SEED = 42
DATA_PATH = "data/input.csv"
OUTPUT_PATH = "results/"

np.random.seed(RANDOM_SEED)

# Cell 4: Helper functions
def load_data(path):
    """Load and preprocess data."""
    return pd.read_csv(path)

# Cell 5+: Analysis sections with markdown headers
```
2. Use Markdown Cells Liberally
```markdown
## Data Loading

Load the dataset and perform initial exploration.

**Expected outcome:** Dataset with 1000 samples, 10 features
```
3. Magic Commands
```python
# Time an entire cell (%%time must be the first line of its cell)
%%time
result = expensive_computation()

# Profile memory usage
%load_ext memory_profiler
%memit large_array = np.zeros((10000, 10000))

# Reload edited modules automatically before running code
%load_ext autoreload
%autoreload 2

# Display every expression result in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Render matplotlib figures inline
%matplotlib inline
```
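The `%%time` and `%timeit` magics only exist inside IPython. In a plain Python script, the standard-library `timeit` module gives an equivalent measurement; a minimal sketch:

```python
import timeit

# Total seconds for `number` executions of the statement,
# comparable to what %timeit reports per loop.
elapsed = timeit.timeit("sum(range(1000))", number=1000)
per_run_us = elapsed / 1000 * 1e6  # average microseconds per run
print(f"{per_run_us:.1f} us per run")
```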
4. Version Control Integration
```shell
# Install nbstripout to strip output from committed notebooks
pip install nbstripout

# Register the git filter for the current repository
nbstripout --install

# Or clear outputs manually before committing
jupyter nbconvert --clear-output --inplace notebook.ipynb
```
5. Reproducibility Checklist
```python
# Cell 1: Environment info
import sys
import platform

print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")

# Cell 2: Set all random seeds
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# If using PyTorch
# import torch
# torch.manual_seed(SEED)

# If using TensorFlow
# import tensorflow as tf
# tf.random.set_seed(SEED)
```
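The seeding boilerplate above can be collected into one helper that seeds whichever frameworks happen to be installed. This is a sketch, not part of the original checklist; the optional `torch`/`tensorflow` branches are skipped when those packages are absent:

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the notebook uses in one place."""
    random.seed(seed)
    np.random.seed(seed)
    # Optional frameworks: seed them only if installed.
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_seed(42)
```

Calling `set_seed` at the top of every notebook keeps the reproducibility setup to a single line.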
Examples
Example 1: Experiment Template
1"""
2# Experiment: [Name]
3**Date:** YYYY-MM-DD
4**Hypothesis:** [What you're testing]
5**Expected Outcome:** [What you expect to find]
6"""
7
8# Imports
9import numpy as np
10import matplotlib.pyplot as plt
11
12# Configuration
13SEED = 42
14np.random.seed(SEED)
15
16# Load Data
17data = load_data()
18print(f"Data shape: {data.shape}")
19
20# Preprocessing
21processed_data = preprocess(data)
22
23# Analysis
24results = run_experiment(processed_data)
25
26# Visualization
27plt.figure(figsize=(10, 6))
28plt.plot(results)
29plt.title("Experiment Results")
30plt.show()
31
32# Conclusions
33"""
34## Results
35- Finding 1: [Description]
36- Finding 2: [Description]
37
38## Next Steps
39- [ ] Investigate edge case
40- [ ] Run with larger dataset
41"""
Example 2: Debugging Setup
```python
# Enable detailed tracebacks
%xmode Verbose

# Drop into the debugger automatically after an uncaught exception
%pdb on

# Set a breakpoint programmatically
from IPython.core.debugger import set_trace

def problematic_function(x):
    set_trace()  # Execution pauses here in the debugger
    result = x / 0  # Deliberate ZeroDivisionError for demonstration
    return result
```
Example 3: Progress Bars
```python
from tqdm.notebook import tqdm
import time

# For loops
for i in tqdm(range(100), desc="Processing"):
    time.sleep(0.01)

# For pandas operations
tqdm.pandas(desc="Applying function")
df['result'] = df['column'].progress_apply(lambda x: expensive_function(x))
```
Example 4: Interactive Widgets
```python
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np

def plot_function(frequency=1.0, amplitude=1.0):
    x = np.linspace(0, 10, 1000)
    y = amplitude * np.sin(frequency * x)
    plt.figure(figsize=(10, 4))
    plt.plot(x, y)
    plt.title(f"Sine Wave (f={frequency}, A={amplitude})")
    plt.show()

# Create interactive sliders as (min, max, step) tuples
widgets.interact(
    plot_function,
    frequency=(0.1, 5.0, 0.1),
    amplitude=(0.1, 2.0, 0.1)
)
```
Notes
- Keep notebooks focused - one experiment per notebook
- Use descriptive cell outputs (print statements, plots)
- Document assumptions and decisions
- Include negative results - they're valuable too
- Export final results to separate files
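Exporting final results to separate files can be as simple as serializing a summary dict with a timestamped filename so reruns never overwrite earlier runs. A sketch with a hypothetical `results` dict standing in for your experiment's output:

```python
import json
from datetime import datetime
from pathlib import Path

# Hypothetical experiment output; replace with your own results.
results = {"accuracy": 0.91, "n_samples": 1000}

out_dir = Path("results")
out_dir.mkdir(exist_ok=True)

# Timestamped filename so each run produces a distinct artifact.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
out_path = out_dir / f"experiment-{stamp}.json"
out_path.write_text(json.dumps(results, indent=2))
print(f"Saved {out_path}")
```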
Gotchas/Warnings
- ⚠️ Cell order: Notebooks can be run out of order - test by "Restart & Run All"
- ⚠️ Hidden state: Variables persist between cells - can cause confusion
- ⚠️ Large outputs: Clear output of cells with large data to reduce file size
- ⚠️ Git conflicts: Notebook JSON is hard to merge - use nbdime or strip outputs
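The "Restart & Run All" check can be automated from the command line: `jupyter nbconvert --execute` runs the notebook top to bottom in a fresh kernel and fails on the first errored cell, surfacing hidden-state and out-of-order bugs in CI.

```shell
# Re-execute the notebook in a clean kernel, writing outputs back in place.
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb
```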