Jupyter Notebook Best Practices

Best practices for creating reproducible, maintainable research notebooks. Follow these guidelines to make your notebooks easier to understand, share, and reproduce.

Use Case

Use these practices when you need to:

  • Create reproducible research
  • Share notebooks with collaborators
  • Document experiments
  • Present results

Docker Setup

Docker Run

# Run Jupyter Lab
docker run -d \
  --name jupyter \
  -p 8888:8888 \
  -v "$(pwd)":/home/jovyan/work \
  -e JUPYTER_ENABLE_LAB=yes \
  jupyter/scipy-notebook

# Get the login token from the container logs
docker logs jupyter

Docker Compose

version: '3.8'

services:
  jupyter:
    image: jupyter/scipy-notebook:latest
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    environment:
      JUPYTER_TOKEN: "mytoken"
    restart: unless-stopped

Best Practices

1. Structure Your Notebook

# Cell 1: Title and description (ideally a Markdown cell)
"""
# Experiment: Algorithm Performance Analysis
**Date:** 2024-12-12
**Author:** Your Name
**Goal:** Compare performance of algorithms A, B, and C
"""

# Cell 2: Imports (all at the top)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Cell 3: Configuration and constants
RANDOM_SEED = 42
DATA_PATH = "data/input.csv"
OUTPUT_PATH = "results/"

np.random.seed(RANDOM_SEED)

# Cell 4: Helper functions
def load_data(path):
    """Load and preprocess data."""
    return pd.read_csv(path)

# Cell 5+: Analysis sections with markdown headers

2. Use Markdown Cells Liberally

## Data Loading

Load the dataset and perform initial exploration.

**Expected outcome:** Dataset with 1000 samples, 10 features

3. Magic Commands

# Time cell execution (%%time must be the first line of its cell)
%%time
result = expensive_computation()

# Profile memory usage
%load_ext memory_profiler
%memit large_array = np.zeros((10000, 10000))

# Reload edited modules automatically
%load_ext autoreload
%autoreload 2

# Display all outputs of a cell (not just the last expression)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Render matplotlib plots inline
%matplotlib inline
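
The `%%time` magic works only inside IPython. In a plain Python script, the same measurement can be sketched with the standard library (the summation below is just a stand-in for `expensive_computation()`):

```python
import time

start = time.perf_counter()
# Stand-in for an expensive computation
total = sum(i * i for i in range(1_000_000))
elapsed = time.perf_counter() - start

print(f"Wall time: {elapsed:.3f}s")
```

`time.perf_counter()` is preferred over `time.time()` for measuring intervals because it is monotonic and has the highest available resolution.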

4. Version Control Integration

# Install nbstripout to remove outputs from commits
pip install nbstripout

# Set up the git filter for this repository
nbstripout --install

# Or strip outputs manually before committing
jupyter nbconvert --clear-output --inplace notebook.ipynb

5. Reproducibility Checklist

# Cell 1: Environment info
import sys
import platform

import numpy as np
import pandas as pd

print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")

# Cell 2: Set all random seeds
import random

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# If using PyTorch
# import torch
# torch.manual_seed(SEED)

# If using TensorFlow
# import tensorflow as tf
# tf.random.set_seed(SEED)
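
The environment cell above can also be captured as a reusable helper — a minimal sketch assuming `importlib.metadata` (Python 3.8+); the function name and default package list are illustrative. It records versions without failing when a package is absent:

```python
import sys
import platform
from importlib import metadata

def environment_snapshot(packages=("numpy", "pandas")):
    """Collect interpreter and package versions for the notebook header.

    Missing packages are recorded as None rather than raising, so the
    cell still runs in a stripped-down environment.
    """
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = None
    return snapshot

print(environment_snapshot())
```

Returning a dict (instead of only printing) makes it easy to save the snapshot alongside experiment results.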

Examples

Example 1: Experiment Template

"""
# Experiment: [Name]
**Date:** YYYY-MM-DD
**Hypothesis:** [What you're testing]
**Expected Outcome:** [What you expect to find]
"""

# Imports
import numpy as np
import matplotlib.pyplot as plt

# Configuration
SEED = 42
np.random.seed(SEED)

# Load Data
data = load_data()
print(f"Data shape: {data.shape}")

# Preprocessing
processed_data = preprocess(data)

# Analysis
results = run_experiment(processed_data)

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(results)
plt.title("Experiment Results")
plt.show()

# Conclusions
"""
## Results
- Finding 1: [Description]
- Finding 2: [Description]

## Next Steps
- [ ] Investigate edge case
- [ ] Run with larger dataset
"""

Example 2: Debugging Setup

# Enable detailed tracebacks
%xmode Verbose

# Drop into the debugger automatically on uncaught exceptions
%pdb on

# Interactive breakpoint (the built-in breakpoint() also works in Python 3.7+)
from IPython.core.debugger import set_trace

def problematic_function(x):
    set_trace()  # execution pauses here
    result = x / 0  # deliberate ZeroDivisionError to inspect
    return result
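
Outside IPython, the standard library supports the same post-mortem workflow. A minimal sketch (the `risky_division` helper is hypothetical):

```python
import traceback

def risky_division(x):
    """Hypothetical helper that fails for x == 0."""
    return 1 / x

try:
    risky_division(0)
except ZeroDivisionError:
    traceback.print_exc()
    # import pdb; pdb.post_mortem()  # uncomment to step through the failing frame
```

`pdb.post_mortem()` opens the debugger at the frame where the exception was raised, which is often faster than rerunning with a breakpoint.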

Example 3: Progress Bars

from tqdm.notebook import tqdm
import time

# For loops
for i in tqdm(range(100), desc="Processing"):
    time.sleep(0.01)

# For pandas operations (df and expensive_function are placeholders)
tqdm.pandas(desc="Applying function")
df['result'] = df['column'].progress_apply(expensive_function)

Example 4: Interactive Widgets

import ipywidgets as widgets
import numpy as np
import matplotlib.pyplot as plt

def plot_function(frequency=1.0, amplitude=1.0):
    x = np.linspace(0, 10, 1000)
    y = amplitude * np.sin(frequency * x)
    plt.figure(figsize=(10, 4))
    plt.plot(x, y)
    plt.title(f"Sine Wave (f={frequency}, A={amplitude})")
    plt.show()

# Create interactive sliders for both parameters
widgets.interact(
    plot_function,
    frequency=(0.1, 5.0, 0.1),
    amplitude=(0.1, 2.0, 0.1)
)

Notes

  • Keep notebooks focused - one experiment per notebook
  • Use descriptive cell outputs (print statements, plots)
  • Document assumptions and decisions
  • Include negative results - they're valuable too
  • Export final results to separate files
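
The last note — exporting final results to separate files — can be sketched like this (the `results` dict and output path are illustrative):

```python
import json
from pathlib import Path

results = {"accuracy": 0.93, "seed": 42}  # hypothetical experiment output

# Write results next to the notebook so they survive output stripping
out_dir = Path("results")
out_dir.mkdir(exist_ok=True)
with open(out_dir / "experiment_1.json", "w") as f:
    json.dump(results, f, indent=2)
```

Keeping results in plain files (JSON, CSV) means collaborators can read them without opening the notebook, and they diff cleanly in git.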

Gotchas/Warnings

  • ⚠️ Cell order: Cells can be executed out of order - verify with "Restart & Run All" before sharing
  • ⚠️ Hidden state: Variables persist even after their defining cell is edited or deleted - restart the kernel when in doubt
  • ⚠️ Large outputs: Clear outputs of cells with large data to keep the .ipynb file small
  • ⚠️ Git conflicts: Notebook JSON is hard to merge - use nbdime or strip outputs before committing