Jupyter Notebook Best Practices
Best practices for creating reproducible, maintainable research notebooks. Follow these guidelines to make your notebooks easier to understand, share, and reproduce.
Use Case
Use these practices when you need to:
- Create reproducible research
- Share notebooks with collaborators
- Document experiments
- Present results
Docker Setup
Docker Run
```shell
# Run Jupyter Lab
docker run -d \
  --name jupyter \
  -p 8888:8888 \
  -v "$(pwd)":/home/jovyan/work \
  -e JUPYTER_ENABLE_LAB=yes \
  jupyter/scipy-notebook

# Get the login token from the container logs
docker logs jupyter
```
Docker Compose
```yaml
version: '3.8'

services:
  jupyter:
    image: jupyter/scipy-notebook:latest
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    environment:
      JUPYTER_TOKEN: "mytoken"
    restart: unless-stopped
```
Best Practices
1. Structure Your Notebook
```python
# Cell 1: Title and description
"""
# Experiment: Algorithm Performance Analysis
**Date:** 2024-12-12
**Author:** Your Name
**Goal:** Compare performance of algorithms A, B, and C
"""

# Cell 2: Imports (all at the top)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Cell 3: Configuration and constants
RANDOM_SEED = 42
DATA_PATH = "data/input.csv"
OUTPUT_PATH = "results/"

np.random.seed(RANDOM_SEED)

# Cell 4: Helper functions
def load_data(path):
    """Load and preprocess data."""
    return pd.read_csv(path)

# Cell 5+: Analysis sections with markdown headers
```
2. Use Markdown Cells Liberally
```markdown
## Data Loading

Load the dataset and perform initial exploration.

**Expected outcome:** Dataset with 1000 samples, 10 features
```
3. Magic Commands
```python
# Time an entire cell (%%time must be the first line of its cell)
%%time
result = expensive_computation()

# Profile memory usage
%load_ext memory_profiler
%memit large_array = np.zeros((10000, 10000))

# Reload edited modules automatically before running code
%load_ext autoreload
%autoreload 2

# Display every expression result in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Render matplotlib figures inline
%matplotlib inline
```
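The `%%time` and `%timeit` magics only exist inside IPython. In a plain Python script, the standard-library `timeit` module gives an equivalent measurement; a minimal sketch:

```python
import timeit

# Total seconds for `number` executions of the statement,
# comparable to what %timeit reports per loop.
elapsed = timeit.timeit("sum(range(1000))", number=1000)
per_run_us = elapsed / 1000 * 1e6  # average microseconds per run
print(f"{per_run_us:.1f} us per run")
```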
4. Version Control Integration
```shell
# Install nbstripout to strip output from committed notebooks
pip install nbstripout

# Register the git filter for the current repository
nbstripout --install

# Or clear outputs manually before committing
jupyter nbconvert --clear-output --inplace notebook.ipynb
```
5. Reproducibility Checklist
```python
# Cell 1: Environment info
import sys
import platform

print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")

# Cell 2: Set all random seeds
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# If using PyTorch
# import torch
# torch.manual_seed(SEED)

# If using TensorFlow
# import tensorflow as tf
# tf.random.set_seed(SEED)
```
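The seeding boilerplate above can be collected into one helper that seeds whichever frameworks happen to be installed. This is a sketch, not part of the original checklist; the optional `torch`/`tensorflow` branches are skipped when those packages are absent:

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the notebook uses in one place."""
    random.seed(seed)
    np.random.seed(seed)
    # Optional frameworks: seed them only if installed.
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_seed(42)
```

Calling `set_seed` at the top of every notebook keeps the reproducibility setup to a single line.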
Examples
Example 1: Experiment Template
1"""
2# Experiment: [Name]
3**Date:** YYYY-MM-DD
4**Hypothesis:** [What you're testing]
5**Expected Outcome:** [What you expect to find]
6"""
7
8# Imports
9import numpy as np
10import matplotlib.pyplot as plt
11
12# Configuration
13SEED = 42
14np.random.seed(SEED)
15
16# Load Data
17data = load_data()
18print(f"Data shape: {data.shape}")
19
20# Preprocessing
21processed_data = preprocess(data)
22
23# Analysis
24results = run_experiment(processed_data)
25
26# Visualization
27plt.figure(figsize=(10, 6))
28plt.plot(results)
29plt.title("Experiment Results")
30plt.show()
31
32# Conclusions
33"""
34## Results
35- Finding 1: [Description]
36- Finding 2: [Description]
37
38## Next Steps
39- [ ] Investigate edge case
40- [ ] Run with larger dataset
41"""
Example 2: Debugging Setup
```python
# Enable detailed tracebacks
%xmode Verbose

# Drop into the debugger automatically after an uncaught exception
%pdb on

# Set a breakpoint programmatically
from IPython.core.debugger import set_trace

def problematic_function(x):
    set_trace()  # Execution pauses here in the debugger
    result = x / 0  # Deliberate ZeroDivisionError for demonstration
    return result
```
Example 3: Progress Bars
```python
from tqdm.notebook import tqdm
import time

# For loops
for i in tqdm(range(100), desc="Processing"):
    time.sleep(0.01)

# For pandas operations
tqdm.pandas(desc="Applying function")
df['result'] = df['column'].progress_apply(lambda x: expensive_function(x))
```
Example 4: Interactive Widgets
```python
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np

def plot_function(frequency=1.0, amplitude=1.0):
    x = np.linspace(0, 10, 1000)
    y = amplitude * np.sin(frequency * x)
    plt.figure(figsize=(10, 4))
    plt.plot(x, y)
    plt.title(f"Sine Wave (f={frequency}, A={amplitude})")
    plt.show()

# Create interactive sliders as (min, max, step) tuples
widgets.interact(
    plot_function,
    frequency=(0.1, 5.0, 0.1),
    amplitude=(0.1, 2.0, 0.1)
)
```
Notes
- Keep notebooks focused - one experiment per notebook
- Use descriptive cell outputs (print statements, plots)
- Document assumptions and decisions
- Include negative results - they're valuable too
- Export final results to separate files
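Exporting final results to separate files can be as simple as serializing a summary dict with a timestamped filename so reruns never overwrite earlier runs. A sketch with a hypothetical `results` dict standing in for your experiment's output:

```python
import json
from datetime import datetime
from pathlib import Path

# Hypothetical experiment output; replace with your own results.
results = {"accuracy": 0.91, "n_samples": 1000}

out_dir = Path("results")
out_dir.mkdir(exist_ok=True)

# Timestamped filename so each run produces a distinct artifact.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
out_path = out_dir / f"experiment-{stamp}.json"
out_path.write_text(json.dumps(results, indent=2))
print(f"Saved {out_path}")
```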
Gotchas/Warnings
- ⚠️ Cell order: Notebooks can be run out of order - test by "Restart & Run All"
- ⚠️ Hidden state: Variables persist between cells - can cause confusion
- ⚠️ Large outputs: Clear output of cells with large data to reduce file size
- ⚠️ Git conflicts: Notebook JSON is hard to merge - use nbdime or strip outputs
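The "Restart & Run All" check can be automated from the command line: `jupyter nbconvert --execute` runs the notebook top to bottom in a fresh kernel and fails on the first errored cell, surfacing hidden-state and out-of-order bugs in CI.

```shell
# Re-execute the notebook in a clean kernel, writing outputs back in place.
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb
```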