By Richie Phillips
Indoor air quality is a growing concern, especially in workplaces, healthcare settings, and homes, and plants offer a natural, sustainable way to address it. Inspired by NASA’s Clean Air Study and follow-up horticultural research, this project investigates how effectively different houseplants absorb volatile organic compounds (VOCs), such as benzene, trichloroethylene (TCE), and formaldehyde, from indoor air.
The primary objective was to analyze and rank plants based on their air-purifying potential, measured through two main performance metrics: Total VOCs Removed and VOC Efficiency (μg/cm²). This project integrates a range of skills including SQL querying, statistical testing, Python-based visualization, and data cleaning. The goal was not only to highlight top-performing plant species but also to uncover patterns that could guide future data-informed decisions in sustainable indoor design.
The analysis began by sourcing data from multiple CSV files, each representing VOC removal for a specific chemical compound. These datasets had inconsistent column names, embedded units, and encoding issues, especially involving Unicode characters like "μg" and superscripts. To resolve these problems, I renamed columns, trimmed whitespace, and stripped stray characters using Python (Pandas).
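A minimal sketch of what such a header-cleaning step could look like is shown here. The helper name `clean_column_name`, the sample headers, and the choice to drop the parenthesized unit entirely are assumptions made for illustration, not the project's actual code.

```python
import re
import unicodedata

import pandas as pd


def clean_column_name(name: str) -> str:
    """Normalize one raw header: fold Unicode variants, drop units, trim whitespace."""
    # NFKC folds the micro sign (U+00B5) into Greek mu (U+03BC) and
    # converts superscript digits such as "²" into plain digits.
    name = unicodedata.normalize("NFKC", name)
    # Drop a trailing parenthesized unit such as "(μg)" or "(cm2)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    # Collapse internal runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", name).strip()


# Hypothetical raw headers of the kind described above (placeholder values).
raw = pd.DataFrame(
    [["Plant A", 100.0, 50.0]],
    columns=[
        "  Common Name ",
        "Leaf Surface Area (cm\u00b2)",
        "Total VOC Removed  (\u00b5g)",
    ],
)

# DataFrame.rename accepts a function and applies it to every column label.
clean = raw.rename(columns=clean_column_name)
print(list(clean.columns))
```

Applying one function to every table keeps the headers identical across the benzene, TCE, and formaldehyde files, which is what makes a later merge straightforward. If the units should stay in the header, the last step could map the cleaned names back to canonical labels instead.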
Each table—benzene, TCE, and formaldehyde—was standardized with common fields:
Common Name
Scientific Name
Leaf Surface Area (cm²)
Total VOC Removed (μg)
These tables were merged into a comprehensive plants dataset. I also calculated a new column, VOC Efficiency, defined as total VOC removed divided by leaf surface area, rounded to two decimal places. This provided a normalized performance metric for comparing plants of different sizes.
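The merge and the efficiency calculation can be sketched as follows. The values are placeholders rather than study data, and the sketch assumes each plant reports the same leaf surface area in every table and that the total is the sum of the three per-compound amounts. Neither assumption is confirmed by the text above.

```python
import pandas as pd

# Placeholder per-compound tables (illustrative numbers, not the study's data).
benzene = pd.DataFrame({
    "Common Name": ["Plant A", "Plant B"],
    "Leaf Surface Area": [1000.0, 4000.0],
    "Total VOC Removed": [5000.0, 8000.0],
})
tce = pd.DataFrame({
    "Common Name": ["Plant A", "Plant B"],
    "Leaf Surface Area": [1000.0, 4000.0],
    "Total VOC Removed": [3000.0, 6000.0],
})
formaldehyde = pd.DataFrame({
    "Common Name": ["Plant A", "Plant B"],
    "Leaf Surface Area": [1000.0, 4000.0],
    "Total VOC Removed": [2000.0, 6000.0],
})


def tag(df: pd.DataFrame, compound: str) -> pd.DataFrame:
    """Rename the removal column so each compound keeps its own column after merging."""
    return df.rename(columns={"Total VOC Removed": f"{compound} Removed"})


# Assumption: plant name plus leaf area identify a plant in all three tables.
keys = ["Common Name", "Leaf Surface Area"]
plants = (
    tag(benzene, "Benzene")
    .merge(tag(tce, "TCE"), on=keys)
    .merge(tag(formaldehyde, "Formaldehyde"), on=keys)
)

compound_cols = ["Benzene Removed", "TCE Removed", "Formaldehyde Removed"]
plants["Total VOCs Removed"] = plants[compound_cols].sum(axis=1)

# VOC Efficiency = total removed / leaf surface area, in μg/cm², rounded to 2 places.
plants["VOC Efficiency"] = (
    plants["Total VOCs Removed"] / plants["Leaf Surface Area"]
).round(2)

print(plants[["Common Name", "Total VOCs Removed", "VOC Efficiency"]])
```

The default inner merge drops any plant that is missing from one of the tables. Passing `how="outer"` would keep those plants, at the cost of missing values in the per-compound columns.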
To highlight standout species, I created visualizations using Python’s Seaborn and Matplotlib libraries. These included:
Top 10 by Total VOCs Removed: Displayed plants with the highest raw removal scores.
Top 10 by VOC Efficiency: Spotlighted species that removed more VOCs per cm² of leaf area.
Bar Charts by Chemical Type: Individual rankings for benzene, TCE, and formaldehyde removal showed that some plants were consistent performers (e.g., Gerbera Daisy), while others varied by pollutant.
These plots helped contextualize which plants performed best overall and which might be targeted for specific VOC types in real-world applications.
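As a rough illustration of how such rankings can be produced, the sketch below selects the top entries for a metric and draws a horizontal bar chart. It uses plain Matplotlib to stay small, whereas the project used Seaborn as well. The `top_n` helper and the data are hypothetical placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder metrics (illustrative numbers, not the study's data).
plants = pd.DataFrame({
    "Common Name": ["Plant A", "Plant B", "Plant C", "Plant D"],
    "Total VOCs Removed": [10000.0, 20000.0, 15000.0, 4000.0],
    "VOC Efficiency": [10.0, 5.0, 7.5, 8.0],
})


def top_n(df: pd.DataFrame, metric: str, n: int = 10) -> pd.DataFrame:
    """Return the n rows with the largest values of `metric`, largest first."""
    return df.nlargest(n, metric)


top_total = top_n(plants, "Total VOCs Removed", n=3)
top_eff = top_n(plants, "VOC Efficiency", n=3)

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(top_eff["Common Name"], top_eff["VOC Efficiency"])
ax.invert_yaxis()  # put the best performer at the top of the chart
ax.set_xlabel("VOC Efficiency (μg/cm²)")
ax.set_title("Top plants by VOC Efficiency")
fig.tight_layout()
```

Calling the same helper with a per-compound column would give the rankings by chemical type. In this toy data the two metrics rank the plants differently, which is the kind of contrast the two top-10 charts are meant to expose.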
To further investigate trends, I applied multiple statistical tools:
Correlation Heatmap: Revealed strong correlations between Total VOCs Removed and individual compound removals. Interestingly, VOC Efficiency and leaf surface area were less tightly correlated, suggesting that leaf size alone does not explain per-area performance and that compound-level analysis is needed.
Boxplots and Violin Plots: Compared VOC efficiency by leaf size (Small vs. Large). These visualizations suggested that smaller-leaved plants might have greater VOC efficiency, though variability was present.
Two-Sample t-Test: A t-test compared the means of VOC efficiency across the leaf size groups. At a 5% significance level, the result was not statistically significant, indicating that while visual trends existed, there wasn’t enough evidence to confirm a performance difference by leaf size alone.
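The two-sample comparison can be sketched with SciPy as below. The numbers are placeholders. Splitting the groups at the median leaf area and using Welch's variant of the test (`equal_var=False`) are assumptions made for this example, since the text does not say how the Small and Large groups were defined or which variant was run.

```python
import pandas as pd
from scipy import stats

# Placeholder measurements (illustrative numbers, not the study's data).
plants = pd.DataFrame({
    "Leaf Surface Area": [500.0, 800.0, 1200.0, 1500.0, 3000.0, 4000.0, 5000.0, 7000.0],
    "VOC Efficiency": [9.0, 4.0, 7.5, 5.5, 6.0, 3.5, 8.0, 5.0],
})

# Assumption: plants above the median leaf area are "Large", the rest "Small".
median_area = plants["Leaf Surface Area"].median()
plants["Leaf Size"] = (
    plants["Leaf Surface Area"].gt(median_area).map({True: "Large", False: "Small"})
)

small = plants.loc[plants["Leaf Size"] == "Small", "VOC Efficiency"]
large = plants.loc[plants["Leaf Size"] == "Large", "VOC Efficiency"]

# Welch's t-test does not assume the two groups share the same variance.
result = stats.ttest_ind(small, large, equal_var=False)

alpha = 0.05  # 5% significance level
significant = result.pvalue < alpha
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}, significant: {significant}")
```

With groups this small, a difference in means has to be large relative to the within-group spread before the p-value falls below 0.05. That is consistent with a visible trend in the plots that still fails to reach significance.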
For scalability and SQL practice, I imported the cleaned dataset into a PostgreSQL database. This allowed me to:
Write efficient SQL queries to retrieve top performers by Total VOCs Removed and VOC Efficiency.
Experiment with joins by incorporating supplementary soil and bacteria tables.
Clean and reinsert records where formatting issues (e.g., missing pH columns or non-matching plant names) disrupted joins.
Though the bacterial and soil datasets had limitations (incomplete or mismatched entries), I successfully created a prototype JOIN to pair performance metrics with environmental conditions, such as moisture, pH, and bacterial counts.
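The join prototype might look roughly like the sketch below. To keep it self-contained it runs against an in-memory SQLite database as a stand-in for PostgreSQL. The table names, column names, and rows are hypothetical. Normalizing names with `LOWER(TRIM(...))` is one possible way to handle non-matching plant names, not necessarily the approach used in the project.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants (
        common_name       TEXT PRIMARY KEY,
        total_voc_removed REAL,
        voc_efficiency    REAL
    );
    CREATE TABLE soil (
        plant_name TEXT,
        ph         REAL,
        moisture   REAL
    );
    -- Placeholder rows (illustrative values, not the study's data).
    INSERT INTO plants VALUES
        ('Plant A', 10000, 10.0),
        ('Plant B', 20000, 5.0),
        ('Plant C', 15000, 7.5);
    -- Names differ in case and padding; Plant C has no soil record.
    INSERT INTO soil VALUES
        ('  plant a ', 6.5, 0.40),
        ('PLANT B', 7.0, 0.55);
    """
)

QUERY = """
SELECT p.common_name, p.voc_efficiency, s.ph, s.moisture
FROM plants AS p
LEFT JOIN soil AS s
       ON LOWER(TRIM(s.plant_name)) = LOWER(TRIM(p.common_name))
ORDER BY p.voc_efficiency DESC
LIMIT 10;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
conn.close()
```

A `LEFT JOIN` keeps every plant in the ranking even when the soil table has no matching entry, so incomplete environmental data shows up as NULLs instead of silently removing top performers.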
Gerbera Daisy was the standout performer across nearly all metrics: highest total removal, highest VOC efficiency, and consistently high rankings for each compound.
Janet Craig, Bamboo Palm, and Peace Lily were also strong contenders.
VOC efficiency varied widely across species and not always in direct proportion to leaf surface area, which highlights the importance of normalized metrics.
Statistical analysis suggested no significant difference in VOC efficiency based solely on leaf size, though visual trends hinted at smaller leaves potentially offering better per-area performance.
To improve clarity for a broad audience, a glossary of terms was added, including definitions for:
VOC (Volatile Organic Compound)
Efficiency (μg/cm²)
Leaf Surface Area
TCE, Benzene, and Formaldehyde
Boxplot, Violin Plot, T-Test
Correlation Matrix
This ensures the portfolio remains accessible to both technical and non-technical viewers.
This project represents a full data analytics lifecycle—from raw file ingestion and data cleaning to advanced SQL queries and statistical evaluation. It combines environmental science with data science to generate actionable insights. Whether applied to smart building design, sustainable architecture, or biophilic planning, this kind of analysis empowers evidence-based decisions using clean, well-structured data.
By highlighting the top-performing plant species, this project also advocates for the integration of nature into indoor spaces—not only for aesthetics but for tangible health benefits. Future work could explore time-based VOC measurements, cost-efficiency comparisons, or machine learning approaches for predictive classification of plant performance.
Created and authored by Richie Phillips | Data Analytics Portfolio
Data source: NASA Clean Air Study and public VOC removal datasets