Category: Data Analysis

Data analysis projects and tutorials using R, Python, and statistical methods

  • SPC Charts in R for Semiconductor Process Monitoring

    In semiconductor manufacturing, maintaining consistent process performance across wafers and across batches is critical for device performance. One of the most powerful tools for monitoring this consistency is the Statistical Process Control (SPC) chart. In this tutorial, we walk through how to apply SPC charts in R to simulated thin film deposition data, covering X-bar and R charts, control limit calculations, and practical interpretation for process engineers.

    If you are new to R for data analysis, you may also find my earlier post on spreadsheets as a common programming tool a useful foundation before diving into statistical methods.

    Why SPC Matters in Process Monitoring

    Semiconductor manufacturing processes require tight control over critical parameters to ensure consistent device performance, yield, and reliability. Across processes such as deposition, etch, lithography, and thermal treatments, even small variations in factors like temperature, pressure, gas flow, or power can lead to measurable shifts in material properties and device characteristics. Statistical Process Control (SPC) provides a real-time method for detecting process drift early, enabling engineers to identify and correct deviations before they result in out-of-specification wafers or reduced manufacturing yield.

    Key SPC concepts for process monitoring include:

    • X-bar chart: Tracks the average value of a critical process parameter (such as film thickness, critical dimension, or electrical performance) across subgroups, such as wafers or lots, over time to monitor process stability.
    • R chart (range chart): Tracks the variability within each subgroup, helping identify changes in process consistency, uniformity, or equipment performance.
    • Control limits: Statistically derived boundaries (typically ±3 standard deviations from the process mean) that define the expected range of normal process variation, distinct from engineering specification limits.
    • Run rules: Additional statistical tests, such as multiple consecutive points above or below the center line, used to detect subtle trends, shifts, or non-random patterns before they become significant process issues.
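
    To make the control limit concept concrete, the classical X-bar and R chart limits for subgroups of 5 can be computed by hand in base R. The table constants (A2 = 0.577, D3 = 0, D4 = 2.114 for n = 5) are standard SPC values; the tiny dataset below is invented purely for illustration:

    ```r
    # Control limit calculation for subgroups of n = 5 wafers
    # Standard SPC table constants for n = 5
    A2 <- 0.577; D3 <- 0; D4 <- 2.114

    # Example data: 4 batches of 5 thickness readings (angstroms)
    subgroups <- matrix(c(1998, 2003, 1995, 2001, 2000,
                          2005, 1997, 2002, 1999, 2004,
                          1996, 2001, 2003, 1998, 2000,
                          2002, 1999, 2004, 1997, 2001),
                        nrow = 4, byrow = TRUE)

    xbar_bar <- mean(rowMeans(subgroups))  # grand mean (center line)
    r_bar <- mean(apply(subgroups, 1, function(x) diff(range(x))))  # average range

    # X-bar chart limits
    ucl_x <- xbar_bar + A2 * r_bar
    lcl_x <- xbar_bar - A2 * r_bar

    # R chart limits
    ucl_r <- D4 * r_bar
    lcl_r <- D3 * r_bar
    ```

    The qcc package performs these calculations automatically, but seeing them once by hand makes the chart output easier to trust.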

    Generating Synthetic Thin Film Thickness Data

    For this walkthrough, we simulate thickness measurements from a hypothetical LPCVD silicon nitride process. This is a synthetic dataset created for illustrative purposes: we model 25 subgroups (batches) with 5 wafers each, a target thickness of approximately 2000 angstroms, and an intentional drift introduced in the final batches.

    The synthetic data uses a mean of 2000 angstroms with a standard deviation of 15 angstroms, with a +5 angstrom per batch drift added in the last 5 batches. This mimics a real-world scenario where a chamber component such as a heating element begins to degrade, causing a gradual upward shift in deposited thickness.

    Because this is simulated data, we make no claims about a specific crystallographic orientation or substrate material. The process is modeled simply as a generic nitride thin film deposition, with a mean thickness and variability typical of LPCVD films.

    Building X-bar and R Charts in R

    The R code below uses the qcc package, one of the most widely used libraries for SPC analysis. If you do not have it installed, run install.packages("qcc") first.

    
    # Load package
    library(qcc)
    
    # Generate synthetic thickness data (angstroms)
    set.seed(42)
    n_batches <- 25
    n_wafers <- 5
    thickness <- matrix(nrow = n_batches, ncol = n_wafers)
    
    for (i in 1:n_batches) {
      drift <- ifelse(i > 20, (i - 20) * 5, 0)
      thickness[i, ] <- round(rnorm(n_wafers, mean = 2000 + drift, sd = 15), 1)
    }
    
    # Create qcc objects
    batch_thickness <- as.data.frame(thickness)
    
    # X-bar chart
    xbar_chart <- qcc(batch_thickness, type = "xbar",
                      title = "X-bar Chart: LPCVD Film Thickness",
                      xlab = "Batch Number", ylab = "Mean Thickness (A)")
    
    # R chart
    r_chart <- qcc(batch_thickness, type = "R",
                   title = "R Chart: LPCVD Film Thickness Range",
                   xlab = "Batch Number", ylab = "Range (A)")
    

    Interpreting the Control Charts

    When you run this code, the X-bar chart shows all points within the upper and lower control limits for the first 20 batches. This indicates a stable, in-control process, exactly what a process engineer wants to see during routine production.


    Beginning around batch 21, the mean thickness will begin to climb above the center line. By batch 23, one or more points may fall above the upper control limit (UCL), signaling that the process has shifted. The R chart should remain stable throughout, indicating that the within-batch uniformity (wafer-to-wafer variation) has not changed — the problem is a shift in the mean, not an increase in variability.
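
    To see the shift numerically rather than only on the chart, you can recompute the batch means in base R. The snippet re-runs the same simulation as above (same seed) so that it stands alone:

    ```r
    # Recreate the simulated thickness data and compare the stable period
    # (batches 1-20) with the drift period (batches 21-25)
    set.seed(42)
    n_batches <- 25
    n_wafers <- 5
    thickness <- matrix(nrow = n_batches, ncol = n_wafers)
    for (i in 1:n_batches) {
      drift <- ifelse(i > 20, (i - 20) * 5, 0)
      thickness[i, ] <- round(rnorm(n_wafers, mean = 2000 + drift, sd = 15), 1)
    }
    batch_means <- rowMeans(thickness)

    mean(batch_means[1:20])   # stable baseline, close to the 2000 A target
    mean(batch_means[21:25])  # drifted batches sit visibly higher
    ```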

    Applying Run Rules for Earlier Detection

    Standard control limits alone may not detect gradual drifts quickly enough. Run rules add sensitivity:

    • Rule 1: One point beyond the 3-sigma control limit.
    • Rule 2: Seven consecutive points on the same side of the center line.
    • Rule 3: Two out of three consecutive points beyond the 2-sigma warning limit.

    In our simulated data, the 7-point run rule (Rule 2) would flag the drift as early as batch 22, before any individual measurement exceeds the control limits. This is the practical value of SPC: catching problems before they produce out-of-spec material.

    The qcc package flags these violations automatically: points beyond the control limits and violating runs are highlighted directly on the chart. The run length used for the run test defaults to 7 and can be adjusted with qcc.options("run.length").
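
    For readers who want to see the mechanics, here is a minimal base-R sketch of the 7-point run rule; detect_run is a hypothetical helper name (qcc performs an equivalent check automatically):

    ```r
    # Return the index of the point that completes the first run of
    # `run_length` consecutive points on one side of the center line,
    # or NA if no such run exists
    detect_run <- function(means, center, run_length = 7) {
      side <- sign(means - center)  # +1 above center, -1 below
      runs <- rle(side)
      ends <- cumsum(runs$lengths)
      hit <- which(runs$lengths >= run_length & runs$values != 0)
      if (length(hit) == 0) return(NA_integer_)
      ends[hit[1]] - runs$lengths[hit[1]] + run_length
    }

    # Example: a noisy start, then a sustained upward shift
    means <- c(10.05, 9.8, 10.1, 9.9, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9)
    detect_run(means, center = 10)  # -> 11, the point completing the run
    ```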

    Practical Applications for Process Engineers

    SPC charts are not limited to thickness. They can be applied to any measurable property:

    • Refractive index from ellipsometry measurements across a wafer batch.
    • Film stress from wafer curvature measurements before and after deposition.
    • Sheet resistance for conductive thin films measured by four-point probe.
    • Uniformity calculated as (max – min) / (2 * mean) within a wafer.
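
    As a small worked example, the uniformity metric from the last bullet can be wrapped in a one-line helper. The function name and sample readings are invented for illustration; the result is often reported as a percentage:

    ```r
    # Within-wafer uniformity: (max - min) / (2 * mean), expressed in percent
    # `readings` is a vector of site measurements across one wafer
    uniformity <- function(readings) {
      (max(readings) - min(readings)) / (2 * mean(readings)) * 100
    }

    # Five thickness readings across one wafer (angstroms)
    uniformity(c(1990, 2005, 2000, 1995, 2010))  # -> 0.5 (percent)
    ```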

    The same R code structure shown above works for any of these variables. Simply replace the thickness data with your measurement values and the control chart logic remains identical.

    Beyond Basic SPC: What Comes Next

    Once you have SPC charts running, the next step is often process capability analysis (Cpk), which compares the natural process variation to the specification limits. A Cpk value below 1.33 typically indicates that the process needs improvement. We will cover capability analysis in a future post.
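
    As a brief preview, Cpk takes only a few lines of base R. The specification limits and measurement values below are hypothetical, chosen only to show the arithmetic:

    ```r
    # Process capability index: Cpk = min(USL - mean, mean - LSL) / (3 * sd)
    cpk <- function(x, lsl, usl) {
      min(usl - mean(x), mean(x) - lsl) / (3 * sd(x))
    }

    # A centered process measured against hypothetical 2000 +/- 100 A specs
    x <- c(2000, 2015, 1985, 2010, 1990, 2005, 1995, 2000)
    cpk(x, lsl = 1900, usl = 2100)  # -> about 3.33, a highly capable process
    ```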

    For readers interested in comparing multiple analytical approaches, our post on comparing logistic regression, SVMs, and random forests for classification demonstrates the kind of rigorous model comparison that complements SPC methodology in a broader data science toolkit.

    Conclusion

    SPC charts provide a straightforward, statistically grounded method for monitoring thin film deposition processes in real time. With just a few lines of R code using the qcc package, process engineers can detect drifts in mean thickness, identify changes in wafer-to-wafer variability, and trigger preventative maintenance before scrap material is produced. This blog uses a synthetic dataset to illustrate the workflow, but the same approach applies directly to real production data.

    The combination of X-bar charts, R charts, and run rules gives process engineers a practical early warning system. In future posts, we will extend this framework to multivariate SPC (Hotelling’s T-squared) and explore how control charts integrate with broader statistical methods such as Design of Experiments (DOE), which we will cover in a subsequent blog.

  • Predicting Diabetes: Comparing Logistic Regression, SVMs, and Random Forests

    While exploring Kaggle’s vast data science resources, I discovered an intriguing diabetes dataset and decided to develop a predictive model. The dataset structure is elegantly simple, featuring 8 independent variables and a target variable called Outcome, which identifies the presence or absence of diabetes. My objective is to create a robust model that can accurately predict diabetes based on these variables.

    Dataset Overview

    This comprehensive medical dataset contains diagnostic measurements specifically collected for diabetes prediction based on various health indicators. It encompasses 768 female patient records, with each record containing 8 distinct health parameters. The Outcome variable serves as the binary classifier, indicating diabetes presence (1) or absence (0). This dataset serves as an excellent resource for training and evaluating machine learning classification models in the context of diabetes prediction.

    • Pregnancies (Integer): Total pregnancy count for each patient.
    • Glucose (Integer): Plasma glucose concentration two hours into an oral glucose tolerance test (mg/dL).
    • BloodPressure (Integer): Measured diastolic blood pressure (mm Hg).
    • SkinThickness (Integer): Measured triceps skin fold thickness (mm).
    • Insulin (Integer): Measured 2-hour serum insulin levels (mu U/ml).
    • BMI (Float): Calculated body mass index using weight(kg)/height(m)^2.
    • DiabetesPedigreeFunction (Float): Calculated genetic diabetes predisposition score based on family history.
    • Age (Integer): Patient’s age in years.
    • Outcome (Binary): Target variable indicating diabetes (1) or no diabetes (0).

    This valuable dataset, adapted from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), is frequently utilized in data science research focusing on healthcare analytics and medical diagnostics.

    Exploratory Data Analysis

    Initial data examination reveals a significant concern: multiple variables including Blood Pressure, Glucose, Skin Thickness, Insulin, and BMI contain zero values. These likely represent measurement errors, necessitating their removal for accurate analysis.

    # Packages used throughout this analysis
    library(dplyr)     # data filtering
    library(MASS)      # boxcox()
    library(caret)     # preProcess()
    library(corrplot)  # correlation plot
    library(caTools)   # sample.split()
    library(e1071)     # svm(), tune()

    diabetes <- read.csv('diabetes_dataset.csv') # Read data
    summary(diabetes)

    After removing the rows containing zero values, the dataset shrinks to 392 observations. Though smaller, this sample size remains adequate for developing a reliable predictive model.

    diabetes <- diabetes %>%
      filter(BloodPressure != 0) %>%
      filter(Glucose != 0) %>%
      filter(SkinThickness != 0) %>%
      filter(Insulin != 0) %>%
      filter(BMI != 0)
    hist(diabetes$Age)
    hist(diabetes$DiabetesPedigreeFunction)
    hist(diabetes$BMI)
    hist(diabetes$Insulin)
    hist(diabetes$SkinThickness)
    hist(diabetes$BloodPressure)
    hist(diabetes$Glucose)
    hist(diabetes$Pregnancies)

    Next, we examine the distribution of each independent variable.

    The histograms show that most variables are approximately normally distributed, while pregnancies, age, and insulin are clearly skewed. The skewed distribution of the pregnancy counts aligns with logical expectations.
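
    A Shapiro-Wilk test can back up the visual check. This was not part of the original analysis, and the snippet below uses simulated stand-ins for two of the predictors rather than the actual dataset:

    ```r
    # Shapiro-Wilk normality check (base R); small p-values flag non-normality
    set.seed(1)
    normal_like <- rnorm(392, mean = 70, sd = 12)  # stand-in for blood pressure
    skewed_like <- rexp(392, rate = 0.01)          # stand-in for right-skewed insulin

    shapiro.test(normal_like)$p.value  # large: consistent with normality
    shapiro.test(skewed_like)$p.value  # tiny: clearly non-normal
    ```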

    Data Preprocessing

    We implement these specific transformations:

    • Insulin: Box-Cox transform
    • Age: Box-Cox transform
    • Pregnancies: square root transform
    #Preprocess data: transform the skewed variables toward normality
    
    #Box-Cox transform on Age (lambda of about -1.4 read from the boxcox() plot)
    boxcox(diabetes$Age ~ 1)
    diabetes$boxcoxAge <- (diabetes$Age^-1.4 - 1)/-1.4
    hist(diabetes$boxcoxAge)
    
    #Box-Cox transform on Insulin (lambda of about 0.05)
    boxcox(diabetes$Insulin ~ 1)
    diabetes$boxcoxInsulin <- (diabetes$Insulin^0.05 - 1)/0.05
    hist(diabetes$boxcoxInsulin)
    
    #Square root transform on Pregnancies
    diabetes$sqrtPregnancies <- sqrt(diabetes$Pregnancies)
    hist(diabetes$sqrtPregnancies)

    Finally, we apply min-max scaling to normalize all values between 0 and 1, so that no variable dominates the models simply because of its larger numeric range.

    #Storing relevant variables in a new dataframe and scaling the data
    
    diabetes.clean <- diabetes %>%
      dplyr::select(
        Outcome,
        DiabetesPedigreeFunction,
        sqrtPregnancies,
        SkinThickness,
        boxcoxInsulin,
        boxcoxAge,
        BMI,
        Glucose,
        BloodPressure
      )
      
    
    preproc <- preProcess(diabetes.clean, method = "range")
    scaled.diabetes.clean <- predict(preproc, diabetes.clean)
    
    head(scaled.diabetes.clean)
    str(scaled.diabetes.clean)
    scaled.diabetes.clean$Outcome <- as.factor(scaled.diabetes.clean$Outcome)
    

    Looking for correlations in the variables

    The correlation matrix reveals no strong correlations among the variables, so we can reasonably treat them as independent predictors in our modeling approach.

    # Looking for Correlations within the Data
    
    num.cols <- sapply(scaled.diabetes.clean, is.numeric)
    cor.data <- cor(scaled.diabetes.clean[,num.cols])
    cor.data
    corrplot(cor.data, method = 'color')

    Splitting the data into training and testing set

    For model development and evaluation, we implement a standard data partitioning strategy, allocating 70% of the observations to the training dataset and reserving the remaining 30% for testing purposes.

    #Splitting data into train and test sets (seed set for reproducibility)
    
    set.seed(101)
    sample <- sample.split(scaled.diabetes.clean$Outcome, SplitRatio = 0.7)
    train <- subset(scaled.diabetes.clean, sample == TRUE)
    test <- subset(scaled.diabetes.clean, sample == FALSE)

    Building a Model

    Our predictive modeling approach incorporates three distinct machine learning techniques: Logistic Regression, Support Vector Machines, and Random Forests.

    Logistic Regression

    The logistic regression implementation yields a respectable accuracy of 77.12%. The model identifies Age, BMI, and Glucose as significant predictors of diabetes, with the diabetes pedigree function showing moderate influence. This suggests that while genetic predisposition plays a role, lifestyle factors remain crucial in diabetes prevention.

    Support Vector Machines

    Despite parameter tuning efforts, the Support Vector Machines algorithm demonstrates slightly lower performance, achieving 74.9% accuracy compared to the logistic regression model.

    Random Forest

    Random Forests emerge as the superior performer among the three approaches, delivering the highest accuracy at 79.56%.

    A notable observation across all models is that Type I errors (false positives) are considerably less common than Type II errors (false negatives). In this medical context, false negatives pose the greater risk, making this imbalance particularly relevant when judging the models as screening tools.
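
    The trade-off between the two error types can be made explicit by computing sensitivity and specificity from a confusion matrix. The counts below are hypothetical, not taken from the fitted models:

    ```r
    # Sensitivity and specificity from a 2x2 confusion matrix
    # Hypothetical counts for a diabetes screening model
    tp <- 25; fn <- 15  # diabetic patients: correctly / incorrectly classified
    tn <- 66; fp <- 12  # non-diabetic patients: correctly / incorrectly classified

    sensitivity <- tp / (tp + fn)  # share of true diabetics caught by the model
    specificity <- tn / (tn + fp)  # share of healthy patients correctly cleared

    sensitivity  # -> 0.625
    specificity  # -> about 0.846
    ```

    For a screening tool where false negatives are the costlier mistake, a sensitivity this low would argue for lowering the classification threshold below 0.5.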

    Comparing the Models

    A comparative analysis of model performance metrics reveals Random Forests as the top performer, though there remains room for improvement. It’s worth noting that the necessity to exclude numerous observations due to measurement inconsistencies may have impacted model performance. While this model shows promise as a preliminary diabetes screening tool with reasonable accuracy, developing a more precise predictive model would require additional data points and refined measurements.

    Code

    ### LOGISTIC REGRESSION ###
    
    log.model <- glm(formula = Outcome ~ ., family = binomial(link = 'logit'), data = train)
    summary(log.model)
    fitted.probabilities <- predict(log.model, newdata = test, type = 'response')
    fitted.results <- ifelse(fitted.probabilities > 0.5, 1, 0)
    misClassError <- mean(fitted.results != test$Outcome)
    print(paste('Accuracy', 1 - misClassError))
    table(test$Outcome, fitted.probabilities > 0.5)
    
    ### SVM ####
    
    svm.model <- svm(Outcome ~., data = train)
    summary(svm.model)
    predicted.svm.Outcome <- predict(svm.model, test)
    table(predicted.svm.Outcome, test[,1])
    tune.results <- tune(svm, 
                         train.x = train[2:9], 
                         train.y = train[,1], 
                         kernel = 'radial',
                         ranges = list(cost=c(1.25, 1.5, 1.75), gamma = c(0.25, 0.3, 0.35)))
    summary(tune.results)
    tuned.svm.model <- svm(Outcome ~., 
                           data = train, 
                           kernel = "radial",
                           cost = 1.25,
                           gamma = 0.25,
                           probability = TRUE)
    summary(tuned.svm.model)
    print(tuned.svm.model)
    tuned.predicted.svm.Outcome <- predict(tuned.svm.model, test)
    table(tuned.predicted.svm.Outcome, test[,1])
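
    The random forest model discussed above is missing from the original listing; a minimal sketch using the randomForest package on the same train/test split (default hyperparameters assumed) might look like this:

    ```r
    ### RANDOM FOREST (sketch; assumes `train` and `test` from the split above) ###

    library(randomForest)

    rf.model <- randomForest(Outcome ~ ., data = train, ntree = 500, importance = TRUE)
    print(rf.model)

    predicted.rf.Outcome <- predict(rf.model, test)
    table(predicted.rf.Outcome, test$Outcome)
    mean(predicted.rf.Outcome == test$Outcome)  # test-set accuracy

    varImpPlot(rf.model)  # which predictors drive the forest's splits
    ```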