The notes provide an introduction to predictive modeling, focusing on statistical models and practical applications using R. The course emphasizes understanding the intuition behind different predictive methods and applying them to real datasets. Predictive modeling is introduced as building mathematical models to predict a variable of interest Y from predictor variables X1, …, Xp; examples include predicting house prices or failure probabilities. The general modeling framework is Y = m(X1, …, Xp) + ε, where m is the unknown regression function and ε is the random error.

Several key considerations are highlighted:
1. Prediction accuracy vs. interpretability: models can be complex and accurate but difficult to interpret (e.g., black-box models), or simpler and interpretable with potentially lower accuracy.
2. Model correctness vs. usefulness: a model may be theoretically correct but not practically useful, or vice versa.
3. Flexibility vs. simplicity: there is a trade-off between underfitting and overfitting, addressed using training/testing splits and the bias–variance trade-off.

The document covers:
• Linear Models (Simple & Multiple): foundational tools for modeling linear relationships, parameter estimation by least squares, and interpretation of coefficients.
• Model Selection & Diagnostics: how to choose predictors, handle categorical variables, capture nonlinearities, and check assumptions.
• Advanced Linear Techniques: shrinkage methods (e.g., ridge, lasso), constrained models, multivariate responses, and considerations for big data.
• Generalized Linear Models (GLMs): extending linear models to non-normal response variables, including logistic regression, deviance, and model selection.
• Nonparametric Regression: flexible methods such as kernel regression and density estimation for situations where the functional form is unknown.
Practical aspects include using R and RStudio, running reproducible code snippets, and working with provided datasets such as the Boston housing prices and the Challenger disaster data. Appendices review hypothesis testing, estimation techniques, multinomial logistic regression, and handling missing data. Link: https://lnkd.in/eNBedNnB #statistics #predictivemodeling #r
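The workflow the notes describe — fit a model by least squares, then judge accuracy on held-out data — can be sketched in a few lines of R. Everything below uses simulated data, not the course datasets, so the numbers are illustrative only:

```r
# Minimal sketch of Y = m(X) + eps with a linear m, fit by least squares
# and evaluated on a held-out test set (simulated data, not course data).
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)      # true m(x) = 2 + 0.5 x
dat <- data.frame(x = x, y = y)

train_idx <- sample(n, size = 0.7 * n)   # 70/30 train-test split
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit  <- lm(y ~ x, data = train)          # least-squares estimate of m
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))    # out-of-sample accuracy
coef(fit)                                # estimates should be near (2, 0.5)
```

The same split-then-evaluate pattern is what guards against the overfitting side of the flexibility-vs-simplicity trade-off the notes discuss.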
Statistical Modeling in Engineering
Summary
Statistical modeling in engineering involves using mathematical techniques to predict, explain, and manage uncertainty in real-world systems, supporting decision-making from product design to risk assessment. These models help engineers interpret data, quantify risks, and forecast outcomes, making complex problems easier to understand and solve.
- Focus on uncertainty: Recognize that many engineering problems, like predicting floods or product failures, rely on probability rather than fixed values, so always account for randomness in your analysis.
- Choose models wisely: Select statistical models based on your specific data quality, problem needs, and interpretability requirements—sometimes simpler models provide more reliable insights than complex ones.
- Validate and test: Use methods like cross-validation and diagnostic checks to ensure your models perform well on new data, reducing the risks of overfitting or misinterpretation.
🌧️ Why Design Storms and Flood Discharge Must Be Treated as Random Variables

Rainfall and floods are not deterministic quantities. They are random processes governed by probability. Yet in applied design, particularly in data-scarce regions, probabilistic information is often reduced to a single "design" value, i.e., one rainfall depth, one peak discharge, and one representative storm, and then treated deterministically throughout the modelling process. The issue is not the use of return periods. The issue is the loss of stochastic context.

🧠 A Stochastic Hydrology Perspective

From a stochastic standpoint, both extreme rainfall and flood discharges are realisations of underlying probability distributions, not fixed truths. Every design value carries uncertainty, arising from:
• Limited record length
• Sampling variability of extremes
• Choice of probability distribution
• Extrapolation beyond observed data

Once this uncertainty is ignored, model outputs may appear precise while being statistically fragile.

📉 Why This Matters Even More in Data-Scarce Regions

Where rainfall or flow records are short or incomplete:
• Extremes are poorly observed
• Parameter uncertainty is high
• Return-period estimates are highly sensitive to the choice of distribution

In such contexts, fitting appropriate probability distributions to rainfall extremes and flood discharges is not optional; it is essential. Distribution fitting allows engineers to:
✔ Quantify uncertainty explicitly
✔ Compare alternative statistical models (e.g., GEV, Log-Pearson, Gamma)
✔ Understand the sensitivity of design values to data limitations
✔ Support risk-informed, rather than purely deterministic, decisions

📊 Linking Probability, Hydrology, and Geospatial Intelligence

Geospatial intelligence helps us understand where floods occur, through terrain, land use, drainage networks, and flow pathways. Stochastic analysis helps us understand how often and how severe those floods may be.

When combined:
• Geospatial intelligence defines spatial exposure and system behaviour
• Probability distributions define likelihood and risk
Together, they form the foundation of defensible flood risk assessment.

📌 Bottom Line

Uncertainty is not a weakness in flood hydrology; it is information. Design storms and flood discharges should be estimated, tested, and interpreted probabilistically, especially where data are limited and the consequences of under-design are high.

💬 What frequency analysis or uncertainty methods do you apply in your work? Let's discuss.
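As a concrete illustration of the distribution-fitting step the post calls essential, here is a base-R sketch: a method-of-moments Gumbel fit to a simulated 30-year annual-maximum series, with a bootstrap that exposes how uncertain a 100-year return level is on a short record. The discharge values and the choice of Gumbel are assumptions for illustration, not a recommendation over GEV or Log-Pearson:

```r
# Simulated annual maxima from a Gumbel(mu = 150, beta = 40) distribution,
# generated by inverse-CDF sampling (illustrative values, not gauge data).
set.seed(42)
ann_max <- 150 - 40 * log(-log(runif(30)))

gumbel_fit <- function(x) {
  beta <- sd(x) * sqrt(6) / pi          # moment estimator of the scale
  mu   <- mean(x) - 0.5772 * beta       # Euler-Mascheroni correction for location
  c(mu = mu, beta = beta)
}
return_level <- function(p, T) {
  # Gumbel quantile at non-exceedance probability 1 - 1/T
  unname(p["mu"] - p["beta"] * log(-log(1 - 1 / T)))
}

pars <- gumbel_fit(ann_max)
q100 <- return_level(pars, 100)

# Bootstrap the 100-year estimate to expose sampling uncertainty
boot <- replicate(1000, return_level(gumbel_fit(sample(ann_max, replace = TRUE)), 100))
quantile(boot, c(0.05, 0.95))           # a wide interval: short records are fragile
```

The width of the bootstrap interval is exactly the "uncertainty is information" point: the design value is an estimate, not a fixed truth.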
Have you wondered why #ML-based models often remain a proof of concept in structural engineering and are not deployed the way they are in computer science? In this paper with Brennan Bean (Math/Stat, Utah State University), Henry V Burton (SE, UCLA), and M. Z. Naser (SE, AI Institute, Clemson University), we looked at why these models can fail in deployment, with particular interest in #generalizability and #explainability issues. We believe this paper highlights some of the pitfalls of building ML, where a "seemingly" accurate ML model can have little deployment value. A few thoughts from the paper:

(1) We showed that we can learn more about model generalizability when we use a "stratified" cross-validation instead of a random cross-validation (see the statistical methodology by Dr. Bean's group). In our study, simpler linear regression models showed better generalizability than advanced ML models for the intended task.

(2) You may have heard of #overfitting and #underfitting, but have you heard of "#underspecification" (see the work by Alexander D'Amour et al.)? This is where an ML model explains the same output using different sets of features with similar accuracy. We showed how this phenomenon affects model explainability and generalizability and results in erroneous interpretations.

(3) We also revisited "omission bias" and showed that if you miss a "physically" important feature, you often end up "overcompensating" with many marginally "irrelevant" features. While these features may score high on feature importance, they do not help with model prediction, and you lose a lot of generalizability.

This has been a challenging paper to write (judging by two years of review and production), as we intended to invite the community to scrutinize how ML models are built. We hope this paper opens up a conversation on the state of the art of these approaches. I'm working on getting the data and code uploaded very soon. Stay tuned! https://lnkd.in/g7yDh2ag
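The stratified-versus-random cross-validation contrast in point (1) can be sketched in base R. The grouping variable, data, and fold scheme below are hypothetical; the point is the mechanics of holding out whole groups rather than random rows, which tests whether a model transfers to an unseen data source:

```r
# Hypothetical data from five "sites"; grouped CV holds out one whole site
# per fold, while random CV mixes sites across folds.
set.seed(7)
df <- data.frame(
  group = rep(paste0("site", 1:5), each = 20),
  x = rnorm(100)
)
df$y <- 1.5 * df$x + rnorm(100)

K <- 5
random_folds  <- sample(rep(1:K, length.out = nrow(df)))  # ignores grouping
grouped_folds <- as.integer(factor(df$group))             # one site per fold

cv_rmse <- function(folds) {
  errs <- sapply(1:K, function(k) {
    fit <- lm(y ~ x, data = df[folds != k, ])             # train off-fold
    mean((df$y[folds == k] - predict(fit, df[folds == k, ]))^2)
  })
  sqrt(mean(errs))
}
cv_results <- c(random = cv_rmse(random_folds), grouped = cv_rmse(grouped_folds))
cv_results
```

With real data that differ systematically by source, the grouped estimate is typically more pessimistic, and more honest about deployment performance.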
Today I want to highlight a powerful piece of research from Rebellion Research that goes beyond headlines and examines one of the most discussed engineering failures in modern history through the lens of data. A Statistical Analysis of the NASA Challenger Accident takes the O-ring failure data from the Space Shuttle Challenger and applies rigorous statistical methods to quantify risk and reveal what the data was signaling long before the tragedy occurred. This analysis shows how incomplete data selection and intuition alone can lead to grave errors. By applying permutation tests and logistic regression to historical O-ring performance across temperatures, the authors demonstrate that low temperatures were strongly associated with higher failure probabilities. In fact, the model indicates a near-certain failure risk at the actual launch temperature, a stark numerical contrast to the assumptions made at the time. The piece is not just about aerospace history. It is a reminder of how statistical thinking and rigorous analysis matter in all technical fields, including finance, machine learning, risk management, and decision making under uncertainty. Understanding and correctly interpreting data can mean the difference between sound choices and disastrous outcomes. That's the kind of thinking we strive to bring to every project at Rebellion Research. If you are interested in how data can illuminate risk, human decision processes, and organizational behavior in high-stakes contexts, this is a must-read. It is part of our broader mission to make research accessible, thoughtful, and relevant across disciplines. Read it here: https://lnkd.in/e7fpcREY
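The kind of logistic fit the analysis applies can be sketched in R. The flight-level counts below are simulated from an assumed failure mechanism, not the actual O-ring records (those are in the linked article), so the numbers are illustrative only:

```r
# Simulated O-ring distress counts (out of 6 per flight) over a range of
# launch temperatures, with an assumed temperature-driven failure mechanism.
set.seed(3)
temp   <- seq(53, 81, by = 2)                 # launch temperatures (deg F)
p_true <- plogis(15 - 0.25 * temp)            # assumed mechanism, illustrative
distress <- rbinom(length(temp), size = 6, prob = p_true)

# Logistic regression of distress probability on temperature
lr_fit <- glm(cbind(distress, 6 - distress) ~ temp, family = binomial)

# Extrapolated failure probability at the cold launch temperature vs. a warm day
p31 <- unname(predict(lr_fit, data.frame(temp = 31), type = "response"))
p81 <- unname(predict(lr_fit, data.frame(temp = 81), type = "response"))
c(cold_31F = p31, warm_81F = p81)
```

Note that 31°F sits well below the observed temperature range, so the prediction is an extrapolation, which is precisely why the uncertainty framing in the article matters.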
Accelerated Life Testing (ALT) is a critical methodology that enables engineers to swiftly predict product performance over time by exposing products to stress conditions that exceed typical field experience. The technique helps identify failure modes and quantify life characteristics within a compressed timeframe. By using tools such as the Arrhenius and Inverse Power models to analyze ALT results, it becomes feasible to forecast product durability in actual operational environments. These predictions are powerful but depend on appropriate model selection and reasonable assumptions about real-world stresses. The capability to rapidly benchmark and enhance reliability through ALT, supported by safety margins rooted in statistical confidence, has become a fundamental part of modern reliability engineering.
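For a concrete sense of how an Arrhenius model translates accelerated test hours into field hours, here is a minimal sketch. The activation energy and temperatures are assumed example values, not from any specific product:

```r
# Arrhenius acceleration factor: AF = exp((Ea/k) * (1/T_use - 1/T_stress)),
# with temperatures in kelvin. All inputs below are illustrative assumptions.
k_B <- 8.617e-5            # Boltzmann constant, eV/K
Ea  <- 0.7                 # assumed activation energy, eV
T_use    <- 273.15 + 40    # field temperature: 40 C in kelvin
T_stress <- 273.15 + 85    # accelerated test temperature: 85 C in kelvin

AF <- exp((Ea / k_B) * (1 / T_use - 1 / T_stress))  # acceleration factor (~26 here)
field_hours <- 1000 * AF   # 1000 h at stress maps to this many field hours
round(c(AF = AF, field_hours = field_hours))
```

The sensitivity of AF to the assumed activation energy is one reason the post stresses statistical confidence bounds rather than point estimates.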