Data Scientist & ML Engineer
Redefining what's possible with data — one model, one insight, one decision at a time.
I'm Omkar Shashank Pathare — a Data Scientist and ML Engineer pursuing my Master's in Data Science
at Stevens Institute of Technology. I engineer scalable ML solutions across finance, healthcare, and
supply chain — from fine-tuning LLMs with QLoRA to production-grade Azure MLOps pipelines and
recommendation engines that genuinely move the needle.
My toolkit: Python, R, SQL, Scikit-learn, TensorFlow, Hugging Face, Apache Spark, Azure, and
everything in between. When I'm not in the data, I'm an F1 fan who occasionally takes it too far.
Eight end-to-end ML projects spanning NLP, computer vision, recommendation systems, big data, and time-series forecasting.
Fine-tuned a Qwen-7B large language model using QLoRA (Quantized Low-Rank Adaptation) on a curated dataset of 10,000 cleaned SVG–natural language prompt pairs sourced from Kaggle competition data in CSV format. The pipeline included SVG canonicalization — normalizing path commands, stripping metadata, and enforcing consistent coordinate systems — alongside structural validation to discard malformed or non-renderable SVGs before training.
The fine-tuned model learned to generate structured SVG markup from free-form English descriptions of geometric shapes and compositions, significantly outperforming zero-shot Qwen-7B on prompt-to-shape alignment. The training regime used 4-bit quantization to fit the 7B parameter model on a single A100 GPU, with LoRA applied to all attention projection layers. Inference benchmarking showed improved shape accuracy and path coherence over the base model across blind test prompts.
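The core of the LoRA update can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions and no 4-bit quantization — not the project's actual training code — but it shows why the adapter adds so few trainable parameters: only the small A and B matrices are learned, while W stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, alpha = 8, 2, 16              # toy sizes; real runs use rank 8-64
W = rng.normal(size=(d_model, d_model))      # frozen base projection (stand-in)
A = rng.normal(size=(rank, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, rank))                # trainable up-projection, zero-init

def lora_forward(x):
    """Base output plus scaled low-rank correction: x @ (W + (alpha/rank) * B @ A).T"""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_model))
# With B zero-initialized, the adapted model starts out identical to the base.
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialized B is the standard LoRA trick: training begins from the frozen model's exact behavior, and the adapter only gradually learns a correction.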
Engineered a hybrid recommendation engine that fuses content-based and collaborative filtering signals to match job seekers with relevant roles. The content-based component uses TF-IDF vectorization on job descriptions and résumé text with cosine similarity scoring, capturing lexical overlap between candidate profiles and postings. The collaborative component applies Singular Value Decomposition (SVD) on a user–item interaction matrix built from 1,000+ applicant profiles and application histories.
A dynamic weighting mechanism blends both signals at inference time, shifting emphasis toward collaborative filtering for users with sufficient interaction history and toward content-based scoring for cold-start users. This cold-start mitigation was the core design challenge. Final evaluation against a held-out test set showed +17% precision, +14% recall, and +19% F1-score compared to a TF-IDF-only baseline. The system was implemented in Python using Scikit-learn with a preprocessing pipeline that handles noisy text, inconsistent formatting, and skill-level normalization.
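The cold-start weighting idea can be sketched as a single blending function. The linear ramp and the `full_weight_at` threshold below are illustrative assumptions, not the project's actual weighting rule:

```python
def blend_scores(content_score, collab_score, n_interactions,
                 full_weight_at=20):
    """Blend hybrid recommender signals: the weight shifts toward the
    collaborative score as the user's interaction history grows.
    With zero history (cold start), only the content score is used."""
    w = min(n_interactions / full_weight_at, 1.0)  # collaborative weight in [0, 1]
    return w * collab_score + (1.0 - w) * content_score

# Cold-start user: pure content-based score.
assert blend_scores(0.8, 0.3, n_interactions=0) == 0.8
# Heavy user: pure collaborative score.
assert blend_scores(0.8, 0.3, n_interactions=50) == 0.3
```

In practice the blend could also be learned (e.g., a logistic function of history length), but a capped linear ramp is a common, easily tuned starting point.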
Developed a multi-class classification pipeline to identify severity levels of problematic internet usage in adolescents using a dataset of 1,500+ records containing behavioral, psychological, and physical health features from the Healthy Brain Network study. Two primary classifiers were evaluated: a Decision Tree optimized via CART criterion with max-depth tuning, and a Softmax Logistic Regression with L2 regularization.
A key methodological contribution was the use of K-Means clustering (k=4, selected via elbow method and silhouette score) as a feature engineering step — cluster assignments were appended as a categorical feature to the training data, introducing unsupervised structure that improved model performance by 7.5% over raw-feature baselines. The final ensemble achieved 98.7% accuracy on the held-out test set. Class imbalance was handled using SMOTE oversampling on the minority classes. Model interpretability was examined via feature importance plots, revealing that parental monitoring, sleep duration, and time spent on social media were the strongest predictors across classes.
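The cluster-as-feature step is simple to reproduce. The sketch below uses random stand-in data rather than the study's features, but shows the mechanic: fit K-Means, then append each row's cluster assignment as an extra column before supervised training.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))  # stand-in for the behavioral/health features

# Fit k=4 clusters (the project selected k via elbow method + silhouette
# score) and append each row's assignment as a categorical feature.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])

assert X_aug.shape == (200, 7)
assert set(np.unique(clusters)) <= {0, 1, 2, 3}
```

For tree-based models the raw integer label works as-is; for linear models like the softmax regression, one-hot encoding the cluster column avoids implying an ordinal relationship between clusters.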
Applied Principal Component Analysis to 23 financial and behavioral features from a 30,000+ record Taiwanese credit dataset to reduce dimensionality for binary default classification. PCA extracted 8 principal components that collectively explained 96.3% of the total variance — compressing the feature space by 65% while retaining nearly all predictive signal. This reduction cut logistic regression training time by 41% and reduced susceptibility to multicollinearity common in correlated financial features (e.g., payment history across consecutive months).
The PCA decomposition was computed via Singular Value Decomposition on standardized features (zero mean, unit variance). The final logistic regression classifier achieved 81.4% overall accuracy, an AUC of 0.88 on the ROC curve, and 15% improved recall specifically for default cases — the minority class — which is the critical outcome in credit risk modeling. The project also explored the interpretability of principal components by examining their loadings against original features, linking PC1 predominantly to payment delay history and PC2 to credit utilization ratio.
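PCA via SVD on standardized features takes only a few lines of NumPy. This sketch uses random stand-in data (so the explained-variance numbers won't match the 96.3% reported above), but the mechanics are the same ones the project used:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 23))           # stand-in for the 23 credit features

# Standardize to zero mean, unit variance, as the project does before PCA.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal axes, ordered by variance.
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S**2 / np.sum(S**2)          # per-component explained variance ratio

k = 8                                    # the project retained 8 components
X_reduced = Xs @ Vt[:k].T

assert X_reduced.shape == (500, 8)
assert np.isclose(explained.sum(), 1.0)
```

Inspecting `Vt[:k]` row by row is exactly the loadings analysis described above: each row gives the weights linking a principal component back to the original features.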
Implemented the classical Eigenfaces approach for facial recognition on the Yale and ORL benchmark datasets, combining PCA-based dimensionality reduction with an SVM classifier. Raw face images (128×128 grayscale pixels) were vectorized into 16,384-dimensional feature vectors, then PCA was applied to compute the eigenvectors of the covariance matrix — reducing dimensionality by 99%, retaining only the top-k eigenvectors (Eigenfaces) that capture the most variance across the training images.
The reduced-dimension representations were used as inputs to a multi-class SVM with an RBF kernel, tuned via grid search over C and gamma. The Eigenfaces approach achieved 94.2% recognition accuracy, outperforming the raw-pixel SVM baseline by approximately 10 percentage points — demonstrating the power of feature extraction in high-dimensional image spaces with limited training data. The project included thorough visualization of top Eigenfaces and a confusion matrix analysis showing which individuals were most frequently confused under various lighting conditions.
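With 16,384 pixels per image but far fewer training faces, the classic implementation trick is to eigendecompose the small n×n Gram matrix instead of the huge pixel covariance matrix. The sketch below uses random stand-in "faces" and toy counts, but the snapshot-method algebra is the standard one:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 16384                  # 40 training faces, 128x128 = 16,384 pixels
faces = rng.normal(size=(n, d))   # stand-in for vectorized grayscale images

X = faces - faces.mean(axis=0)    # center around the mean face

# Snapshot trick: with n << d, eigendecompose the n x n Gram matrix X X^T
# instead of the d x d covariance X^T X -- they share nonzero eigenvalues.
gram = X @ X.T
eigvals, eigvecs = np.linalg.eigh(gram)         # ascending eigenvalue order
order = np.argsort(eigvals)[::-1][:20]          # top-20 eigenfaces

# Map Gram-space eigenvectors back to pixel space and unit-normalize.
eigenfaces = X.T @ eigvecs[:, order]
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

projections = X @ eigenfaces                    # SVM inputs: 40 x 20
assert projections.shape == (40, 20)
```

The `projections` matrix is what would feed the RBF-kernel SVM; each column of `eigenfaces`, reshaped back to 128×128, is one of the ghostly eigenface images used in the visualizations.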
Designed and deployed a scalable Apache Spark ETL pipeline to ingest, clean, and normalize millions of daily listing and calendar records from the Inside Airbnb open dataset across multiple U.S. cities. The ingestion stage handled messy real-world issues including inconsistently formatted price strings ($1,200/night vs 1200.0), missing amenity fields, irregular date formats, and duplicate listing entries from host re-submissions.
After cleaning, log transformations were applied to right-skewed price and occupancy distributions, and engineered features were created for borough/region indicators, room type dummies, bedroom/bathroom counts, and amenity richness scores. Log-log regression models were fitted to estimate price elasticity: location explained 11–25% of listing price variability (R² by borough), with East Coast markets showing higher mean prices but tighter variance. Demand elasticities were consistently negative across all markets, confirming standard economic theory. Scaling the pipeline from a single-node local Spark instance to a 4-worker cluster delivered approximately 80% wall-clock runtime reduction on the full dataset, achieved through partition optimization and broadcast joins for small lookup tables.
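The price-string normalization in the ingestion stage boils down to a small parsing function. In the actual pipeline this logic would run inside a Spark UDF or column expression; the regex details below are illustrative assumptions:

```python
import re

def parse_price(raw):
    """Normalize messy listing price strings like '$1,200/night',
    '1200.0', or ' $95 ' to a float; return None for unparseable input."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(raw))   # drop $, commas, '/night', spaces
    try:
        return float(cleaned) if cleaned else None
    except ValueError:                          # e.g. '1.2.3' after stripping
        return None

assert parse_price("$1,200/night") == 1200.0
assert parse_price("1200.0") == 1200.0
assert parse_price(None) is None
```

Keeping the parser pure and side-effect-free is what makes it cheap to apply across millions of rows as a vectorized or distributed operation.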
Built a cloud-native, production-grade image-sharing platform using Python and Docker, orchestrated entirely on Microsoft Azure. The backend architecture follows a microservices pattern: Azure SQL / SQL Server handles user authentication and session management; Cosmos DB stores image metadata and user-generated tags; Azure Blob Storage persists raw image binaries with CDN delivery; Table Storage logs audit trails and access events for compliance.
Serverless Azure Functions run statelessly on demand and integrate with Cosmos DB change-feed triggers to implement a fully event-driven ML pipeline: when a new photo is uploaded, a queue-based microservice automatically sanitizes the image (removing EXIF metadata), resizes it to standardized resolutions, and runs an image classification model to auto-tag content before persisting results back to Cosmos DB. Secrets and API keys are managed via Azure Key Vault with managed identities — no secrets in code or environment variables. The full application is containerized with Docker and deployed to Azure App Service with auto-scaling rules, maintaining sub-200ms latency under simulated load of 500 concurrent users.
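The event-driven processing chain can be outlined as plain functions. This is a schematic stand-in, not the deployed Azure code: each step below is a placeholder for an Azure Function fired by the Cosmos DB change feed or a storage queue, and `classify` stands in for the real image-classification model.

```python
def classify(image):
    # Placeholder for the image-classification model call (auto-tagging).
    return ["untagged"]

def process_upload(image):
    """Schematic upload pipeline: sanitize (drop EXIF metadata), resize
    to a standardized resolution, then auto-tag before persisting the
    document back to the metadata store."""
    sanitized = {k: v for k, v in image.items() if k != "exif"}
    sanitized["size"] = (1024, 1024)            # standardized resolution
    sanitized["tags"] = classify(sanitized)     # ML auto-tagging step
    return sanitized

photo = {"blob_url": "https://example.blob/img.jpg",
         "size": (4032, 3024),
         "exif": {"gps": "40.7,-74.0"}}
result = process_upload(photo)
assert "exif" not in result and result["size"] == (1024, 1024)
```

The value of the pattern is that each stage is stateless and independently retryable, which is what lets serverless functions scale the pipeline on demand.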
Developed a time-series forecasting framework in R to model Ferrari Formula 1 pit-stop durations and lap times across multiple race weekends. The project began with exploratory data analysis: visualizing pit-stop duration distributions by circuit and year, checking for trend, seasonality, and calendar effects (e.g., safety car periods, weather-induced outliers).
Stationarity was tested using Augmented Dickey-Fuller and KPSS tests; non-stationary series were corrected via first-order differencing and log transformation to stabilize variance. Autocorrelation (ACF) and partial autocorrelation (PACF) plots informed initial ARIMA order selection, which was then formalized using auto.arima — selecting optimal (p,d,q)(P,D,Q)[s] parameters by minimizing AIC/BIC. Models were validated through time-series k-fold cross-validation with expanding windows, and residual diagnostics confirmed no remaining autocorrelation (Ljung-Box test). The resulting SARIMA models captured seasonal race patterns and delivered accurate lap-time forecasts evaluated via RMSE and MAE across held-out races. Strategic implications for pit-stop timing windows were extracted from the forecast intervals.
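The log-then-difference correction is easy to demonstrate on a toy series. The project used R (`auto.arima` and ADF/KPSS tests); the NumPy sketch below only illustrates the transformation step on synthetic data with an exponential trend:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy non-stationary series: exponential trend plus multiplicative noise,
# standing in for a trending lap-time series.
t = np.arange(100)
series = np.exp(0.02 * t) * (1 + 0.01 * rng.normal(size=100))

# Log transform stabilizes the multiplicative variance; first-order
# differencing then removes the (now linear) trend.
stationary = np.diff(np.log(series))

assert stationary.shape == (99,)
# After the transform, values hover around the growth rate (0.02)
# instead of trending upward.
assert abs(stationary.mean() - 0.02) < 0.01
```

On the transformed series, ACF/PACF plots become meaningful for ARIMA order selection, which is exactly the workflow described above.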
Engineered an end-to-end supply chain analytics pipeline to process and analyze over 1 million sales records using SQL, surfacing a 15% revenue decline across 5 key regions tied to seasonal demand misalignment and distribution bottlenecks. The analysis combined regional segmentation, product-level trend decomposition, and inventory turnover metrics to pinpoint root causes at granular geographic and SKU levels.
Built 10+ Tableau dashboards covering inventory velocity, sales trend heatmaps, demand forecasting overlays, and KPI tracking — adopted by operations leadership for weekly strategic reviews. Integrated historical trend analysis into supply chain planning models, improving forecast accuracy by 20% compared to prior manual estimation methods. The project demonstrated how structured BI tooling and SQL-driven analytics can directly translate into operational decision-making improvements in logistics and inventory management.
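The regional decline analysis reduces to a year-over-year aggregation in SQL. The sketch below runs the query against an in-memory SQLite table with made-up numbers; the schema and figures are illustrative, not the client's data:

```python
import sqlite3

# In-memory stand-in for the sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2022, 120.0), ("East", 2023, 100.0),
     ("West", 2022, 200.0), ("West", 2023, 210.0)],
)

# Year-over-year revenue change per region -- the kind of aggregation
# used to surface declines across key regions.
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN year = 2023 THEN revenue END) * 1.0 /
           SUM(CASE WHEN year = 2022 THEN revenue END) - 1 AS yoy_change
    FROM sales
    GROUP BY region
    ORDER BY yoy_change
""").fetchall()

assert rows[0][0] == "East"                     # steepest decline first
assert abs(rows[0][1] - (100.0 / 120.0 - 1)) < 1e-9
```

At production scale the same pattern extends with SKU-level grouping and window functions for trend decomposition; the dashboards then read from the aggregated views.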
"It doesn't matter where you start —
it's about the impact you leave in the data."
— Omkar Pathare
A full-stack ML toolkit spanning languages, frameworks, cloud platforms, and domain expertise.
Industry-recognized credentials validating expertise across big data, cloud, and analytics.
Two industry internships spanning data analytics and applied machine learning.
Analyzed 1M+ sales records using SQL, uncovering a 15% decline across 5 key regions and surfacing root causes tied to seasonal demand shifts and distribution gaps.
Improved forecast accuracy by 20% by integrating historical trend analysis into supply chain planning models used by the operations team.
Engineered 10+ Tableau dashboards visualizing inventory velocity, regional sales trends, and KPI performance — adopted by leadership for weekly strategic reviews.
Built breast cancer classification models using Python and Scikit-learn, boosting accuracy by 15% through SVM kernel selection and hyperparameter optimization via grid search.
Processed and analyzed RNA-Seq expression data from 800+ tumor samples — handling normalization, batch correction, and feature selection for high-dimensional genomic inputs.
Enhanced model evaluation workflows by implementing stratified k-fold cross-validation and ROC-AUC metrics, improving reporting reliability for clinical research stakeholders.
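The evaluation workflow above can be sketched with Scikit-learn on synthetic data. The feature matrix below is a random stand-in (the real inputs were RNA-Seq expression profiles), but the stratified cross-validation and ROC-AUC scoring are the same pattern:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced binary problem standing in for the tumor data.
X, y = make_classification(n_samples=200, n_features=20, weights=[0.7, 0.3],
                           random_state=0)

# Stratified folds preserve the class ratio in every split, keeping
# ROC-AUC estimates stable on imbalanced clinical data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"),
                         X, y, cv=cv, scoring="roc_auc")

assert scores.shape == (5,)
assert np.all((scores >= 0.0) & (scores <= 1.0))
```

Reporting the per-fold scores (mean ± standard deviation) rather than a single split is what made the results more reliable for clinical stakeholders.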
Open to full-time Data Scientist and ML Engineer roles starting May 2026. Let's connect.