Strive to be a full-stack data scientist who can conduct rigorous research to uncover business insights and also deliver production-level code to build practical applications.
I'm Yingchi, born in China and currently working as a data scientist at Indeed, Singapore. I specialize in building production-level data science solutions in big data environments, with hands-on experience in classical ML methods such as Logistic Regression, Random Forest and Boosting, as well as NLP techniques and neural networks.
Topics of interest: text mining, recommendation, neural networks and more...
I love taekwondo 🥋, piano 🎹 and ice cream 🍦. And I'm keen to learn, experience and share.
Python R Scala SQL JavaScript SAS
Regression(Linear, Logistic) Classification(SVM, KNN, RF) Text Mining(Sentiment, NLP) Neural Networks
Applied NLP techniques (entity embeddings) with tree-based ML models to estimate job salary using structured as well as unstructured text features.
• Designed and developed the jobseeker salary inference pipeline, including model (re)training with AWS SageMaker, model deployment by setting up REST and gRPC services from Python, and model monitoring with scheduled jobs.
• Built Python modules for text summarization and ranking to generate representative content items, using NLP techniques such as TextRank and Word2Vec.
• Prototyped an exploration and exploitation pipeline for dynamic ranking.
Part of the btc.com team.
• Provided data insights for cryptocurrency mining platforms and blockchain explorers using Airflow-scheduled Spark jobs.
• Developed a transaction fee prediction engine using Neural Networks and Generalized Linear Models, building the end-to-end process from acquiring real-time data (Python parser with Redis and MySQL) to training and evaluating models.
• Generated internal data reports using Spark SQL, Hive and graph databases like Neo4j.
Worked in the application team.
• Researched footfall analytics with telco data using machine learning algorithms such as Naive Bayes, Logistic Regression, and Random Forests. Implemented and productionized models in our data analytics platform using Python. Submitted two research papers based on this work, one of which was published.
• Designed and developed a network planning application for telco operators to reduce upgrade costs while improving customer experience. The application was built with Scala and deployed in a big-data environment with Hadoop and Spark.
• Established the internal metrics reporting pipeline by understanding the raw data, the current data management system and the requirements from various team leaders.
• Produced dashboards on system and business performance, using Chartio and SQL, to enable stakeholders to make effective decisions.
• Assisted engineering teams in database design.
• Conducted geolocation data analysis projects to uncover new features and improve model accuracy by running Hadoop and Spark jobs; implemented reproducible code using R Markdown and Python for the projects.
• Built interactive data visualizations (web apps) using JavaScript, Node.js and React for internal and external clients.
• Prepared Budweiser's 2015 Q1 report, which was well received by the client; discovered unusual patterns in the data and initiated deep-dive research to find explanations.
• Collected and compiled the consumer survey data weekly using SPSS Survey Reporter.
Main courses taken:
Neural Networks and Deep Learning (CS5242)
Big-Data Analytics Technology (CS5344)
Phenomena and Theories of Human-Computer Interaction (CS4249)
Text Mining (CS5246)
Knowledge Discovery and Data Mining (CS5228)
Uncertainty Modelling in AI (CS5340)
Main courses taken:
Mining Web Data for Business Insights (BT4222) | Search Engine Optimization & Analytics (BT4212)
Data Mining (ST4240) | Business Intelligence Systems (IS4240)
Stochastic Models in Management (DSC3215) | Computational Methods for Business Analytics (BT3102)
Statistical Methods for Finance (ST4245)
Social Media Network Analysis (IS4241) | Simulation (ST3247)
Stochastic Process (ST3236) | Regression Analysis (ST3131)
2017 IEEE 18th International Conference on Mobile Data Management (MDM)
A Chinese text generator using RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory) layers. The training text is Modu 《默读》, a popular Chinese web novel.