Machine Learning – Lessons from our POC

Developer Support

In this post, App Dev Manager Michael Mano recaps lessons learned from a Machine Learning POC.

Data is key when it comes to machine learning. I recently worked with a customer on a Proof of Concept (POC) and would like to share some of what we learned.

Model Accuracy

A good data model is essential: the more accurate your model, the better your results. If your ML model is not giving satisfactory results, you need to work on improving its accuracy. In our case the data was spread across multiple datasets, and it was hard to aggregate it from different systems to build our model, so we worked closely with a data scientist to combine the data from those sources and feed it into ML.

Data Volume

Having more data gives your machine learning algorithms more information to understand the various situations and correlate them before producing an answer.

More training data should also mean more variety: data that covers a wide range of scenarios helps avoid biased decisions. In general, the more representative data you feed the model, the more its accuracy tends to improve.

Missing Data

Missing data not only makes your model prone to more random errors, it often biases the model's decisions. It is imperative that you handle missing values, for example by replacing them with the mean or median of the observed values. In extreme cases you can either delete the affected records or apply a transformation to the data.
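
As a rough illustration of mean and median imputation, here is a minimal Python sketch using only the standard library. The impute function and the ages column are hypothetical names made up for this example, not code from our POC.

```python
from statistics import mean, median

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two gaps and one outlier (120).
ages = [34, None, 29, 41, None, 120]
filled_mean = impute(ages, "mean")      # fill value is pulled up by the outlier
filled_median = impute(ages, "median")  # fill value is less affected by it
print(filled_mean)
print(filled_median)
```

The median is often the safer default when a column contains outliers, since a single extreme value can shift the mean considerably.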

Error Estimation

An easy way of estimating the test error of a bagged model, without the need for cross-validation, is out-of-bag (OOB) error estimation. Each bagged tree is fit on a bootstrap sample of the training data, and the observations not used to fit a given tree are referred to as its out-of-bag observations. We can predict the response for each observation using only the trees in which that observation was OOB. We then average those predicted responses or take a majority vote, depending on whether the response is quantitative or qualitative. From these predictions an overall OOB MSE (mean squared error) or classification error rate can be computed. This is a valid estimate of the test error because each prediction is based only on trees that were not fit using that observation.
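
The procedure above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the "trees" are depth-1 regression stumps on a single feature, the data is synthetic, and fit_stump and oob_mse are hypothetical helper names. In practice you would normally rely on the OOB support built into your ML library.

```python
import random

def avg(values):
    return sum(values) / len(values)

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = avg(left), avg(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x in the sample is identical: predict the mean
        m = avg(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def oob_mse(xs, ys, n_trees=100, seed=0):
    """Estimate test MSE from out-of-bag predictions of bagged stumps."""
    rng = random.Random(seed)
    n = len(xs)
    oob_preds = [[] for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        tree = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # only trees that never saw observation i
                oob_preds[i].append(tree(xs[i]))
    # Average the OOB predictions per observation, then compute the MSE.
    scored = [(avg(p), ys[i]) for i, p in enumerate(oob_preds) if p]
    return sum((pred - y) ** 2 for pred, y in scored) / len(scored)

# Synthetic data: a noisy step function (an assumption for this demo).
data_rng = random.Random(42)
xs = [i / 10 for i in range(40)]
ys = [(1.0 if x >= 2.0 else 0.0) + data_rng.gauss(0, 0.1) for x in xs]
oob_error = oob_mse(xs, ys)
print(f"OOB MSE: {oob_error:.4f}")
```

For a classification problem, the averaging step would be replaced by a majority vote and the MSE by the classification error rate.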


Trying multiple algorithms, and tuning each one to find the optimum value for every parameter, also improves the accuracy of the model. However, a higher accuracy score does not always mean better results in practice, because the improvement can sometimes be due to over-fitting. It is imperative to have a clearly defined goal for what makes a model successful, based on the user experience.
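
To make the over-fitting point concrete, here is a small sketch that tunes one parameter with k-fold cross-validation and compares the result with the error on the training data. The k-nearest-neighbours regressor, the synthetic sine data and the helper names are assumptions chosen to keep the example short. They are not the algorithms or data from our POC.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def cv_mse(data, k, folds=5):
    """Mean validation MSE over interleaved folds."""
    scores = []
    for f in range(folds):
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        scores.append(mse(train, test, k))
    return sum(scores) / folds

# Synthetic data: noisy sine curve (an assumption for this demo).
rng = random.Random(0)
data = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.3)) for i in range(60)]

candidates = [1, 3, 5, 9, 15]
train_error = {k: mse(data, data, k) for k in candidates}
cv_error = {k: cv_mse(data, k) for k in candidates}
best_k = min(candidates, key=lambda k: cv_error[k])

for k in candidates:
    print(f"k={k:2d}  train MSE={train_error[k]:.3f}  CV MSE={cv_error[k]:.3f}")
print("best k by cross-validation:", best_k)
```

With k=1 every training point is its own nearest neighbour, so the training error is zero even though the model has simply memorised the noise. The cross-validated error is the number to tune against, and the final choice should still be checked against the goal you defined for the user experience.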

I also recommend you check out this blog.