<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Lakshmi’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://lakshmimahabaleshwara.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!PePq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332b9b50-a1da-4d59-a2d5-b3cb69123dbe_144x144.png</url><title>Lakshmi’s Substack</title><link>https://lakshmimahabaleshwara.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 05 Jun 2026 14:38:08 GMT</lastBuildDate><atom:link href="https://lakshmimahabaleshwara.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Lakshmi]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[lakshmimahabaleshwara@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[lakshmimahabaleshwara@substack.com]]></itunes:email><itunes:name><![CDATA[Lakshmi's Notebook]]></itunes:name></itunes:owner><itunes:author><![CDATA[Lakshmi's Notebook]]></itunes:author><googleplay:owner><![CDATA[lakshmimahabaleshwara@substack.com]]></googleplay:owner><googleplay:email><![CDATA[lakshmimahabaleshwara@substack.com]]></googleplay:email><googleplay:author><![CDATA[Lakshmi's Notebook]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Data Preprocessing for Machine Learning]]></title><description><![CDATA[From messy real-world data to model-ready inputs &#8212; an intuitive exploration of Feature Engineering, Missing Values, Scaling, and Normalization through child-inspired 
analogies]]></description><link>https://lakshmimahabaleshwara.substack.com/p/data-preprocessing-for-machine-learning</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/data-preprocessing-for-machine-learning</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Thu, 16 Apr 2026 16:04:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mkUj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p style="text-align: justify;">Most ML tutorials jump straight into models and algorithms. But here's the thing no one tells beginners: <strong>the model is not where the magic happens. The data preparation is.</strong></p><p style="text-align: justify;">Think of it this way. You wouldn't hand a toddler a jigsaw puzzle with missing pieces, warped edges, and pieces from three different puzzles mixed in, and then blame the toddler for not solving it. That's exactly what happens when we feed raw, messy data into a machine learning model and wonder why it performs terribly.</p><p style="text-align: justify;">Data preprocessing is the act of preparing raw data so a model can actually make sense of it. It covers everything from creating useful features to handling gaps in the data to making sure numbers play fair with each other. In this piece, we&#8217;ll walk through four pillars of preprocessing: <strong>Feature Engineering, Handling Missing Values, Scaling, and Normalization</strong>. We&#8217;ll treat them not as dry mathematical procedures, but as intuitive steps you&#8217;ve already seen in how children learn to organize their world.</p><p>By the end, you won&#8217;t just know what these terms mean.
You&#8217;ll understand <strong>why they exist, when they matter, and what breaks when you skip them</strong>.</p><h1>Why Preprocessing Matters</h1><p style="text-align: justify;">A machine learning model is only as good as the data it receives.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mkUj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mkUj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!mkUj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!mkUj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!mkUj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mkUj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg" width="1456" height="433" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:265734,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mkUj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!mkUj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!mkUj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!mkUj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaac014e-81c1-4769-80d4-1abb792e96e2_3360x1000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;">Raw data from the real world is messy. It has gaps. It has inconsistencies. Some numbers are in the thousands, others are tiny decimals. Some information is buried inside other information and needs to be extracted before it becomes useful.</p><p style="text-align: justify;"><strong>Preprocessing is the bridge between raw chaos and structured learning.</strong></p><h1>The Four Pillars of Data Preprocessing</h1><p>Each pillar solves a specific problem in your data. 
Skip any one of them, and your model pays the price.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_WJ4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_WJ4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_WJ4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_WJ4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_WJ4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_WJ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg" width="1456" height="607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:576101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_WJ4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_WJ4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_WJ4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_WJ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7decff22-7058-436e-8de4-3a8f0920397f_3360x1400.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>1. Feature Engineering: </h2><p><em><strong>Teaching the Machine What to Notice</strong></em></p><ul><li><p><strong>The Child:</strong> A child walks into a pet store. They see animals everywhere. But their parent points and says, &#8220;Look at the size. Look at the fur. Does it have a tail?&#8221; The parent isn&#8217;t changing the animals; they&#8217;re teaching the child <strong>what to pay attention to</strong>.</p></li><li><p><strong>The Machine:</strong> Raw data often contains information, but not in a form the model can use directly.
Feature engineering is the process of <strong>creating, transforming, or selecting</strong> the right input variables (features) so the model can actually find patterns.</p></li><li><p><strong>Why it matters:</strong> A column called <em>&#8220;date_of_birth&#8221;</em> is hard for a model predicting insurance risk to use directly. But a new column called <em>&#8220;age&#8221;</em>, derived from that date? Now the model has something it can work with.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZCnE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZCnE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZCnE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZCnE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZCnE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!ZCnE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:383999,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZCnE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZCnE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZCnE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!ZCnE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d9332-d2e6-49d7-897b-0defc8ab00db_3360x1340.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>What Feature Engineering Looks Like in Practice</h3><p style="text-align: justify;"><strong>Creating new features from existing ones:</strong> You have a &#8220;timestamp&#8221; column. You extract <em>&#8220;hour_of_day&#8221;, &#8220;day_of_week&#8221;, and &#8220;is_weekend&#8221;</em>. Suddenly, a model predicting restaurant traffic has useful signals instead of a raw timestamp it can&#8217;t interpret.</p><p style="text-align: justify;"><strong>Combining features:</strong> You have <em>&#8220;house_length&#8221; and &#8220;house_width&#8221;</em>. Neither alone tells the full story. Multiply them to get &#8220;house_area&#8221;, a single feature that captures what two couldn&#8217;t.</p><p style="text-align: justify;"><strong>Encoding categories:</strong> A column says <em>&#8220;Red&#8221;, &#8220;Blue&#8221;, &#8220;Green&#8221;.</em> A model doesn&#8217;t speak English. You convert these to numbers the model can process; one-hot encoding turns &#8220;Red&#8221; into [1, 0, 0], &#8220;Blue&#8221; into [0, 1, 0], and so on.</p><p style="text-align: justify;"><em><strong>The Takeaway:</strong></em> <em>Feature engineering is the art of translating human knowledge into a language the model can learn from. The model finds patterns, but you decide what it gets to look at.</em></p><h2>2. Handling Missing Values: </h2><p><em><strong>Filling in the Gaps</strong></em></p><ul><li><p style="text-align: justify;"><strong>The Child:</strong> A child is reading a picture book, and one page is torn out.
They don&#8217;t throw away the entire book. They might guess what happened on that page based on the story before and after. Or they skip it and keep reading.</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> Real-world datasets almost always have missing values. Sensors fail. People skip survey questions. Records get corrupted. The model needs a strategy: <strong>fill the gap, remove the gap, or flag the gap</strong>.</p></li><li><p style="text-align: justify;"><strong>Why it matters:</strong> Most ML algorithms cannot process empty cells. If you don&#8217;t handle missing values, your model either crashes or silently learns the wrong thing.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bQyd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bQyd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!bQyd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!bQyd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!bQyd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bQyd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg" width="1456" height="607" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:372492,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bQyd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!bQyd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!bQyd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!bQyd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad365d-9850-4656-8804-b2f126411b65_3360x1400.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>The Three Strategies for Missing Data</h3><p style="text-align: justify;"><strong>Strategy 1 &#8212; Remove it (Drop):</strong>
If only a tiny fraction of your data has gaps, sometimes the simplest move is to drop those rows. Like a child skipping the torn page, you lose a little, but the rest of the story still makes sense. <em>Risk: </em>If too many rows have gaps, you lose valuable data.</p><p style="text-align: justify;"><strong>Strategy 2 &#8212; Fill it (Impute):</strong> Replace the missing value with something reasonable. The most common approaches: use the <strong>mean</strong> (average) for numerical data, the <strong>mode</strong> (most frequent value) for categorical data, or the <strong>median</strong> if your data has extreme outliers. Like the child guessing what happened on the missing page based on context.</p><p style="text-align: justify;"><strong>Strategy 3 &#8212; Flag it (Indicator):</strong> Create a new column that says &#8220;this value was missing&#8221; (1 or 0). This way, the model knows the gap existed and can learn whether the <em>absence</em> of data is itself a signal. Sometimes, the fact that a patient skipped a question on a health survey is more informative than any answer they could have given.</p><p><em><strong>The Takeaway:</strong></em> <em>Missing data isn&#8217;t a disaster; it&#8217;s a decision point. How you handle it depends on how much is missing, why it&#8217;s missing, and what your model needs.</em></p><h2>3. Feature Scaling: </h2><p><em><strong>Making the Numbers Play Fair</strong></em></p><ul><li><p style="text-align: justify;"><strong>The Child:</strong> Imagine two children comparing their collections. One child has 3 seashells. The other has 3,000 stickers. If you ask, &#8220;Who has the bigger collection?&#8221; the answer seems obvious, but is it fair? The <em>scales</em> are completely different. 
To compare meaningfully, you&#8217;d need to put both collections on the same measuring system.</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> ML models that calculate distances or gradients (like KNN, SVM, or neural networks) are heavily influenced by the magnitude of numbers. A feature ranging from 0 to 1,000,000 will <strong>dominate</strong> a feature ranging from 0 to 1, even if the smaller feature is more important.</p></li><li><p style="text-align: justify;"><strong>Why it matters:</strong> Without scaling, large-magnitude features bully small-magnitude features into irrelevance. The model doesn&#8217;t know that &#8220;salary in rupees&#8221; and &#8220;years of experience&#8221; should carry equal weight; it just sees big numbers and small numbers.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gODB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gODB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gODB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gODB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!gODB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gODB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gODB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gODB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!gODB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gODB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae72500-294d-4e5f-b87e-ab2f082b0e81_3360x1340.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>The Two Main Scaling Methods</h3><p style="text-align: justify;"><strong>Min-Max Scaling (Normalization to a range):</strong> 
Squishes all values into a fixed range, typically 0 to 1. The formula: (value - min) / (max - min). Like re-expressing each child&#8217;s counts as a percentage of the way from their own smallest to their own largest. Works well when you need bounded values and your data doesn&#8217;t have extreme outliers, because a single extreme value stretches the range and squeezes everything else together.</p><p style="text-align: justify;"><strong>Standardization (Z-score Scaling):</strong> Centers the data around 0 with a standard deviation of 1. The formula: (value - mean) / standard_deviation. Like grading on a curve: every student&#8217;s score is expressed as &#8220;how far from average.&#8221; It copes with outliers better than min-max scaling because the output isn&#8217;t forced into a fixed range (although extreme values still pull the mean and standard deviation), and it suits algorithms that expect features centered on zero with comparable spread. It does not change the shape of the distribution.</p><p style="text-align: justify;"><em><strong>The Takeaway:</strong></em> <em>Scaling doesn&#8217;t change what your data says. It changes how loudly each feature speaks, so no single feature drowns out the others.</em></p><h2>4. Normalization: </h2><p><em><strong>Reshaping the Distribution</strong></em></p><ul><li><p style="text-align: justify;"><strong>The Child:</strong> A teacher asks the class to rate how much they liked a movie on a scale of 1 to 10. One child always gives everything a 9 or 10. Another child&#8217;s ratings spread evenly from 1 to 10. To compare their opinions fairly, you&#8217;d need to adjust for each child&#8217;s <em>personal rating style</em>, that is, how their ratings are distributed.</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> Normalization transforms the <em>shape</em> of your data&#8217;s distribution. Even after scaling, your data might be heavily skewed, with most values clustered on one side and a long tail on the other. Many ML algorithms perform better when the data follows a more symmetric, bell-curve-like distribution.</p></li><li><p style="text-align: justify;"><strong>Why it matters:</strong> Skewed distributions can mislead models. 
If 95% of your income data clusters between &#8377;20,000 and &#8377;80,000 but a few values shoot up to &#8377;50,00,000, the model might overfit to those extremes or underperform on the majority.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BwGg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BwGg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BwGg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 848w, https://substackcdn.com/image/fetch/$s_!BwGg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BwGg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BwGg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg" width="1456" height="581" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:436998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BwGg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BwGg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 848w, https://substackcdn.com/image/fetch/$s_!BwGg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BwGg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb3e99a-bc4b-456a-bbd3-5268b3ee26b8_3360x1340.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Common Normalization Techniques</h3><p style="text-align: justify;"><strong>Log Transformation:</strong> Take the logarithm of each value. This compresses the long tail and spreads out the clustered values. Extremely effective for income data, population data, or anything with exponential growth patterns.</p><p style="text-align: justify;"><strong>Box-Cox Transformation:</strong> A more flexible version that finds the optimal power transformation to make your data as close to a normal distribution as possible. The math picks the best &#8220;reshaping&#8221; automatically.</p><p style="text-align: justify;"><strong>Quantile Transformation:</strong> Forces the data into a specific distribution (usually uniform or normal) by mapping values to their percentile ranks. 
This is the most aggressive approach: it guarantees the output shape but can distort the relative distances between values.</p><p style="text-align: justify;"><em><strong>The Takeaway:</strong></em> <em>Normalization is about the shape of your data, not the scale. It ensures your data&#8217;s distribution doesn&#8217;t secretly sabotage your model&#8217;s assumptions.</em></p><h1>Scaling vs. Normalization: </h1><p><em><strong>Clearing Up the Confusion</strong></em></p><p style="text-align: justify;">These two terms get mixed up constantly, so here is the distinction:</p><p style="text-align: justify;"><strong>Scaling</strong> adjusts the <em>range</em> or <em>magnitude</em> of your features. It answers: &#8220;How big are these numbers relative to each other?&#8221;</p><p style="text-align: justify;"><strong>Normalization</strong> adjusts the <em>distribution shape</em> of your features. It answers: &#8220;What does the spread of these numbers look like?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pRN9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pRN9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pRN9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!pRN9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pRN9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pRN9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg" width="1456" height="433" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305244,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pRN9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!pRN9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pRN9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pRN9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10e6805-dc51-4b68-81e5-0f7ab0182f2d_3360x1000.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;">You often need both. In practice, reshape a skewed feature first (log and Box-Cox transforms need positive raw values), then scale so that every feature lands on a comparable range.</p><p style="text-align: justify;"><em>Think of it as: scaling sets the volume, normalization tunes the equalizer.</em></p><h1>The Preprocessing Decision Tree</h1><p>How do you decide what preprocessing to apply?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yr3o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yr3o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yr3o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yr3o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Yr3o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Yr3o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg" width="1456" height="711" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:711,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:403649,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/194272657?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yr3o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Yr3o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Yr3o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Yr3o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71f0d6c5-67d1-450e-8b4f-fc9cd090cbdb_3360x1640.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Conclusion</h1><p style="text-align: justify;">Data preprocessing isn&#8217;t glamorous. Nobody writes headlines about it. But it is, without exaggeration, <strong>where most real ML work happens</strong>.</p><p style="text-align: justify;">Feature engineering decides what the model gets to see. Missing value handling decides how gaps are managed. 
Scaling ensures no feature unfairly dominates. Normalization reshapes distributions so algorithms can work as designed.</p><p style="text-align: justify;">These aren&#8217;t optional steps you can skip when you&#8217;re in a hurry. They&#8217;re the <strong>foundation</strong>. Every model, every prediction, every result sits on top of how well you prepared the data.</p><p style="text-align: justify;">If you step back, the pattern is strikingly familiar: whether it&#8217;s a child learning to organize their toy box or a model learning to classify images, the quality of learning depends entirely on the <strong>quality of preparation</strong>.</p><p style="text-align: justify;">Get the data right, and the model will surprise you. Skip the preparation, and no algorithm in the world will save you.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://lakshmimahabaleshwara.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Lakshmi&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://lakshmimahabaleshwara.substack.com/p/data-preprocessing-for-machine-learning?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Lakshmi&#8217;s Substack! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://lakshmimahabaleshwara.substack.com/p/data-preprocessing-for-machine-learning?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://lakshmimahabaleshwara.substack.com/p/data-preprocessing-for-machine-learning?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p style="text-align: justify;"></p>]]></content:encoded></item><item><title><![CDATA[The Core Paradigms of Machine Learning]]></title><description><![CDATA[From labeled data to autonomous decision-making: an intuitive exploration of supervised, unsupervised, self-supervised, and reinforcement learning through child-inspired analogies]]></description><link>https://lakshmimahabaleshwara.substack.com/p/the-core-paradigms-of-machine-learning</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/the-core-paradigms-of-machine-learning</guid><dc:creator><![CDATA[Lakshmi's 
Notebook]]></dc:creator><pubDate>Tue, 24 Mar 2026 14:03:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yp6b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p style="text-align: justify;">Most machine learning explanations start with algorithms. In this post, let&#8217;s start with a simpler question: <strong>how does anything learn?</strong></p><p style="text-align: justify;">At its core, machine learning isn&#8217;t magic; it&#8217;s a set of strategies for turning data into behavior. Sometimes we guide the system with explicit answers. Sometimes we let it uncover structure on its own. And sometimes we let it learn through feedback, improving with every decision it makes.</p><p style="text-align: justify;">In this piece, we&#8217;ll walk through the four core learning paradigms: supervised, unsupervised, self-supervised, and reinforcement learning, not as abstract theory, but as intuitive learning patterns you&#8217;ve already seen in the real world. By the end, you won&#8217;t just recognize the terms; you&#8217;ll understand <strong>why these approaches exist, when they matter, and how they shape modern AI systems</strong>.</p><h1>Defining the Paradigms</h1><p style="text-align: justify;">A learning paradigm is the fundamental way an algorithm processes data to find patterns. 
It is not just a mathematical function; it is a distinct philosophical approach to problem-solving.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yp6b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yp6b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 424w, https://substackcdn.com/image/fetch/$s_!yp6b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 848w, https://substackcdn.com/image/fetch/$s_!yp6b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 1272w, https://substackcdn.com/image/fetch/$s_!yp6b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yp6b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png" width="1456" height="547" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1582602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yp6b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 424w, https://substackcdn.com/image/fetch/$s_!yp6b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 848w, https://substackcdn.com/image/fetch/$s_!yp6b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 1272w, https://substackcdn.com/image/fetch/$s_!yp6b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a336eef-5e09-43d7-a6b7-24e8233e48b7_1680x631.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://lakshmimahabaleshwara.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Lakshmi&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Four Paths to Pattern Recognition</h2><p style="text-align: justify;">Each paradigm dictates exactly what kind of data the machine needs and how independently it is allowed to operate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ys6q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ys6q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 424w, https://substackcdn.com/image/fetch/$s_!Ys6q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 848w, https://substackcdn.com/image/fetch/$s_!Ys6q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 1272w, https://substackcdn.com/image/fetch/$s_!Ys6q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ys6q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png" width="1067" height="695" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:695,&quot;width&quot;:1067,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1058309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ys6q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 424w, https://substackcdn.com/image/fetch/$s_!Ys6q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 848w, https://substackcdn.com/image/fetch/$s_!Ys6q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 1272w, https://substackcdn.com/image/fetch/$s_!Ys6q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81e5291-38f6-487d-8ac6-9b3090d839f8_1067x695.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>1. Supervised Learning: The Flashcard Method</h3><ul><li><p style="text-align: justify;"><strong>The Child:</strong> You show a child a card with a picture of a &#127822; and the word &#8220;Apple&#8221; written on it. You repeat this until they can see the shape and say the word.</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> We give the model a <strong>Dataset</strong> (the pictures) and <strong>Labels</strong> (the word &#8220;Apple&#8221;). 
The model learns a mathematical function that maps each input to the correct output.</p></li><li><p style="text-align: justify;"><strong>Use case:</strong> This is used for <strong>Classification</strong> (Is this email spam or not?) and <strong>Regression</strong> (Predicting the price of a house).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6lLu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6lLu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 424w, https://substackcdn.com/image/fetch/$s_!6lLu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 848w, https://substackcdn.com/image/fetch/$s_!6lLu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 1272w, https://substackcdn.com/image/fetch/$s_!6lLu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6lLu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png" width="1456" height="581" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1521750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6lLu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 424w, https://substackcdn.com/image/fetch/$s_!6lLu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 848w, https://substackcdn.com/image/fetch/$s_!6lLu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 1272w, https://substackcdn.com/image/fetch/$s_!6lLu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5dbe355-e81c-4a33-aed8-66e35efa7cee_1682x671.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p style="text-align: justify;"><em><strong>Ground Truth:</strong></em> <strong>The model relies entirely on the &#8220;Teacher&#8221; being 100% correct on the labels provided.</strong></p><div><hr></div><h3>2. Unsupervised Learning: Discovering on Your Own</h3><ul><li><p style="text-align: justify;"><strong>The Child:</strong> A child is given a bucket of colored blocks &#128998;&#129521;&#128993;. Without being told to, they might start putting all the red ones together and all the blue ones together.</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> The model looks for <strong>Features</strong> (color, shape, size) and calculates the &#8220;distance&#8221; between data points. 
Things that are &#8220;close&#8221; together get grouped into <strong>Clusters</strong>.</p></li><li><p style="text-align: justify;"><strong>Use case:</strong> This is used for <strong>Clustering</strong> (Grouping similar news articles together) and <strong>Anomaly Detection</strong> (Finding a credit card transaction that looks "weird" or "different" from your usual spending).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ezDp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ezDp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 424w, https://substackcdn.com/image/fetch/$s_!ezDp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 848w, https://substackcdn.com/image/fetch/$s_!ezDp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 1272w, https://substackcdn.com/image/fetch/$s_!ezDp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ezDp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png" width="1456" 
height="603" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:603,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1617127,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ezDp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 424w, https://substackcdn.com/image/fetch/$s_!ezDp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 848w, https://substackcdn.com/image/fetch/$s_!ezDp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 1272w, https://substackcdn.com/image/fetch/$s_!ezDp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c5323c1-bac5-4229-b659-199a9fba8712_1703x705.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p style="text-align: justify;"><em><strong>Clustering:</strong></em> <strong>Finding underlying structure in chaotic data when you don&#8217;t know what you are actually looking for yet.</strong></p><div><hr></div><h3>3. Self-Supervised Learning: The Context Clue</h3><ul><li><p style="text-align: justify;"><strong>The Child:</strong> You show a child a sentence with a word missing: &#8220;The cat sat on the ___.&#8221; Based on their past experience, they guess &#8220;mat&#8221; or &#8220;floor.&#8221;</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> We take a massive amount of unlabeled data (like the entire internet) and hide parts of it. 
The model tries to predict the missing piece.</p></li><li><p style="text-align: justify;"><strong>Use case:</strong> This is the "secret sauce" behind <strong>Large Language Models (LLMs)</strong> like ChatGPT. It allows machines to learn from massive amounts of data without humans having to label every single thing.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bH2a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bH2a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 424w, https://substackcdn.com/image/fetch/$s_!bH2a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 848w, https://substackcdn.com/image/fetch/$s_!bH2a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 1272w, https://substackcdn.com/image/fetch/$s_!bH2a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bH2a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png" width="1456" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/884a5175-095c-4583-95af-baa5fd534816_1685x625.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1455257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bH2a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 424w, https://substackcdn.com/image/fetch/$s_!bH2a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 848w, https://substackcdn.com/image/fetch/$s_!bH2a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 1272w, https://substackcdn.com/image/fetch/$s_!bH2a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884a5175-095c-4583-95af-baa5fd534816_1685x625.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p style="text-align: justify;"><em><strong>Pre-training</strong>: </em><strong>This is how "Brainy" models like GPT are built before they are taught specific tasks.</strong></p><div><hr></div><h3>4. Reinforcement Learning: Learning from Consequences</h3><ul><li><p style="text-align: justify;"><strong>The Child:</strong> A child tries to ride a bike &#128690;. They wobble (penalty), they adjust their balance, and eventually, they move forward (reward).</p></li><li><p style="text-align: justify;"><strong>The Machine:</strong> An <strong>Agent</strong> lives in an <strong>Environment</strong>. It takes an <strong>Action</strong>, and the environment gives it a <strong>Score</strong> (Reward). 
The machine&#8217;s only goal is to maximize that score over time.</p></li><li><p style="text-align: justify;"><strong>Use case:</strong> This is used for <strong>Robotics</strong> and <strong>Game AI</strong> (like AlphaGo), where the machine needs to make a sequence of decisions to reach a goal.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iZ61!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iZ61!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 424w, https://substackcdn.com/image/fetch/$s_!iZ61!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 848w, https://substackcdn.com/image/fetch/$s_!iZ61!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 1272w, https://substackcdn.com/image/fetch/$s_!iZ61!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iZ61!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png" width="1456" height="626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1358819,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iZ61!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 424w, https://substackcdn.com/image/fetch/$s_!iZ61!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 848w, https://substackcdn.com/image/fetch/$s_!iZ61!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 1272w, https://substackcdn.com/image/fetch/$s_!iZ61!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3933b3-56d3-4b1c-9ae3-4d1d1406645b_1629x700.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p style="text-align: justify;"><em><strong>Trial and Error</strong>: </em><strong>The model doesn't need to be told the "right" answer upfront, just whether its actions are moving it closer to the ultimate goal.</strong></p><div><hr></div><h1>The Architect&#8217;s Decision Tree</h1><p style="text-align: justify;">How do you choose the right paradigm for your specific problem?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!35Gr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!35Gr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 424w, https://substackcdn.com/image/fetch/$s_!35Gr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 848w, https://substackcdn.com/image/fetch/$s_!35Gr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 1272w, https://substackcdn.com/image/fetch/$s_!35Gr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!35Gr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png" width="1456" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1591050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/191837037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!35Gr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 424w, https://substackcdn.com/image/fetch/$s_!35Gr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 848w, https://substackcdn.com/image/fetch/$s_!35Gr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 1272w, https://substackcdn.com/image/fetch/$s_!35Gr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936bdd-8125-42f8-9513-84de73fd75ba_1724x826.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h1>Conclusion</h1><p style="text-align: justify;">Machine learning, at its core, is not about models; it&#8217;s about <strong>how learning is structured</strong>.</p><p style="text-align: justify;">Supervised learning shows us the power of guidance. Unsupervised learning reveals how structure can emerge without labels. Self-supervised learning bridges the gap, turning raw data into its own teacher. And reinforcement learning pushes systems to learn through interaction, feedback, and long-term consequences.</p><p style="text-align: justify;">These paradigms aren&#8217;t competing ideas; they are <strong>complementary lenses</strong>. Modern AI systems often blend them, moving fluidly from labeled data to pattern discovery to autonomous decision-making. 
Understanding this progression is what separates surface-level familiarity from true intuition.</p><p style="text-align: justify;">If you step back, the pattern is strikingly simple: whether in machines or humans, learning evolves from <strong>instruction &#8594; exploration &#8594; self-discovery &#8594; adaptation</strong>.</p><p style="text-align: justify;">And once you see machine learning this way, every model, every system, and every breakthrough becomes easier to reason about, not as complexity, but as a variation of these fundamental ways of learning.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://lakshmimahabaleshwara.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Lakshmi&#8217;s Substack! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[A Beginner’s Guide to Building High-Throughput, Low-Latency Data Pipelines]]></title><description><![CDATA[Understand the key design principles, trade-offs, and architecture patterns behind building fast, reliable, and scalable data pipelines from the ground up.]]></description><link>https://lakshmimahabaleshwara.substack.com/p/a-beginners-guide-to-building-high</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/a-beginners-guide-to-building-high</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Mon, 27 Oct 2025 10:40:10 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qyga!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qyga!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qyga!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 424w, https://substackcdn.com/image/fetch/$s_!qyga!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 848w, https://substackcdn.com/image/fetch/$s_!qyga!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 1272w, https://substackcdn.com/image/fetch/$s_!qyga!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qyga!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png" width="728" height="565.1953125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1024,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1069001,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/177254203?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b41ee15-d42e-43f4-a0ab-3cbf29320fa6_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qyga!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 424w, https://substackcdn.com/image/fetch/$s_!qyga!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 848w, https://substackcdn.com/image/fetch/$s_!qyga!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 1272w, https://substackcdn.com/image/fetch/$s_!qyga!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aab39e-9652-4c4c-91c2-8eba85f7229f_1024x795.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In today&#8217;s data-driven world, the ability to process massive amounts of information in near real-time is a significant competitive advantage. For businesses in sectors such as finance, e-commerce, and IoT, handling millions of transactions per second is no longer a futuristic dream; it&#8217;s a daily necessity. But how do you build a system that can handle this data deluge without buckling under pressure?</p><p>This guide will walk you through the fundamentals of designing a high-throughput, low-latency data pipeline. 
We&#8217;ll explore the initial questions to ask, the core architectural decisions, and a detailed look at the modern technologies that power today&#8217;s most demanding data platforms.</p><h3>First Things First: Asking the Right Questions</h3><p>Before diving into the technical nitty-gritty, it&#8217;s crucial to understand the &#8220;<strong>what</strong>&#8221; and &#8220;<strong>why</strong>&#8221; of your data pipeline. Rushing this stage is like building a house without a blueprint. Here are some essential questions to evaluate your needs:</p><ul><li><p><strong>What is the nature of your data?</strong> Is it structured (like financial transactions), semi-structured (like JSON logs), or unstructured (like social media feeds)?</p></li><li><p><strong>What are your throughput and latency requirements?</strong> Be specific. 
Do you need to process 1 million transactions per second with a 100-millisecond delay, or 10 million with a sub-50-millisecond delay?</p></li><li><p><strong>Is real-time processing a must-have?</strong> Can some data be processed in batches, or must every event be acted upon instantly?</p></li><li><p><strong>What level of data accuracy and reliability is acceptable?</strong> Do you need an &#8220;exactly-once&#8221; processing guarantee to ensure every single transaction is accounted for?</p></li><li><p><strong>What are your scalability and future growth projections?</strong> Your pipeline should be able to handle not just today&#8217;s data load, but also your expected growth.</p></li><li><p><strong>What is your budget and in-house expertise?</strong> Are you looking for open-source solutions that require more hands-on management, or are you willing to invest in managed cloud services?</p></li></ul><h3>The Core Architectural Decisions</h3><p>Once you have clear requirements, you can start making fundamental design choices.</p><h4>Batch vs. Stream Processing: The Eternal Debate</h4><ul><li><p><strong>Batch Processing:</strong> Data is collected over a period and then processed in large chunks or &#8220;batches.&#8221; This is efficient for large volumes of data where immediate action isn&#8217;t necessary, like end-of-day financial reporting.</p></li><li><p><strong>Stream Processing:</strong> This method involves processing data in real-time as it&#8217;s generated. For use cases like real-time fraud detection or instant e-commerce recommendations, stream processing is non-negotiable.</p></li></ul><p>For many modern applications that require both real-time insights and deep historical analysis, a hybrid approach is often employed. Architectures like the <strong>Lambda Architecture</strong> combine both batch and stream processing layers to serve different needs. 
A simpler alternative, the <strong>Kappa Architecture</strong>, relies solely on stream processing.</p><h3>Choosing Your Technology Stack: The Building Blocks</h3><p>A data pipeline is composed of several interconnected stages. Here&#8217;s a breakdown of the key components and popular technology choices for each:</p><h4><strong>1. Data Ingestion: The Front Door</strong></h4><p>This is where your data enters the pipeline. The goal here is to reliably capture a high volume of data from various sources.</p><ul><li><p><strong>Key Technologies:</strong></p><ul><li><p><strong>Apache Kafka:</strong> A distributed streaming platform that has become the standard for high-throughput, real-time data feeds. It acts as a durable and scalable &#8220;message bus.&#8221;</p></li><li><p><strong>Amazon Kinesis:</strong> A fully managed service on AWS for real-time data streaming.</p></li><li><p><strong>Google Cloud Pub/Sub:</strong> A serverless messaging service on Google Cloud that can handle millions of messages per second.</p></li></ul></li></ul><h4><strong>2. Data Processing: The Brains of the Operation</strong></h4><p>This is where the magic happens. The processing engine transforms, enriches, and analyzes the incoming data streams.</p><ul><li><p><strong>Key Technologies:</strong></p><ul><li><p><strong>Apache Flink:</strong> A powerful open-source framework for stateful stream processing, known for its high performance and low latency.</p></li><li><p><strong>Apache Spark Streaming:</strong> An extension of the popular Apache Spark analytics engine that enables scalable and fault-tolerant stream processing.</p></li><li><p><strong>Google Cloud Dataflow:</strong> A managed service that provides a unified model for both batch and stream processing.</p></li></ul></li></ul><h4>3. Data Storage: The Modern Data Platform</h4><p>Once processed, the data needs to be stored for querying, analysis, and visualization. The choice of storage depends on how you intend to use the data. 
This is where modern data architectures like the <strong>Data Lakehouse</strong> come into play, blending the flexibility of a data lake with the power of a data warehouse.</p><h5>The Rise of the Lakehouse: Hudi, Delta Lake, and Iceberg</h5><p>Traditional data lakes, while cost-effective for storing vast amounts of raw data on platforms like Amazon S3, often struggle with reliability and performance. This has led to the emergence of the &#8220;<em>lakehouse</em>&#8221; architecture, which brings data warehouse capabilities like <em>ACID</em> transactions and schema enforcement directly to your data lake. This is powered by open table formats that manage the underlying data files.</p><p><em><strong>Apache Hudi</strong></em></p><ul><li><p><strong>What it is:</strong> Standing for &#8220;Hadoop Upserts, Deletes and Incrementals,&#8221; Hudi is an open-source transactional data lake platform that specializes in enabling low-latency database-style operations on the data lake.</p></li><li><p><strong>How it helps:</strong> Hudi is excellent for change data capture (CDC) scenarios, allowing for efficient upserts (inserting new records or updating existing ones) and deletes. It provides a choice between optimizing for query performance (Copy-on-Write, which rewrites data files at write time) or write performance (Merge-on-Read, which logs changes and merges them at read or compaction time), giving engineers fine-grained control over their pipeline&#8217;s tradeoffs.</p></li><li><p><strong>Key Features:</strong></p><ul><li><p><strong>Incremental Processing:</strong> Provides powerful primitives for processing only the data that has changed since the last run.</p></li><li><p><strong>Copy-on-Write vs. 
Merge-on-Read:</strong> Two distinct storage types to optimize for specific read/write patterns.</p></li><li><p><strong>Concurrency Control:</strong> Manages simultaneous writers and readers to ensure data consistency.</p></li></ul></li></ul><p><em><strong>Delta Lake</strong></em></p><ul><li><p><strong>What it is:</strong> An open-source storage layer that brings ACID transactions to Apache Spark and other big data engines. It adds a transaction log on top of your Parquet files in your data lake.</p></li><li><p><strong>How it helps:</strong> In a high-throughput pipeline, data is constantly being written. Delta Lake ensures that your queries always see a consistent view of the data, even during writes. It helps manage the &#8220;small file problem&#8221; by optimizing file sizes, which is crucial for query performance. Its deep integration with Spark Structured Streaming makes it a natural fit for real-time pipelines.</p></li><li><p><strong>Key Features:</strong></p><ul><li><p><strong>ACID Transactions:</strong> Guarantees data reliability and consistency for concurrent operations.</p></li><li><p><strong>Time Travel:</strong> Allows you to query previous versions of your data, invaluable for debugging and auditing.</p></li><li><p><strong>Schema Enforcement &amp; Evolution:</strong> Prevents bad data from corrupting tables and allows schemas to change over time.</p></li></ul></li></ul><p><em><strong>Apache Iceberg</strong></em></p><ul><li><p><strong>What it is:</strong> An open table format for huge analytic datasets, originally developed at Netflix. Like its counterparts, it brings the reliability of SQL tables to your data lake.</p></li><li><p><strong>How it helps:</strong> Iceberg is known for its strong schema evolution capabilities, allowing you to add, drop, or rename columns without rewriting data files. This is a massive advantage in fast-moving environments where data structures change. 
Its design avoids performance bottlenecks tied to a central metadata store, making it highly scalable.</p></li><li><p><strong>Key Features:</strong></p><ul><li><p><strong>Full Schema Evolution:</strong> Adapt your table schemas to evolving data sources without disrupting the pipeline.</p></li><li><p><strong>Hidden Partitioning:</strong> Simplifies querying by automatically handling data partitioning.</p></li><li><p><strong>Engine Agnostic:</strong> Designed to work with various processing engines like Spark, Flink, and Presto, preventing vendor lock-in.</p></li></ul></li></ul><h4>The Powerhouses: Managed Cloud Platforms</h4><p>For organizations that prefer a fully managed experience, cloud-native platforms offer incredible power and ease of use.</p><p><em><strong>Snowflake</strong></em></p><ul><li><p><strong>What it is:</strong> A fully managed, cloud-native data platform that can act as a data warehouse, data lake, or a combination of both.</p></li><li><p><strong>How it helps:</strong> Snowflake&#8217;s unique architecture separates storage from compute, meaning you can scale your processing power up or down independently to handle fluctuating workloads without disruption. 
For high-throughput ingestion, it offers tools like Snowpipe Streaming for low-latency, real-time data loading.</p></li><li><p><strong>Key Features:</strong></p><ul><li><p><strong>Separation of Storage and Compute:</strong> Provides incredible flexibility and cost-efficiency.</p></li><li><p><strong>High Concurrency:</strong> Handles a large number of concurrent users and queries without performance degradation.</p></li><li><p><strong>Semi-Structured Data Support:</strong> Natively handles formats like JSON and Avro.</p></li></ul></li></ul><p>Other major players include <strong>Google BigQuery</strong> and <strong>Amazon Redshift</strong>, which are powerful data warehouses optimized for complex analytical queries on large datasets, and <strong>Databricks</strong>, a unified platform built around Apache Spark and Delta Lake that aims to cover the entire data and AI lifecycle.</p><h4>For Specialized Access Patterns</h4><ul><li><p><strong>For Low-Latency Access:</strong> In-memory databases like <strong>Redis</strong> or <strong>Apache Ignite</strong> offer the fastest possible access times for applications that need immediate responses, such as caching or real-time feature stores. NoSQL databases like <strong>Apache Cassandra</strong> or <strong>DynamoDB</strong> are designed for high-throughput reads and writes at scale.</p></li><li><p><strong>For High-Performance Needs:</strong> For the most demanding AI and real-time analytics workloads, leveraging <strong>NVMe (Non-Volatile Memory Express) Storage</strong> can offer extremely high-speed I/O.</p></li></ul><h3>Putting It All Together: A Sample Architecture</h3><p>Imagine an e-commerce platform that needs to provide real-time product recommendations and detect fraudulent transactions. 
A modern pipeline could look like this:</p><ol><li><p><strong>Ingestion:</strong> User clickstream and transaction events are sent to <strong>Apache Kafka</strong>.</p></li><li><p><strong>Processing:</strong> An <strong>Apache Flink</strong> job consumes the Kafka streams in real-time, enriching and analyzing the data.</p></li><li><p><strong>Storage &amp; Serving:</strong></p><ul><li><p>The processed data is continuously written to an <strong>Apache Hudi</strong> table on Amazon S3 to efficiently handle frequent updates to user profiles and order statuses.</p></li><li><p>A fraud detection model flags suspicious transactions and sends alerts immediately.</p></li><li><p>For broad business intelligence, the Hudi tables are queried using <strong>Snowflake</strong>, which can elastically scale its compute to meet query demands from different teams without impacting the real-time pipeline.</p></li><li><p>The latest product recommendations are pushed to a low-latency <strong>Redis</strong> cache for instant retrieval by the website.</p></li></ul></li></ol><h3>The Journey Doesn&#8217;t End Here</h3><p>Building a high-throughput, low-latency data pipeline is a complex but achievable goal. By starting with a clear understanding of your requirements and leveraging modern open table formats like Hudi, Delta, and Iceberg, alongside powerful cloud platforms like Snowflake, you can create a robust, scalable, and future-proof system. 
Remember that a data pipeline is not a one-and-done project; it requires continuous monitoring, optimization, and iteration to meet the evolving needs of your business.</p>]]></content:encoded></item><item><title><![CDATA[Understanding Parquet File Format]]></title><description><![CDATA[The Data Engineer&#8217;s Friend for Efficient Storage and Querying.]]></description><link>https://lakshmimahabaleshwara.substack.com/p/understanding-parquet-file-format</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/understanding-parquet-file-format</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Sun, 07 Sep 2025 12:06:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!o1Wt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>What is a File Format?</strong></h2><p>A <strong>file format</strong> defines how data is structured and stored so that software can read and write it consistently.</p><p>Think of it as the <em>container</em> for your data. 
Just like music comes in MP3, WAV, or FLAC, data files also come in different formats depending on the use case.</p><p><strong>Common data file formats today:</strong></p><ul><li><p><strong>CSV</strong> &#8211; Simple text-based, rows separated by commas (easy to use, but large in size and no compression).</p></li><li><p><strong>JSON</strong> &#8211; Flexible, hierarchical (good for APIs but verbose).</p></li><li><p><strong>Avro</strong> &#8211; Row-based binary format, compact, supports schema evolution (common in streaming).</p></li><li><p><strong>ORC</strong> &#8211; Columnar format optimized for Hadoop and Hive.</p></li><li><p><strong>Parquet</strong> &#8211; Columnar format designed for analytics and compression efficiency.</p></li></ul><h2><strong>Introduction to Parquet</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o1Wt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o1Wt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 424w, https://substackcdn.com/image/fetch/$s_!o1Wt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 848w, https://substackcdn.com/image/fetch/$s_!o1Wt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 1272w, 
https://substackcdn.com/image/fetch/$s_!o1Wt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o1Wt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png" width="1024" height="621" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:621,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1105969,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/173003761?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4230651-510b-48a3-8b5e-84e1385f3fca_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o1Wt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 424w, https://substackcdn.com/image/fetch/$s_!o1Wt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 848w, 
https://substackcdn.com/image/fetch/$s_!o1Wt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 1272w, https://substackcdn.com/image/fetch/$s_!o1Wt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d6acb5-beeb-49b3-8f21-a4ed6621544d_1024x621.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>If you work with data, you&#8217;ve probably heard the word <strong>Parquet</strong> thrown around in conversations about data 
lakes, analytics, or cloud storage. But what exactly is it, and why does everyone seem to love it?</p><h3><strong>The Everyday Analogy: Organizing Your Groceries</strong></h3><p>Imagine you run a grocery delivery service, and you have to store all the items customers ordered.</p><p>You could store them in two ways:</p><ol><li><p><strong>Row-based storage (like CSV)</strong></p><p>You pack each order into its own bag: milk, bread, apples, and pasta all go in together. If you want to find <em>all</em> the bread you&#8217;ve sold, you have to open every bag and dig around to see if bread is inside.</p></li><li><p><strong>Column-based storage (like Parquet)</strong></p><p>Instead of mixing, you group similar items together:</p><ul><li><p>One box for all the milk</p></li><li><p>One box for all the bread</p></li><li><p>One box for all the apples</p></li></ul><p>If you need <em>all</em> the bread you&#8217;ve sold, you simply open the bread box; there is no need to touch anything else.</p></li></ol><p>That&#8217;s Parquet in action: <strong>organized storage that makes retrieval faster and more efficient.</strong></p><h3><strong>The Technical View</strong></h3><p>Parquet is an <strong>open-source, columnar storage format</strong> developed by Twitter and Cloudera, now an Apache project. 
It&#8217;s:</p><ul><li><p><strong>Self-describing</strong> &#8211; Metadata in the file footer stores schema, number of rows, and statistics.</p></li><li><p><strong>Highly compressed</strong> &#8211; Supports Snappy, Gzip, ZSTD, and more.</p></li><li><p><strong>Structured internally</strong> into:</p><ul><li><p><strong>Row Groups</strong> &#8594; subsets of rows (horizontal partitions).</p></li><li><p><strong>Column Chunks</strong> &#8594; data for each column inside a row group (vertical partitions).</p></li><li><p><strong>Pages</strong> &#8594; the smallest unit of storage, making reads efficient.</p></li></ul></li><li><p><strong>Magic Number (PAR1)</strong> &#8211; Ensures the file is recognized as Parquet.</p></li></ul><p>This hybrid layout balances efficient columnar reads with manageable write performance.</p>
<h2><strong>Why Choose Parquet Over Other Formats?</strong></h2><p>Now that we understand Parquet, let&#8217;s see why it stands out compared to other formats.</p><ul><li><p><strong>CSV</strong> &#8211; Human-readable but inefficient; no compression or schema support.</p></li><li><p><strong>JSON</strong> &#8211; Flexible for nested data, but verbose and large in size.</p></li><li><p><strong>Avro</strong> &#8211; Great for row-based streaming but not optimized for analytics.</p></li><li><p><strong>ORC</strong> &#8211; Similar to Parquet, optimized for Hadoop and Hive workloads.</p></li></ul><p><strong>Parquet vs the rest:</strong></p><ul><li><p>Much smaller file sizes due to efficient compression</p></li><li><p>Faster query performance with column pruning and predicate pushdown</p></li><li><p>Widely supported across Spark, Hive, Presto, Trino, BigQuery, Snowflake, AWS Athena, and more</p></li><li><p>Ideal for data lakes and analytical workloads</p></li></ul><h2><strong>How Data is Written in Parquet</strong></h2><p>When writing data into a Parquet file, the process follows a layered structure:</p><ol><li><p><em><strong>Rows are collected into Row Groups</strong></em></p><ul><li><p>Instead of writing each record one by one (like CSV), Parquet groups multiple rows together into a <strong>Row Group</strong> (horizontal partition).</p></li></ul></li><li><p><em><strong>Within each Row Group, data is split into Column Chunks</strong></em></p><ul><li><p>Each column&#8217;s values are stored together &#8594; Column A values go into <strong>Column Chunk A</strong>, Column B values 
into <strong>Column Chunk B</strong>, and so on.</p></li><li><p>This preserves the columnar storage advantage.</p></li></ul></li><li><p><em><strong>Column Chunks are broken into Pages</strong></em></p><ul><li><p>Pages are the <strong>smallest unit of storage</strong> in Parquet.</p></li><li><p>Types of pages:</p><ul><li><p><strong>Data Pages</strong> &#8594; actual column data</p></li><li><p><strong>Dictionary Pages</strong> &#8594; dictionary encoding for repeated values</p></li><li><p><strong>Index Pages</strong> (optional) &#8594; help with seeking</p></li></ul></li></ul></li><li><p><em><strong>Metadata is written into the File Footer</strong></em></p><ul><li><p>Contains schema, row count, statistics (min/max per column), and row group metadata.</p></li><li><p>Makes Parquet a <strong>self-describing format</strong>.</p></li></ul></li><li><p><em><strong>Magic Number (PAR1) is added</strong></em></p><ul><li><p>At the start and end of the file &#8594; ensures the file can be validated as a Parquet file.</p></li></ul></li></ol><h2><strong>How Data is Read from a Parquet File</strong></h2><ol><li><p><em><strong>File Footer Read First</strong></em></p></li></ol><ul><li><p>Parquet doesn&#8217;t scan the entire file right away.</p></li><li><p>It jumps directly to the <strong>footer</strong>, where all the <strong>metadata</strong> is stored.</p></li><li><p>Metadata tells the engine:</p><ul><li><p>Schema (what columns exist)</p></li><li><p>Number of rows</p></li><li><p>Statistics for each column in each row group (min/max)</p></li><li><p>Row group and column chunk offsets (where the data lives in the file)</p></li></ul></li></ul><p>&#10145;&#65039; This helps the engine decide <strong>which parts of the file to actually read</strong>.</p><p><em><strong>   2. 
Predicate Pushdown &amp; Column Pruning</strong></em></p><ul><li><p>If the query only needs certain columns (SELECT name, age), only those <strong>column chunks</strong> are accessed.</p></li><li><p>If the query has filters (WHERE age &gt; 30), the engine uses the <strong>min/max stats in metadata</strong> to skip entire <strong>row groups</strong> that can&#8217;t match. For example, a row group whose maximum age is 25 is never read.</p></li></ul><p>&#10145;&#65039; Skipped columns and row groups are never read from storage, which reduces I/O drastically compared to row-based formats.</p><p><em><strong>3. Reading Row Groups</strong></em></p><ul><li><p>The engine opens only the <strong>relevant row groups</strong>.</p></li><li><p>Inside each row group, it locates the <strong>column chunks</strong> needed for the query.</p></li></ul><p><em><strong>4. Decompression &amp; Decoding Pages</strong></em></p><ul><li><p>Column chunks are split into <strong>pages</strong>.</p></li><li><p>Each page is decompressed (Snappy, Gzip, ZSTD, etc.) and decoded (Dictionary, RLE, Delta encoding).</p></li><li><p>Pages can be read and decoded one at a time, so the engine doesn&#8217;t need to hold an entire decompressed column chunk in memory.</p></li></ul><p><em><strong>5. 
Reconstructing Results</strong></em></p><ul><li><p>Once relevant column data is read, decompressed, and decoded, the engine reconstructs the requested rows and columns.</p></li><li><p>The final output is materialized as a result set for the query.</p></li></ul><h2><strong>Parquet File Size: Small vs Large</strong></h2><p>The size of each Parquet file matters a lot for performance.</p><ul><li><p><strong>Files That Are Too Small (&lt; 50 MB each)</strong></p><ul><li><p>Every file needs its own footer read and file-open operation, so thousands of tiny files mean thousands of extra operations.</p></li><li><p>This increases overhead for query engines (Spark, Presto, Hive).</p></li><li><p>Known as the <em>small files problem</em>.</p></li></ul></li><li><p><strong>Files That Are Too Large (&gt; 1 GB each)</strong></p><ul><li><p>Can slow down reads and reduce parallelism.</p></li></ul></li></ul><p><strong>&#128161; Ideal File Size:</strong></p><ul><li><p><strong>256 MB to 1 GB</strong> for most big data frameworks.</p></li><li><p>Large enough to compress well, small enough for parallelism.</p></li></ul><h2><strong>How Data Types Affect Compression (and File Size)</strong></h2><p>Parquet&#8217;s efficiency is heavily influenced by <strong>data type selection</strong>:</p><ul><li><p><strong>Strings vs Integers:</strong> Numbers stored as integers compress far better than the same numbers stored as strings.</p></li><li><p><strong>Decimals vs Floats:</strong> Decimals (fixed precision) often compress better than floats.</p></li><li><p><strong>Categorical Data:</strong> Repeated values (like &#8220;USA&#8221;) compress extremely well with dictionary encoding.</p></li><li><p><strong>Timestamps:</strong> Storing as INT64 (epoch values) is more efficient than strings.</p></li></ul><p><strong>Tip:</strong> Choose the smallest data type that still represents your values exactly to save space and improve query speed.</p><h2><strong>Encoding Strategies Used in Parquet</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!-336!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-336!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-336!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-336!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-336!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-336!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1848260,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/173003761?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-336!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-336!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-336!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-336!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa629efd3-0624-4e8d-ba2b-9e9cf619c4d5_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>1. Run Length Encoding (RLE)</strong></em></p><ul><li><p><strong>How it works</strong>: Stores consecutive repeated values as a single value + a count.</p></li><li><p><strong>Best for</strong>: Columns with long runs of the same value (e.g., flags, booleans, country/state codes).</p><p><strong>Example</strong>:</p><p>Raw &#8594; A A A A B B C C C</p><p>RLE &#8594; (A,4), (B,2), (C,3)</p></li><li><p><strong>Benefit</strong>: Extremely efficient for repetitive data, reduces size drastically.</p></li></ul><p><em><strong>2. 
Dictionary Encoding</strong></em></p><ul><li><p><strong>How it works</strong>: Builds a dictionary of unique values in a column and replaces each value with a small index reference.</p></li><li><p><strong>Best for</strong>: Low-cardinality columns (few unique values, but repeated often).</p><p><strong>Example</strong>:</p><p>Raw &#8594; USA, USA, CAN, USA, CAN, MEX</p><p>Dictionary &#8594; {0: USA, 1: CAN, 2: MEX}</p><p>Encoded &#8594; 0, 0, 1, 0, 1, 2</p></li><li><p><strong>Benefit</strong>: Saves space when values repeat often; lookups are fast.</p></li></ul><p><em><strong>3. Delta Encoding</strong></em></p><ul><li><p><strong>How it works</strong>: Stores the difference (delta) between consecutive values instead of storing full values.</p></li><li><p><strong>Best for</strong>: Sequential/numeric data like timestamps, IDs, counters.</p><p><strong>Example</strong>:</p><p>Raw &#8594; 100, 105, 110, 115</p><p>Delta &#8594; 100, +5, +5, +5</p></li><li><p><strong>Benefit</strong>: Great for compressing increasing sequences; works well with integers and dates.</p></li></ul><h2><strong>Parquet Optimization &amp; Best Practices</strong></h2><ul><li><p><strong>Predicate Pushdown</strong> &#8211; Enable in your query engine to filter row groups at scan time.</p></li><li><p><strong>Compression Codecs</strong> &#8211;</p><ul><li><p>Snappy: Fast, decent ratio</p></li><li><p>Gzip: Higher compression, slower</p></li><li><p>ZSTD: Good balance</p></li></ul></li><li><p><strong>Avoid Small Files</strong> &#8211; Merge into larger files to reduce metadata overhead.</p></li><li><p><strong>Row Group Size</strong> &#8211; Balance between compression (bigger row groups) and parallelism (smaller ones).</p></li><li><p><strong>Encoding Strategies</strong> &#8211;</p><ul><li><p>Dictionary Encoding for low-cardinality strings</p></li><li><p>Run-Length Encoding for repeated values</p></li><li><p>Delta Encoding for incremental numbers like timestamps</p></li></ul></li><li><p><strong>Sort 
Data</strong> &#8211; Sorting on commonly filtered columns improves compression and predicate pushdown.</p></li><li><p><strong>ACID Transactions</strong> &#8211; Parquet itself doesn&#8217;t provide ACID; use table formats like <strong>Delta Lake, Iceberg, or Hudi</strong> for transactional guarantees.</p></li></ul><h2><strong>Conclusion</strong></h2><p>Parquet has become a <strong>default choice</strong> in the modern data stack because it offers the best of both worlds: <strong>compact storage</strong> and <strong>fast analytics</strong>.</p><p>By understanding how it works internally (row groups, column chunks, pages) and following best practices (ideal file sizes, data types, compression, and encodings), you can save costs, improve query speeds, and build scalable pipelines.</p><p>In short, <strong>Parquet is the &#8220;organized pantry&#8221; of the data world: efficient, structured, and always ready when you need it.</strong></p><h2><strong>Further Reading</strong></h2><p>For a deeper dive into the Parquet format&#8217;s internals, file layout, and optimizations, check out <strong><a href="https://vutr.substack.com/p/the-overview-of-parquet-file-format">&#8220;The Overview of Parquet File Format&#8221;</a></strong> by Vu Trinh. It&#8217;s a clear and insightful read for data engineers looking to go beyond the basics.</p>
<p></p>]]></content:encoded></item><item><title><![CDATA[A Practical Guide to Data World Jargon]]></title><description><![CDATA[Decoding Key Terms and Concepts Every Data Professional Should Know]]></description><link>https://lakshmimahabaleshwara.substack.com/p/a-practical-guide-to-data-world-jargon</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/a-practical-guide-to-data-world-jargon</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Fri, 08 Aug 2025 10:32:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WsYL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WsYL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WsYL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 424w, 
https://substackcdn.com/image/fetch/$s_!WsYL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 848w, https://substackcdn.com/image/fetch/$s_!WsYL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 1272w, https://substackcdn.com/image/fetch/$s_!WsYL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WsYL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png" width="1182" height="1568" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1568,&quot;width&quot;:1182,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:196634,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/170433698?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328f6b45-704c-4c49-93ca-1814b5625429_1414x2000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!WsYL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 424w, https://substackcdn.com/image/fetch/$s_!WsYL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 848w, https://substackcdn.com/image/fetch/$s_!WsYL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 1272w, https://substackcdn.com/image/fetch/$s_!WsYL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cb9c88c-7eb8-4b65-b9ef-7816d5173e79_1182x1568.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>I worked as a software engineer, mainly focusing on backend development, for a long time, and in 2018, I switched to data engineering, building data pipelines for huge tax datasets at Intuit.</p><p>During my initial days, attending meetings felt like walking into a foreign-language class. Terms like <em>data lake</em>, <em>data lineage</em>, and <em>delta lake</em> were thrown around casually, and I was left scribbling them down so I could Google them later.</p><p>If you&#8217;ve ever found yourself drowning in data lingo, this blog is for you. We&#8217;ll break down the most important concepts in today&#8217;s data landscape, from the basics like <strong>data warehouse</strong> to emerging ideas like <strong>data mesh</strong>, so you can navigate conversations and projects with confidence.</p><h2><strong>1. 
Data Collection &amp; Storage</strong></h2><h3><strong>Data Collection</strong></h3><p>The process of gathering information from various sources to be stored, processed, and analyzed later.</p><ul><li><p><strong>Sources:</strong> Databases, APIs, sensors, user interactions, surveys, web scraping, and third-party datasets.</p></li><li><p><strong>Goal:</strong> Capture reliable, consistent, and usable data for downstream processes.</p></li><li><p><strong>Example:</strong> Collecting transaction logs from an e-commerce site, IoT device readings, or customer feedback forms.</p></li></ul><h3><strong>Data Ingestion</strong></h3><p>The process of bringing raw data from multiple sources into a storage system, such as a data lake or data warehouse.</p><ul><li><p><strong>Types:</strong> Batch ingestion (scheduled loads) and streaming ingestion (real-time flows).</p></li><li><p><strong>Example:</strong> Loading website clickstream data into AWS S3 (batch) or publishing it to Kafka (streaming).</p></li></ul><h3><strong>Data Lake</strong></h3><p>A centralized repository that stores raw, unprocessed data (structured, semi-structured, and unstructured) at any scale.</p><ul><li><p><strong>Use Case:</strong> Ideal for storing large amounts of data for future analysis.</p></li><li><p><strong>Example:</strong> AWS S3, Azure Data Lake Storage.</p></li></ul><h3><strong>Delta Lake</strong></h3><p>An open-source storage layer that adds <strong>ACID transactions</strong> to data lakes, making them reliable for analytics.</p><ul><li><p><strong>Key Benefit:</strong> Handles versioning, schema enforcement, and upserts.</p></li><li><p><strong>Example:</strong> Databricks Delta Lake.</p></li></ul><h3><strong>Data Warehouse</strong></h3><p>A structured, high-performance database optimized for analytics and reporting. 
Data here is cleaned, transformed, and stored in a relational format.</p><ul><li><p><strong>Use Case:</strong> Business intelligence dashboards, KPI reporting.</p></li><li><p><strong>Example:</strong> Snowflake, Google BigQuery, Amazon Redshift.</p></li></ul><h3><strong>Data Lakehouse</strong></h3><p>A hybrid architecture combining the flexibility of a <strong>data lake</strong> with the reliability and performance of a <strong>data warehouse</strong>.</p><ul><li><p><strong>Example:</strong> Databricks Lakehouse Platform.</p></li></ul><h2><strong>2. 
Access &amp; Discovery</strong></h2><h3><strong>Data Access Points</strong></h3><p>The interfaces, services, or endpoints through which users, applications, or systems retrieve or interact with data.</p><ul><li><p><strong>Examples:</strong> APIs, SQL query endpoints, data catalog portals, streaming feeds.</p></li><li><p><strong>Importance:</strong> Determines how securely and efficiently data consumers can get the information they need.</p></li></ul><h3><strong>Data Discovery</strong></h3><p>The process of finding and cataloging available datasets within an organization.</p><ul><li><p><strong>Tool Examples:</strong> Alation, Collibra, Apache Atlas.</p></li></ul><h3><strong>Data Catalog</strong></h3><p>An organized inventory of datasets, including metadata, lineage, and profiling details.</p><ul><li><p><strong>Purpose:</strong> Makes data assets easier to find and understand.</p></li></ul><h3><strong>Metadata</strong></h3><p>Data about data &#8212; describing its source, format, owner, creation date, and usage patterns.</p><ul><li><p><strong>Example:</strong> A dataset&#8217;s metadata might include its schema, last refresh date, and responsible data owner.</p></li></ul><h3><strong>Data Dictionary</strong></h3><p>A centralized repository that contains detailed descriptions of all data elements within a database, data warehouse, or system.</p><ul><li><p><strong>What It Includes:</strong> Field names, data types, allowed values, constraints, relationships, and business definitions.</p></li><li><p><strong>Purpose:</strong> Ensures consistent understanding and usage of data across teams, supports documentation, and helps with governance.</p></li><li><p><strong>Example:</strong> In a customer database, the data dictionary might specify that <em>Customer_ID</em> is a unique integer, <em>Email_Address</em> is a string with a valid format, and <em>Join_Date</em> is stored in YYYY-MM-DD format.</p></li></ul><h2><strong>3. 
Management &amp; Governance</strong></h2><h3><strong>Data Governance</strong></h3><p>The policies, roles, and processes that ensure data is accurate, secure, compliant, and responsibly used.</p><ul><li><p><strong>Example Components:</strong> Access control, compliance monitoring, stewardship.</p></li></ul><h3><strong>Data Stewardship</strong></h3><p>The practice of managing and ensuring data quality within a specific domain or subject area.</p><ul><li><p><strong>Role:</strong> Data stewards are responsible for approving changes and ensuring consistency.</p></li></ul><h3><strong>Master Data Management (MDM)</strong></h3><p>A discipline focused on creating a single, consistent view of core business entities (e.g., customers, products).</p><ul><li><p><strong>Benefit:</strong> Avoids duplication and ensures a single source of truth.</p></li></ul><h3><strong>Data Quality</strong></h3><p>A measure of how well the data fits its intended purpose.</p><ul><li><p><strong>Key Dimensions:</strong> Accuracy, completeness, consistency, timeliness.</p></li><li><p><strong>Impact:</strong> High-quality data leads to better analytics and decision-making.</p></li></ul><h2><strong>4. 
Processing, Movement &amp; Analysis</strong></h2><h3><strong>Data Integration</strong></h3><p>The process of combining data from different sources into a unified view.</p><ul><li><p><strong>Methods:</strong> ETL, ELT, APIs, middleware.</p></li><li><p><strong>Goal:</strong> Make data from multiple systems work together for analysis and reporting.</p></li><li><p><strong>Example:</strong> Merging CRM data with financial transaction records to get a full customer view.</p></li></ul><h3><strong>ETL (Extract, Transform, Load)</strong></h3><p>A data integration process where data is extracted from sources, transformed for quality and compatibility, and then loaded into a target system.</p><ul><li><p><strong>When Used:</strong> Traditional batch processing.</p></li></ul><h3><strong>ELT (Extract, Load, Transform)</strong></h3><p>A modern approach where raw data is loaded first and transformed inside the target system.</p><ul><li><p><strong>When Used:</strong> Cloud-native warehouses like Snowflake or BigQuery.</p></li></ul><h3><strong>Data Pipeline</strong></h3><p>An automated sequence of steps to move, transform, and store data between systems.</p><ul><li><p><strong>Example:</strong> Kafka stream &#8594; Spark transformation &#8594; Redshift load.</p></li></ul><h3><strong>Data Profiling</strong></h3><p>The process of analyzing data to understand its structure, quality, and relationships. 
It involves examining data for completeness, consistency, uniqueness, and patterns to identify anomalies or issues before processing.</p><ul><li><p><strong>Example Output:</strong> Null count, data type mismatches, range checks.</p></li></ul><h3><strong>Data Lineage</strong></h3><p>A visual or documented map showing where data came from, how it was transformed, and where it is stored.</p><ul><li><p><strong>Value:</strong> Essential for debugging, auditing, and compliance.</p></li></ul><h3><strong>Data Analysis</strong></h3><p>The process of inspecting, cleaning, transforming, and modelling data to uncover useful information, patterns, and insights that support decision-making.</p><ul><li><p><strong>Techniques:</strong> Statistical analysis, visualization, machine learning, trend detection.</p></li><li><p><strong>Output:</strong> Reports, dashboards, predictive models, recommendations.</p></li><li><p><strong>Example:</strong> Analyzing sales trends to forecast demand, or using customer purchase history to recommend products.</p></li></ul><h2><strong>5. Modern Architectural Approaches</strong></h2><h3><strong>Data Mesh</strong></h3><p>A decentralized data architecture paradigm that treats data as a product, with domain-specific teams owning and managing their data.</p><ul><li><p><strong>Benefit:</strong> Improves scalability and reduces bottlenecks in large organizations.</p></li></ul><h3><strong>Data Fabric</strong></h3><p>An architecture that uses metadata-driven automation to integrate, manage, and provide access to data across hybrid and multi-cloud environments. It simplifies data management by automating tasks like integration and governance.</p><ul><li><p><strong>Benefit:</strong> Automation, AI-driven insights, cross-platform compatibility.</p></li></ul><h2><strong>Conclusion</strong></h2><p>The data world moves fast; today&#8217;s buzzword can become tomorrow&#8217;s standard. 
Understanding these terms isn&#8217;t just about keeping up with industry jargon; it&#8217;s about being able to design, build, and manage systems effectively.</p><p>When I first moved from backend development into data engineering, learning this vocabulary was my survival kit. Hopefully, this guide will be yours too.</p><p></p>]]></content:encoded></item><item><title><![CDATA[🔐 Probabilistic vs Deterministic Encryption: Understanding the Building Blocks of Secure Systems]]></title><description><![CDATA[Encryption Deep Dive: Base Keys, Derived Keys, and Where to Encrypt (Local vs Remote)]]></description><link>https://lakshmimahabaleshwara.substack.com/p/probabilistic-vs-deterministic-encryption</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/probabilistic-vs-deterministic-encryption</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Wed, 06 Aug 2025 10:02:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZV-j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZV-j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZV-j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 424w, https://substackcdn.com/image/fetch/$s_!ZV-j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 848w, https://substackcdn.com/image/fetch/$s_!ZV-j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 1272w, https://substackcdn.com/image/fetch/$s_!ZV-j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZV-j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png" width="525" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58172953-26d4-4e55-ad2d-e8a478537629_525x482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:525,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/170245467?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZV-j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 424w, https://substackcdn.com/image/fetch/$s_!ZV-j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 848w, https://substackcdn.com/image/fetch/$s_!ZV-j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 1272w, https://substackcdn.com/image/fetch/$s_!ZV-j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58172953-26d4-4e55-ad2d-e8a478537629_525x482.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>When I worked at&nbsp;<strong>Intuit</strong>, my team was responsible for building the&nbsp;<strong>data pipeline</strong>&nbsp;that processed TurboTax <strong>e-filing data</strong> received from the TurboTax website. This data needed to be made available for various dashboards and analytics use cases.</p><p>A major challenge was managing&nbsp;<strong>PII (Personally Identifiable Information)</strong>&nbsp;such as Social Security Numbers (SSNs). 
These are highly sensitive and must be protected, but the <strong>analytics team</strong> also needed to <strong>search and filter</strong> records using these fields for legitimate business insights.</p><p>In this post, I&#8217;ll explain how we approached encryption in our pipeline, the concepts of <strong>deterministic vs probabilistic encryption</strong>, and how <strong>local vs remote encryption</strong> workflows work in practice.</p><h2><strong>&#128272; Why Encryption Was Needed</strong></h2><p>PII such as SSNs is highly sensitive. We needed a solution that:</p><ul><li><p><strong>Protected confidentiality</strong> in case of a data breach.</p></li><li><p>Allowed <strong>exact match search</strong> on encrypted values for analytics queries.</p></li><li><p>Complied with data protection regulations.</p></li></ul><p>To achieve this, we used <strong>deterministic encryption</strong> with <strong>derived keys</strong> when storing the data in <strong>Hive</strong>.</p><h2><strong>Deterministic vs Probabilistic Encryption</strong></h2><h3><strong>&#127922; Probabilistic Encryption:</strong></h3><ul><li><p>Every time you encrypt the <strong>same value</strong> with the same key, you get a <strong>different ciphertext</strong>.</p></li><li><p>Each encryption mixes in fresh randomness (a random IV or nonce), which makes it harder for attackers to guess values based on frequency.</p></li><li><p>However, it&#8217;s <strong>not directly searchable</strong>, since each encryption produces a different result.</p></li></ul><pre><code><code>Encrypt('ABC', 'AES', key1) &#8594; 'qwoeoewowe' 
Encrypt('ABC', 'AES', key1) &#8594; 'c,xcslslsd' 
Encrypt('ABC', 'AES', key1) &#8594; 'fjkdfdfd;xsd'  

Encrypt('DEF', 'AES', key1) &#8594; 'fdf2424;SER' 
Encrypt('DEF', 'AES', key1) &#8594; 'kl,sek8457'</code></code></pre><p><strong>Use Case:</strong></p><ul><li><p>Ideal when <strong>data confidentiality</strong> is the top priority and <strong>searchability is not required</strong>.</p></li></ul><p><strong>Strength:</strong></p><ul><li><p>Defends against <strong>ciphertext pattern attacks</strong> such as frequency analysis, because equal plaintexts do not produce equal ciphertexts.</p></li></ul><p><strong>Downside:</strong></p><ul><li><p>Equality checks and searches on encrypted values are <strong>not possible</strong> without decrypting first.</p></li></ul><h3><strong>&#9989; Deterministic Encryption:</strong></h3><ul><li><p>Every time you encrypt the <strong>same value</strong> with the same key, you get the <strong>same ciphertext</strong>.</p></li><li><p>This makes it <strong>searchable</strong> because identical inputs always produce identical outputs.</p></li><li><p>The trade-off is weaker security than probabilistic encryption: anyone who can see the ciphertexts can tell which records share a value, so frequency patterns can be detected.</p></li></ul><pre><code>Encrypt('ABC', 'AES-SIV', key2) &#8594; 'adsfffdfd'
Encrypt('ABC', 'AES-SIV', key2) &#8594; 'adsfffdfd'
Encrypt('ABC', 'AES-SIV', key2) &#8594; 'adsfffdfd'

Encrypt('DEF', 'AES-SIV', key2) &#8594; 'als34asaual'
Encrypt('DEF', 'AES-SIV', key2) &#8594; 'als34asaual'</code></pre><p><strong>Use Case:</strong></p><ul><li><p>Useful when you need to <strong>query encrypted fields</strong>, e.g., searching for users with a specific SSN or email.</p></li></ul><p><strong>Downside:</strong></p><ul><li><p>Vulnerable to <strong>pattern analysis</strong>: if two ciphertexts match, they came from the same plaintext.</p></li></ul><p>In our pipeline, we <strong>chose deterministic encryption</strong> because the analytics team needed to query encrypted SSNs directly without decrypting them.</p><p><em>For example</em>, to find the SSN '111-22-3333' when all SSNs are encrypted with the key 'key2', we encrypt the search value with the same key and compare it against the stored ciphertext:</p><pre><code>SELECT * FROM TABLE 
WHERE SSN = Encrypt('111-22-3333', 'AES-SIV', key2);</code></pre><h2><strong>&#129513; Base Keys and Derived Keys</strong></h2><h3><strong>&#128273; Base Key:</strong></h3><p>A <strong>base key</strong> (master key) is the root key stored securely (e.g., in a Key Management Service or HSM). It is <strong>never</strong> used directly for encrypting data.</p><h3><strong>&#129514; Derived Key:</strong></h3><p>A <strong>derived key</strong> is generated from the base key using a deterministic function like HKDF to produce unique keys for specific data sets or purposes. Derived keys:</p><ul><li><p>Are unique per user or context</p></li><li><p>Enable <strong>key separation</strong></p></li><li><p>Allow <strong>rotating individual keys</strong> without touching the base key</p></li></ul><p>This separation means that even if a derived key is compromised, the base key remains safe, and other derived keys are unaffected.</p><h2><strong>&#127968; Local vs &#127760; Remote Encryption</strong></h2><h3><strong>&#127968; Local Encryption (Client-side):</strong></h3><ul><li><p>The <strong>application server</strong> retrieves the encryption key from a secure key store (like AWS KMS, Google Cloud KMS, Azure Key Vault) and performs encryption locally.</p></li><li><p><strong>Pros</strong>: Faster, with fewer network calls after key retrieval.</p></li><li><p><strong>Cons</strong>: Keys are briefly present on the application server (higher risk if the server is compromised).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IaqO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IaqO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 424w, https://substackcdn.com/image/fetch/$s_!IaqO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 848w, https://substackcdn.com/image/fetch/$s_!IaqO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 1272w, https://substackcdn.com/image/fetch/$s_!IaqO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IaqO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png" width="564" height="651" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:564,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/170245467?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IaqO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 424w, https://substackcdn.com/image/fetch/$s_!IaqO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 848w, https://substackcdn.com/image/fetch/$s_!IaqO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 1272w, https://substackcdn.com/image/fetch/$s_!IaqO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b15fe3-eb75-4c80-8ad8-8949ca079349_564x651.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>&#127760; Remote Encryption (Server-side):</strong></h3><ul><li><p>The <strong>application server never sees the key</strong>. It sends the value to encrypt to the KMS, which returns the encrypted value.</p></li><li><p><strong>Pros</strong>: More secure since keys never leave the KMS.</p></li><li><p><strong>Cons</strong>: Slower due to extra network calls.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M5vP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M5vP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 424w, https://substackcdn.com/image/fetch/$s_!M5vP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 848w, https://substackcdn.com/image/fetch/$s_!M5vP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 1272w, 
https://substackcdn.com/image/fetch/$s_!M5vP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M5vP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png" width="559" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:559,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62514,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/170245467?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M5vP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 424w, https://substackcdn.com/image/fetch/$s_!M5vP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 848w, 
https://substackcdn.com/image/fetch/$s_!M5vP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 1272w, https://substackcdn.com/image/fetch/$s_!M5vP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc35a9c-25fb-433a-a321-766dc57c468b_559x632.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Glossary</strong></h2><ul><li><p><strong>AES (Advanced Encryption Standard)</strong>: A widely used symmetric encryption 
algorithm.</p></li><li><p><strong>AES-SIV (AES&#8211;Synthetic IV)</strong>: A deterministic encryption mode that produces the same ciphertext for the same input and key, while still ensuring authenticity.</p></li><li><p><strong>HKDF (HMAC-based Key Derivation Function)</strong>: A standard method to derive multiple keys from a single master key.</p></li><li><p><strong>KMS (Key Management Service)</strong>: A secure service to store, manage, and use encryption keys (e.g., AWS KMS, Google Cloud KMS, Azure Key Vault).</p></li></ul><h2><strong>Conclusion</strong></h2><p>By combining <strong>deterministic encryption</strong> with <strong>derived keys</strong>, we ensured that:</p><ul><li><p>PII was protected at rest.</p></li><li><p>Data could still be searched for analytics.</p></li><li><p>Encryption keys were managed securely via a base&#8211;derived key architecture.</p></li></ul><p>Whether you choose <strong>local or remote encryption</strong> depends on your <strong>security requirements</strong> and <strong>performance needs</strong>. In high-security environments, remote encryption is preferred, while local encryption offers faster performance when the risk is lower.</p>]]></content:encoded></item><item><title><![CDATA[Medallion Architecture Explained: When It Shines and When to Keep It Simple]]></title><description><![CDATA[A practical look at when to use the bronze&#8211;silver&#8211;gold data lakehouse pattern, how it supports data quality, and why it may not fit every use case]]></description><link>https://lakshmimahabaleshwara.substack.com/p/medallion-architecture-explained</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/medallion-architecture-explained</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Sat, 26 Jul 2025 08:01:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DaDu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>Medallion Architecture, often visualized as the bronze, silver, and gold layers of a data platform, has become a modern standard for structuring data lakes and lakehouses. But like any architectural choice, it isn&#8217;t a universal solution.</p><p>In this blog, we&#8217;ll explore what Medallion Architecture offers, when it shines, the trade-offs it brings, and best practices to help you decide if it&#8217;s right for your organization.</p><h2><strong>What is Medallion Architecture?</strong></h2><p><strong>Medallion architecture</strong> is a layered design pattern for organizing data in a lakehouse. 
It enables us to systematically improve the quality, reliability, and accessibility of data through a series of logical zones&#8212;<em>Bronze</em>, <em>Silver</em>, and <em>Gold</em>&#8212;each with a clear purpose:</p><p><em><strong>Image from Databricks.com</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DaDu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DaDu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 424w, https://substackcdn.com/image/fetch/$s_!DaDu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 848w, https://substackcdn.com/image/fetch/$s_!DaDu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!DaDu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DaDu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png" width="1456" height="700" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56615,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/169286216?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DaDu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 424w, https://substackcdn.com/image/fetch/$s_!DaDu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 848w, https://substackcdn.com/image/fetch/$s_!DaDu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!DaDu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d57ff0-65ff-4093-a960-5d1eb2810bd6_2288x1100.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>This separation helps data teams manage complexity, improve data quality, and keep raw data traceable for audits and reprocessing.</p><ul><li><p><strong>Bronze</strong>: This is the raw landing zone. Here, we directly dump data from various sources, including APIs, event logs, batch files, and streaming data. No transformations, just as-it-comes&#8212;think append-only, immutable storage. This zone provides our data lineage and audit trail.</p></li><li><p><strong>Silver</strong>: Here&#8217;s where the real work happens. We clean up, validate, drop corrupt records, deduplicate, standardize formats (e.g., consistent timestamps), and join across sources. 
This is typically the richest layer for data scientists, who require up-to-date, reliable data for modelling.</p></li><li><p><strong>Gold</strong>: Optimized for business and analytics consumption. These are the datasets behind our dashboards, quarterly reports, and executive summaries. We aggregate, calculate KPIs, and model according to business needs, often using dimensional models like star schemas.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z9mF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z9mF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 424w, https://substackcdn.com/image/fetch/$s_!Z9mF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 848w, https://substackcdn.com/image/fetch/$s_!Z9mF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 1272w, https://substackcdn.com/image/fetch/$s_!Z9mF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z9mF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png" width="765" 
height="183" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:183,&quot;width&quot;:765,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29883,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/169286216?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z9mF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 424w, https://substackcdn.com/image/fetch/$s_!Z9mF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 848w, https://substackcdn.com/image/fetch/$s_!Z9mF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 1272w, https://substackcdn.com/image/fetch/$s_!Z9mF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50c3be-0f05-4287-b10e-874a71021f9a_765x183.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>When to use the Medallion Architecture</h2><p>The Medallion Architecture is well-suited for specific scenarios:</p><ol><li><p><strong>Complex Data 
Pipelines:</strong></p><ul><li><p>When dealing with diverse data sources (e.g., streaming, batch, structured, unstructured), the layered approach simplifies ingestion, transformation, and consumption.</p></li></ul></li><li><p><strong>Large-Scale Data Processing:</strong></p><ul><li><p>Organizations with petabyte-scale data benefit from the architecture&#8217;s scalability and ability to leverage cloud storage and compute.</p></li></ul></li><li><p><strong>Data Quality and Governance Needs:</strong></p><ul><li><p>When data quality, lineage, and auditability are critical, the Medallion Architecture provides a structured framework to enforce consistency and traceability.</p></li></ul></li><li><p><strong>Iterative Analytics and ML:</strong></p><ul><li><p>Teams requiring iterative refinement of data for analytics, BI, or machine learning can leverage the Silver and Gold layers for optimized datasets.</p></li></ul></li><li><p><strong>Regulatory Compliance:</strong></p><ul><li><p>Industries like finance, healthcare, or retail that need to retain raw data for compliance benefit from the Bronze layer&#8217;s archival capabilities.</p></li></ul></li></ol><h2>When to Avoid the Medallion Architecture</h2><p>While powerful, the Medallion Architecture isn&#8217;t a one-size-fits-all solution:</p><ol><li><p><strong>Small-Scale or Simple Workflows:</strong></p><ul><li><p>For small datasets or simple ETL processes, the overhead of managing three layers may outweigh the benefits. 
A single-layer pipeline or a traditional data warehouse might suffice.</p></li></ul></li><li><p><strong>Low Data Variety:</strong></p><ul><li><p>If data sources are uniform and require minimal transformation, a simpler architecture (e.g., direct loading into a data warehouse) may be more efficient.</p></li></ul></li><li><p><strong>Limited Resources:</strong></p><ul><li><p>Organizations with constrained budgets or expertise may struggle with the complexity of implementing and maintaining the architecture.</p></li></ul></li><li><p><strong>Real-Time Processing with Minimal Latency:</strong></p><ul><li><p>For use cases requiring ultra-low latency (e.g., real-time fraud detection), the multi-layer processing may introduce unacceptable delays unless optimized with streaming frameworks.</p></li></ul></li></ol><h2>Benefits of Using the Medallion Architecture</h2><ol><li><p><strong>Better Data Quality:</strong></p><ul><li><p>Data transformations happen gradually, allowing validation rules, schema enforcement, and anomaly detection in the Silver layer before data becomes business-critical.</p></li></ul></li><li><p><strong>Reproducibility and Traceability:</strong></p><ul><li><p>By keeping Bronze data as an immutable archive, you can replay pipelines or fix issues downstream without data loss.</p></li></ul></li><li><p><strong>Decoupling:</strong></p><ul><li><p>Different teams can consume data from Bronze (for exploration), Silver (for operational reporting), or Gold (for executive dashboards) without stepping on each other&#8217;s pipelines.</p></li></ul></li><li><p><strong>Supports Multiple Use Cases:</strong></p><ul><li><p>It fits both batch and near-real-time ingestion patterns, making it easier to evolve your data platform as business needs change.</p></li></ul></li></ol><h2>Trade-offs: What You Should Know</h2><ul><li><p><strong>Increased Complexity:</strong></p><ul><li><p>Managing multiple layers requires robust pipeline orchestration, monitoring, and governance, increasing 
operational overhead.</p></li></ul></li><li><p><strong>Higher Costs:</strong></p><ul><li><p>Storing data across multiple layers (especially in Bronze and Silver) can increase storage and compute costs, particularly without proper lifecycle management.</p></li></ul></li><li><p><strong>Latency:</strong></p><ul><li><p>The multi-layer processing can introduce latency, making it less suitable for real-time applications unless optimized with streaming tools like Apache Spark or Delta Live Tables.</p></li></ul></li><li><p><strong>Maintenance Overhead:</strong></p><ul><li><p>Schema evolution, data drift, or changes in business logic require updates across layers, which can be time-consuming.</p></li></ul></li><li><p><strong>Skill Requirements:</strong></p><ul><li><p>Implementing the architecture effectively requires expertise in distributed systems, data engineering, and cloud platforms.</p></li></ul></li></ul><p><strong>Example:</strong> In a real-time use case where you need data in dashboards within seconds, having three separate transformations might introduce too much latency.</p><h2>Archiving Decisions in the Medallion Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lt_c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lt_c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 424w, 
https://substackcdn.com/image/fetch/$s_!Lt_c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 848w, https://substackcdn.com/image/fetch/$s_!Lt_c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!Lt_c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lt_c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png" width="1080" height="1350" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1350,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/169286216?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Lt_c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 424w, https://substackcdn.com/image/fetch/$s_!Lt_c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 848w, https://substackcdn.com/image/fetch/$s_!Lt_c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!Lt_c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce256f6-1488-47fd-8154-cd5f0cb19505_1080x1350.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>How the Medallion Architecture Enhances Data Quality</h2><p>Data quality is a cornerstone of effective analytics, and the Medallion Architecture provides a structured approach to ensure high-quality data:</p><ol><li><p><strong>Bronze Layer: Data Integrity:</strong></p><ul><li><p>By ingesting raw data without transformation, the Bronze layer preserves the original data&#8217;s integrity.</p></li><li><p>Schema-on-read and flexible storage formats (e.g., Delta, Parquet) allow validation of data types and structures during ingestion.</p></li><li><p>Archiving raw data ensures a fallback for reprocessing if quality issues arise later.</p></li></ul></li><li><p><strong>Silver Layer: Cleansing and Validation:</strong></p><ul><li><p>The Silver layer enforces data quality through transformations like:</p><ul><li><p><em>Deduplication</em>: Removing duplicate records to ensure consistency.</p></li><li><p><em>Data Validation: </em>Applying rules to check for missing values, outliers, or invalid formats.</p></li><li><p><em>Normalization</em>: Standardizing formats (e.g., dates, currencies) for consistency.</p></li><li><p><em>Enrichment</em>: Joining with reference data to enhance completeness.</p></li></ul></li><li><p>Quality checks (e.g., using tools like Great Expectations or Delta Live Tables) can be integrated to flag or quarantine problematic data.</p></li><li><p>Data lineage is maintained, allowing traceability to identify where quality issues originate.</p></li></ul></li><li><p><strong>Gold Layer: Business-Ready Data:</strong></p><ul><li><p>The Gold layer aggregates and curates data for specific use cases, ensuring it meets 
business requirements (e.g., accuracy for financial reporting).</p></li><li><p>Aggregations and joins reduce noise, improving reliability for analytics and ML.</p></li><li><p>Governance policies (e.g., access controls, data masking) ensure compliance and protect sensitive data.</p></li></ul></li><li><p><strong>End-to-End Quality Benefits:</strong></p><ul><li><p><strong>Progressive Refinement:</strong> Each layer builds on the previous one, incrementally improving quality through validation, cleansing, and curation.</p></li><li><p><strong>Reusability:</strong> High-quality Silver and Gold datasets can be reused across teams, reducing redundant processing and errors.</p></li><li><p><strong>Auditability:</strong> The layered approach supports tracking data quality issues back to their source, facilitating root-cause analysis.</p></li><li><p><strong>Consistency:</strong> Standardized transformations in the Silver layer ensure consistent data for all downstream consumers.</p></li></ul></li></ol><h2><strong>Conclusion</strong></h2><p>Medallion Architecture is a powerful, proven design pattern that can elevate your data quality, governance, and reproducibility. But it also comes with cost and complexity, so it isn&#8217;t for everyone.</p><p>If you need to scale your data platform, serve many consumers, or enforce data contracts, it&#8217;s likely a great fit. But for simpler or low-latency workloads, a leaner approach might work better.</p>]]></content:encoded></item><item><title><![CDATA[Demystifying the Internal Workings of HashMap's put and get Methods]]></title><description><![CDATA[HashMaps are a cornerstone of efficient programming, allowing us to store and retrieve key-value pairs with remarkable speed.]]></description><link>https://lakshmimahabaleshwara.substack.com/p/demystifying-the-internal-workings</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/demystifying-the-internal-workings</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Tue, 22 Jul 2025 16:21:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!X4dW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>HashMaps</strong> are a cornerstone of efficient programming, allowing us to store and retrieve key-value pairs with remarkable speed. But what happens under the hood when you call <em>put</em> or <em>get</em> on a HashMap? In this blog post, we&#8217;ll explore the internal workings of these two essential methods, breaking them down step-by-step with the help of flowcharts. </p><h2>What is a HashMap?</h2><p>Imagine a library catalog where books are organized by a unique code. You provide the code, and the system quickly directs you to the exact shelf and slot where your book resides. 
</p><p>A <strong>HashMap</strong> works similarly; it&#8217;s a data structure that stores data as key-value pairs, using a process called <em>hashing</em> to map keys to specific locations in an underlying array, known as the bucket array. Each position in this array is referred to as a bucket, and multiple entries can end up in the same bucket due to collisions, which are handled by chaining the entries in a linked list (since Java 8, a bucket that grows long is converted to a balanced tree). A HashMap&#8217;s search, insert, and delete operations run in O(1) time on average; they slow down only when many keys collide in the same bucket.</p><h2>The put Method: Storing Data in a HashMap</h2><p>The put method is how we insert or update a key-value pair in a HashMap. Calling <em>map.put("key", "value")</em> triggers a series of steps that ensure the data is stored efficiently, even when collisions occur. Here&#8217;s how it works, as illustrated in the flowchart:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X4dW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X4dW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!X4dW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!X4dW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!X4dW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X4dW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0447d36-9898-43df-abc4-dff95713f231_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:425541,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/168770848?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X4dW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!X4dW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 848w, 
https://substackcdn.com/image/fetch/$s_!X4dW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!X4dW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0447d36-9898-43df-abc4-dff95713f231_1920x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Step 1: Calculate the Hash Code</h2><ul><li><p><strong>What happens:</strong> The process starts by computing the hash 
code of the key using <em>"key".hashCode()</em>. In Java, every object inherits this method from <em>Object</em>, and it returns an integer derived from the key.</p></li><li><p><strong>Why it matters: </strong>The hash code is the first step in determining where the key-value pair will live in the HashMap.</p></li></ul><h2>Step 2: Find the Bucket Index</h2><ul><li><p><strong>What happens: </strong>The hash code is used to calculate the bucket index with the formula:<br><em>index = hashCode(key) &amp; (size - 1)</em><br>Here, size is the number of buckets (16 by default in Java&#8217;s HashMap), and &amp; is a bitwise AND operation. Because size is always a power of two, the AND keeps only the low bits of the hash code, which works like a modulo by size but is cheaper to compute. For a size of 16, this becomes hashCode(key) &amp; 15, ensuring the index falls between 0 and 15. (Java&#8217;s HashMap first XORs the hash code with its own upper 16 bits, so the high bits also influence the index.)</p></li><li><p><strong>Why it matters: </strong>This step pinpoints the bucket where the key-value pair should be stored, keeping the mapping consistent.</p></li></ul><h2>Step 3: Check for Hash Collisions</h2><ul><li><p><strong>What happens: </strong>The HashMap checks if the bucket at the calculated index is already occupied. This is a decision point:</p><ul><li><p>No collision: The bucket is empty.</p></li><li><p>Collision: The bucket contains one or more entries (a linked list).</p></li></ul></li><li><p><strong>Why it matters: </strong>Collisions happen when different keys produce the same hash code, or different hash codes that map to the same index. The HashMap must handle this gracefully.</p></li></ul><h3>If No Collision (Step 4)</h3><ul><li><p><strong>What happens:</strong> If the bucket is empty, the key-value pair is added as the first node in a new linked list for that bucket. 
The process ends here.</p></li><li><p><strong>Why it matters: </strong>This is the simplest scenario, making insertion quick and straightforward.</p></li></ul><h3>If Collision Occurs (Step 5)</h3><ul><li><p><strong>What happens: </strong>If the bucket has a linked list, the HashMap checks if the key already exists by comparing it to each node&#8217;s key using <strong>"key".equals(existingKey).</strong></p><ul><li><p>Key exists: Move to Step 7.</p></li><li><p>Key doesn&#8217;t exist: Move to Step 6.</p></li></ul></li><li><p><strong>Why it matters: </strong>This ensures the HashMap either updates an existing entry or adds a new one without duplicates.</p></li></ul><h3>Key Doesn&#8217;t Exist (Step 6)</h3><ul><li><p><strong>What happens:</strong> If no matching key is found, the key-value pair is added as the next node in the linked list.</p></li><li><p><strong>Why it matters: </strong>The linked list grows to accommodate multiple entries in the same bucket, resolving the collision.</p></li></ul><h3>Key Exists (Step 7)</h3><ul><li><p><strong>What happens: </strong>If the key is found, the existing value is replaced with the new value provided in <em>put</em>.</p></li><li><p><strong>Why it matters: </strong>This allows the HashMap to update values for existing keys efficiently.</p></li></ul><p>Example in Java</p><pre><code><code>HashMap&lt;String, String&gt; map = new HashMap&lt;&gt;(); 
map.put("apple", "red");  // New entry in bucket 
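// Added illustration, not part of the original example: Step 2's formula, assuming a 16-bucket table.
// Real implementations may mix the hash bits first, so treat this as a simplified sketch.
System.out.println("apple".hashCode() &amp; (16 - 1)); // prints an index between 0 and 15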
map.put("banana", "yellow"); // New entry; may land in the same bucket as "apple" (collision)
map.put("apple", "green"); // Key already exists, so "red" is replaced with "green"</code></code></pre><h2>The get Method: Retrieving Data from a HashMap</h2><p>The get method retrieves the value associated with a key, as in <em>map.get("key")</em>. It&#8217;s a search operation that mirrors parts of put but focuses on finding and returning data. Let&#8217;s break it down:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GTSP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GTSP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!GTSP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!GTSP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!GTSP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GTSP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:436259,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/168770848?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GTSP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!GTSP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!GTSP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GTSP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f70749-92a0-4941-b9da-930d4eec363b_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Step 1: Calculate the Hash Code</h2><ul><li><p><strong>What happens:</strong> Just like put, it starts with <em>"key".hashCode()</em> to get the hash code.</p></li><li><p><strong>Why it matters:</strong> The hash code identifies the potential location of the key.</p></li></ul><h2>Step 2: Find the Bucket Index</h2><ul><li><p><strong>What happens:</strong> The 
bucket index is computed using <em>hashCode(key) &amp; (size - 1).</em></p></li><li><p><strong>Why it matters: </strong>This directs the HashMap to the correct bucket to begin the search.</p></li></ul><h2>Step 3: Check the First Node</h2><ul><li><p><strong>What happens:</strong> The HashMap compares the key with the first node&#8217;s key in the bucket&#8217;s linked list using <em>"key".equals(firstNodeKey).</em></p></li><li><p><strong>Why it matters: </strong>This is the entry point to the bucket&#8217;s data.</p></li></ul><h2>Step 4: Are the Keys Equal?</h2><ul><li><p><strong>What happens:</strong> If the keys match, move to Step 5. If not, proceed to Step 6.</p></li><li><p><strong>Why it matters:</strong> This determines whether we&#8217;ve found the target key immediately or need to keep looking.</p></li></ul><h3>Key Found (Step 5)</h3><ul><li><p><strong>What happens:</strong> If the keys match, the value from that node is returned, and the process ends.</p></li><li><p><strong>Why it matters:</strong> This is the goal&#8212;fast retrieval of the value.</p></li></ul><h3>Check More Nodes (Step 6)</h3><ul><li><p><strong>What happens:</strong> If the keys don&#8217;t match, the HashMap checks if there&#8217;s another node in the linked list.</p><ul><li><p>Yes: Loop back to Step 4 for the next node.</p></li><li><p>No: Move to Step 7.</p></li></ul></li><li><p><strong>Why it matters:</strong> The linked list must be traversed fully to ensure the key isn&#8217;t missed.</p></li></ul><h3>Key Not Found (Step 7)</h3><ul><li><p><strong>What happens:</strong> If no matching key is found after checking all nodes, null is returned.</p></li><li><p><strong>Why it matters:</strong> This handles cases where the key doesn&#8217;t exist in the HashMap.</p></li></ul><p>Example in Java</p><pre><code><code>HashMap&lt;String, String&gt; map = new HashMap&lt;&gt;();
map.put("apple", "red");
System.out.println(map.get("apple")); // Outputs: red
System.out.println(map.get("orange")); // Outputs: null</code></code></pre><h2><strong>Key Takeaways</strong></h2><ul><li><p><strong>Efficient Lookups:</strong> HashMap offers average-case constant-time get and put operations by using the hash code to jump straight to the right bucket.</p></li><li><p><strong>Collisions Handled Gracefully:</strong> When collisions occur, HashMap uses a linked list structure within buckets to store multiple items.</p></li><li><p><strong>Overwrites and Insertions:</strong> The <code>put</code> method inserts a new entry, or replaces the existing value if the key is already present.</p></li></ul><p>Understanding these mechanisms helps you write performant, bug-free code, and it clarifies why good <code>hashCode</code> and <code>equals</code> implementations are essential for objects used as keys in a HashMap.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://lakshmimahabaleshwara.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Lakshmi&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Unleashing Parallelism: Effects of Global ORDER BY vs PARTITION BY ORDER BY in High-Volume Data Processing ]]></title><description><![CDATA[Introduction]]></description><link>https://lakshmimahabaleshwara.substack.com/p/unleashing-parallelism-effects-of</link><guid isPermaLink="false">https://lakshmimahabaleshwara.substack.com/p/unleashing-parallelism-effects-of</guid><dc:creator><![CDATA[Lakshmi's Notebook]]></dc:creator><pubDate>Sat, 19 Jul 2025 12:29:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BCBI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>In large-scale data environments, query performance and scalability hinge on how well operations can be parallelized. Key SQL clauses such as ORDER BY and window functions, especially those combining PARTITION BY and ORDER BY, play a central role in determining parallel processing efficiency across big data platforms. This blog explores how ORDER BY works, its role in window functions, and strategies to mitigate its effect on distributed processing.</p><h2>What is ORDER BY?</h2><p>The ORDER BY clause sorts a query&#8217;s result set by one or more columns, either ascending (ASC) or descending (DESC). In Big Data, this is a global operation requiring all data to be sorted across the cluster. 
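</p><p>To make the cost of a global sort concrete, here is a minimal pure-Python sketch. It is not Spark or Hive code: the <em>partitions</em> data and both helper functions are made up for illustration, with each inner list standing in for the rows held by one node.</p><pre><code><code>import heapq

# Hypothetical salary rows, already spread across three simulated nodes.
partitions = [
    [52000, 88000, 61000],
    [97000, 45000],
    [73000, 120000, 58000],
]


def sort_within_partitions(parts):
    """Each node sorts its own rows; no node needs data from another."""
    return [sorted(rows, reverse=True) for rows in parts]


def global_order(parts):
    """A total order needs one merge step that reads from every node."""
    locally_sorted = sort_within_partitions(parts)
    # heapq.merge expects inputs that are already sorted in the same direction.
    return list(heapq.merge(*locally_sorted, reverse=True))


local_result = sort_within_partitions(partitions)
global_result = global_order(partitions)

print(local_result)
print(global_result)</code></code></pre><p>The per-node sorts never look at another node's rows, so they could all run at the same time. The total order only appears after the merge step, which has to read from every node, and that is the part a global ORDER BY cannot avoid.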
</p><p>For Example:</p><pre><code><code>SELECT * FROM employees ORDER BY salary DESC;</code></code></pre><p>This sorts the entire dataset by salary. In Hive, that means shuffling every row to a single reducer; Spark instead range-partitions the rows across tasks, but it still has to shuffle the whole dataset. Either way, the global sort limits parallelism.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BCBI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BCBI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 424w, https://substackcdn.com/image/fetch/$s_!BCBI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 848w, 
https://substackcdn.com/image/fetch/$s_!BCBI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 1272w, https://substackcdn.com/image/fetch/$s_!BCBI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BCBI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png" width="728" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:728,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/168655452?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BCBI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 424w, 
https://substackcdn.com/image/fetch/$s_!BCBI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 848w, https://substackcdn.com/image/fetch/$s_!BCBI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 1272w, https://substackcdn.com/image/fetch/$s_!BCBI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333abf7a-fe27-4899-aa0f-534cbe7fc4aa_728x683.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Window Functions with ORDER BY </h2><p>Window functions perform calculations across a set of rows (a "window") related to the current row. The PARTITION BY clause groups data into partitions, and ORDER BY sorts rows within each partition.</p><p>For Example:</p><pre><code><code>SELECT
  employee_id,
  department_id,
  salary,
  RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank_in_dept
FROM employees;</code></code></pre><p>Here, PARTITION BY department_id groups rows by department, and ORDER BY salary DESC sorts within each partition to assign ranks.</p><h3>Key Components of Window Functions</h3><ul><li><p><strong>PARTITION BY:</strong> Divides the dataset into partitions, allowing parallel processing within each partition.</p></li><li><p><strong>ORDER BY:</strong> Specifies the sorting order within each partition for the window function.</p></li><li><p>Window Function: Examples include <strong>ROW_NUMBER(), RANK(), DENSE_RANK(), SUM(), AVG(),</strong> etc.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uWgI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uWgI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 424w, https://substackcdn.com/image/fetch/$s_!uWgI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 848w, https://substackcdn.com/image/fetch/$s_!uWgI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!uWgI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!uWgI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png" width="704" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:704,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lakshmimahabaleshwara.substack.com/i/168655452?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uWgI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 424w, https://substackcdn.com/image/fetch/$s_!uWgI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 848w, https://substackcdn.com/image/fetch/$s_!uWgI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!uWgI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6864583-33b2-4fae-8487-8be2b9bb23be_704x1020.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Impact on Parallelism</h2><p>In distributed systems, parallelism enables scalability by processing data across multiple nodes. 
However, ORDER BY can reduce parallelism:</p><ol><li><p><strong>Global ORDER BY:</strong></p><ul><li><p>Requires a full shuffle of the dataset. Hive sends every row to a single reducer for sorting; Spark range-partitions the rows across tasks so that the output is ordered from one partition to the next.</p></li><li><p>Creates a bottleneck on large datasets: with a single reducer, all rows are funneled to one node and the sort stage loses its parallelism.</p></li></ul></li><li><p><strong>ORDER BY with PARTITION BY:</strong></p><ul><li><p>Partitions allow parallel processing, as each partition is handled independently.</p></li><li><p>Sorting within partitions still requires computational effort, and skewed partitions (uneven sizes) can overload some nodes, reducing parallelism.</p></li></ul></li></ol><p>Example:</p><pre><code><code>SELECT
  employee_id,
  department_id,
  salary,
  SUM(salary) OVER (PARTITION BY department_id ORDER BY salary) AS running_total
FROM employees;</code></code></pre><p>This query computes a running total within each department, sorted by salary. Partitions are processed in parallel, but sorting within large or skewed partitions adds overhead.</p><h2>Practical Example in Apache Spark</h2><p>Let&#8217;s illustrate with a Spark example. Suppose you have a dataset of sales transactions:</p><pre><code><code>from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, col  # note: this sum shadows Python's built-in sum in this script

# Initialize Spark session
spark = SparkSession.builder.appName("WindowFunctionExample").getOrCreate()

# Sample data
data = [
    (1, "Electronics", 1000),
    (2, "Electronics", 1500),
    (3, "Clothing", 500),
    (4, "Clothing", 700),
    (5, "Electronics", 2000)
]
df = spark.createDataFrame(data, ["sale_id", "department", "amount"])

# Define window specification
window_spec = Window.partitionBy("department").orderBy(col("amount").desc())

# Calculate running total within each department
df_with_running_total = df.withColumn(
    "running_total",
    sum("amount").over(window_spec)
)

# Show results
df_with_running_total.show()</code></code></pre><p>Output:</p><pre><code><code>+-------+-----------+------+-------------+
|sale_id| department|amount|running_total|
+-------+-----------+------+-------------+
|      5|Electronics|  2000|         2000|
|      2|Electronics|  1500|         3500|
|      1|Electronics|  1000|         4500|
|      4|   Clothing|   700|          700|
|      3|   Clothing|   500|         1200|
+-------+-----------+------+-------------+</code></code></pre><p><strong>Analysis</strong></p><ul><li><p><strong>Partitioning: </strong>The partitionBy("department") ensures that data is grouped by department, allowing parallel processing of "Electronics" and "Clothing" partitions.</p></li><li><p><strong>Sorting:</strong> The orderBy(col("amount").desc()) sorts rows within each partition, which is less resource-intensive than a global sort.</p></li><li><p><strong>Parallelism: </strong>Spark processes each partition independently, but sorting within partitions still requires computational resources. If the "Electronics" partition were much larger, it could slow down the overall job.</p></li></ul><h2>Best Practices for Using ORDER BY in Big Data</h2><ol><li><p><strong>Minimize Global ORDER BY:</strong> Avoid global ORDER BY unless absolutely necessary. Use PARTITION BY to limit sorting to smaller subsets.</p></li><li><p><strong>Monitor Data Skew:</strong> Use tools like Spark&#8217;s UI to check partition sizes and ensure even distribution.</p></li><li><p><strong>Tune Spark Configurations:</strong> Adjust <em>spark.sql.shuffle.partitions </em>to control the number of partitions during shuffling, balancing parallelism and overhead.</p></li><li><p><strong>Profile Queries:</strong> Use query execution plans (e.g., EXPLAIN in Spark or Hive) to identify bottlenecks caused by sorting or shuffling.</p></li><li><p><strong>Limit Sorted Data:</strong> Apply filters (e.g., WHERE clauses) to reduce the dataset size before sorting.</p></li><li><p><strong>Optimize Data Distribution:</strong> Ensure data is pre-partitioned or distributed evenly to avoid skew. 
For example, in Spark, use <em>repartition()</em> in the DataFrame API or <em>DISTRIBUTE BY</em> in Spark SQL to balance data before applying window functions.</p></li><li><p><strong>Use Appropriate Hardware:</strong> Increase the number of nodes or cores to handle sorting tasks in parallel, especially for large datasets.</p></li></ol><h2>Conclusion</h2><p><strong>ORDER BY</strong> in Big Data reduces parallelism due to shuffling and sorting, especially in global operations. Using <strong>PARTITION BY</strong> in window functions helps by enabling parallel processing within partitions, but sorting overhead and data skew can still impact performance. Optimize with careful partitioning, filtering, and tuning to balance functionality and scalability.</p>]]></content:encoded></item></channel></rss>