As I mentioned earlier, I have transitioned from Ericsson to Shopify. As part of this transition, I started to get a taste of public transportation (work used to be a 15-minute drive from home; now, working in the city center, I have to take the train to commute). This morning a woman sitting beside me was working on her computer, apparently editing a document, or more probably writing comments in it. A few minutes into her edits, she made a phone call, asking someone to change some wording in the document and painfully dictating those changes (a few words). This cycle repeated a few times: writing comments in the document, then calling someone to make the appropriate edits. What could have been a simple edit, followed by sending the edited document via email, became an apparently painful exercise in dictation. The point is not to figure out why she was not sending the document via email; it could be as simple as not having a data plan and not wanting to wait for a wifi connection, who knows. But the use of a non-automated “process” turned something ordinarily quite simple (editing a couple of sentences in a document) into a painful dictation experience. It also limited the bandwidth: only a few comments could make their way into corrections on that document.
This reminded me of a conversation I had with a friend some time ago. He mentioned the pride he took in having put in place a data pipeline at his organization for the extraction, transformation, and storage of two data sources into a local database. Some data is generated by a system in his company. Close to that data source, he has a server which collects and reduces/transforms the data and stores the results in the local file system as text files. Every day he checks the extraction process on that server to make sure it is still running, and every few days he downloads the new text files from that server to the server farm and database he usually uses to perform his analysis on the data. As you can see, this too is a painful, non-automated process. As a consequence, the amount of data is most probably more limited than it could be with an automated process, since my friend has to cater to the needs of those pipelines manually.
At Shopify I have the pleasure of having access to an automated ETL (Extract, Transform, and Load) process for the data I may want to analyze. If you want to get a feel for what is available at Shopify with respect to ETL, I invite you to watch the Data Science at Shopify video presentation by Françoise Provencher, who touches a bit on that as well as on the other aspects of the job of a data scientist at Shopify. In short, we use pyspark with custom libraries developed by our data engineers to extract, transform, and load data from our sources into a front room database which anyone in the company can use to get information and insight about our business. If you listen through Françoise’s video, you will understand that one of the benefits of that automated ETL scheme is that we transform the raw data (mostly unusable) into information that we store in the front room database. This information is then available for further processing to extract valuable insight for the company. You immediately see the benefit. Once such a pipeline is established, it performs its work autonomously and, as an added benefit thanks to our data engineering team, monitors itself all the time. Obviously, if something goes wrong somebody will have to act and correct the situation, but otherwise you can forget about the pipeline, and its always-up-to-date data is available to all. No single person needs to spend a sizable amount of time monitoring and manually importing data. A corollary is that the bandwidth for new information is quite high, and we get a lot of information on which we can do our analysis.
Having that much information at our fingertips brings new challenges, mostly not encountered by those with manual pipelines. It becomes increasingly difficult and inefficient to do analysis on the whole population. You need to start thinking in terms of samples. There are two considerations to keep in mind when you sample: the sample size, and what you are going to sample.
There are mathematical and analytical ways to determine your sample size, but a quick way to get it right is to start with a modest random sample, perform your analysis, look at your results, and keep them. Then redo the cycle a few times and check whether you keep getting the same results. If your results vary wildly, you probably do not have a big enough sample; otherwise you are good. If it is important for future repeatability to be as efficient as possible, you can try to reduce your sample size until your results start to vary (at which point you should revert to the previous sample size), but if not, good enough is good enough! Just remember that those samples must be random! If you redo your analysis using the same sample over and over again, you haven’t proven anything. In SQL terms, it is the difference between:
SELECT * FROM table TABLESAMPLE BERNOULLI(10)
which would produce a random sample of roughly 10% of table, and on the other hand:
SELECT * FROM table LIMIT 1000
which will most likely always return the same first 1000 rows… this is not a random sample!
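The trial-and-error procedure above can be sketched in a few lines of Python. Everything here is invented for illustration: a made-up population, a mean as the stand-in “analysis”, and an arbitrary 5% tolerance for deciding that repeated samples agree.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: one numeric measurement per record.
population = [random.randint(1, 8) for _ in range(100_000)]

def stable_enough(population, sample_size, trials=5, tolerance=0.05):
    """Redo the analysis (here, a simple mean) on several fresh random
    samples and check whether the results stay within a tolerance band."""
    results = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return max(results) - min(results) <= tolerance * statistics.mean(results)

# Start modest and grow the sample until repeated random samples agree.
size = 100
while size < len(population) and not stable_enough(population, size):
    size *= 2
print(f"settled on a sample size of {size}")
```

Note the use of `random.sample` for each trial: every iteration draws a fresh random sample, which is exactly what the `TABLESAMPLE` query above gives you and the `LIMIT` query does not.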
The other consideration to keep in mind is what you should sample: what is the population you want to observe? To use an example in the lingo of Shopify, say I have a table of all my merchants’ customers which, among other things, contains a foreign key to an orders table. If I want to get a picture of how many orders a customer performs, the population under observation is the customers, not the orders. In other words, in that case I should randomly sample my customers, then look up how many orders each has. I should not sample the orders, aggregate them per customer, and hope this will produce the expected results.
Visually, we can see that sampling from orders will lead us to wrongly think each customer performs on average two orders. Random resampling will lead to the same erroneous result.
Whereas sampling from customers will lead to the correct answer: each customer performs on average four orders.
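To make the difference concrete, here is a small simulation. The numbers (1,000 customers with exactly four orders each, 25% samples) are invented for illustration; the point is that sampling the orders misses most of each customer's orders and deflates the per-customer average, while sampling the customers does not.

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical data: 1,000 customers with exactly 4 orders each,
# so the true average is 4 orders per customer.
orders = [(customer_id, n) for customer_id in range(1000) for n in range(4)]

# Wrong: sample 25% of the orders, then aggregate per customer.
# Most of each customer's orders are missed, deflating the average.
order_sample = random.sample(orders, len(orders) // 4)
counts = Counter(customer_id for customer_id, _ in order_sample)
biased_avg = sum(counts.values()) / len(counts)

# Right: sample 25% of the customers, then count all of their orders.
customer_sample = random.sample(range(1000), 250)
orders_per_customer = Counter(customer_id for customer_id, _ in orders)
correct_avg = sum(orders_per_customer[c] for c in customer_sample) / 250

print(biased_avg)   # well under the true value of 4
print(correct_avg)  # exactly 4.0
```

No amount of resampling fixes the first approach: it averages over whatever fraction of each customer's orders happened to land in the sample, not over the customers themselves.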
To summarize things, let’s just say that if you have manual (or even semi-manual) ETL pipelines, you need to automate them to gain consistency and throughput. Once this is done, you will eventually discover the joys (and necessity) of sampling. When sampling, make sure you select the proper population to sample from and that your sample is randomly selected. Finally, you could always determine the proper sample size analytically, but with a few trials you will most probably be just fine if your findings stay consistent across a number of random samples.
Cover photo by Stefan Schweihofer at Pixabay.