Shopify E-commerce Store Item Distribution: How We Automated Our Marketing Research

Posted by SellerPanda, April 8, 2015


Here at SellerPanda, we build apps for e-commerce, mostly for Shopify. We get our ideas from all kinds of places, but mostly from…talking to users. We figure that if one user tells us he could use an app, there’s a pretty good chance that at least a few more users out there feel the same way. In fact, if he could verbalize this need, then there are probably quite a few others out there who have the same need but can’t verbalize it.

Still, this is a very fuzzy assumption to go on. We’re a startup. We only have one developer. If he takes a month to build an app which only 50 users install, we’re going to have to eat our shoe like Charlie Chaplin.

(didn’t validate market size)

So we try to figure out a way to prove that there is a market for the app before we put any development time into it at all. Before we even make mockups. Preferably, automatically, with a minimum of effort. We’d also like to figure out how much we should charge for the app, if possible. I’d like to show how we do that in this article.

In this case, we were doing research for our next app, which lets Shopify store owners edit all their items on a single page, without loading and saving each one separately. Right now, this is a big pain: you have to load each item, edit it, save it, then select the next item, and so forth. Try doing that with 50 items. So the app’s utility should increase roughly linearly with the number of items, meaning that someone with 100 items in his store will find it twice as useful as someone with 50, and so forth.

So, we said, let’s find out the distribution of items in Shopify stores.

How the hell do we do that?

Well, we got a comprehensive list of all sites built using Shopify from our friends at BuiltWith. There are 100,000 active Shopify sites (interestingly, 150,000 sites have been built with the platform.)

So we had a .csv file with 100K URLs. What now?

We very briefly thought about hiring someone from overseas to manually go and count products, but then dismissed the idea as insane. In retrospect, we could have gotten an acceptable result, with a margin of error, if we’d had them count a sample, say, 1000 sites, and then extrapolated. But we didn’t want to do that. It wasn’t really elegant.
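For the record, the sampling approach would have looked something like this: count items on a random sample of sites, then extrapolate the fraction to all 100K with a margin of error. The numbers below are purely hypothetical, just to illustrate the arithmetic.

```php
<?php
// Back-of-the-envelope sketch of the sampling idea we dismissed.
// $p here is a made-up example value, not a real measurement.
$N = 100000;  // total Shopify sites
$n = 1000;    // sampled sites
$p = 0.30;    // hypothetical: fraction of sampled sites with 50+ items

// 95% margin of error for a proportion: 1.96 * sqrt(p(1-p)/n)
$moe = 1.96 * sqrt($p * (1 - $p) / $n);

printf("Estimate: %d sites with 50+ items, +/- %d\n",
    round($N * $p), round($N * $moe));
```

With a 1,000-site sample the margin of error is under three percentage points, which would have been plenty accurate for a go/no-go decision.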

What if, we said, we had a robot crawl each site and download all its data? Then we could have some kind of search engine analyze the data and spit out our results!

Some time spent looking at the Shopify API told us that making a “collections” call would return a listing of the site’s collections. And each collection would, frequently, have the number of products or items listed underneath it. Of course, the sites were not standardized, and would not behave in a uniform fashion. But generally, we saw that about 80% of the ones we looked at followed a certain pattern.

So we spent a few days attempting to set up Nutch and Solr, an open source crawler and search engine. I think of this period as my descent into Apache hell. It’s got cores. And other stuff. And books and books written about it. I’m sure that to our developer, this would have been a breeze to figure out, but the whole point of this process was to save developer time.

Nutch-Solr integration

(It’s…easy. Yeah.)

Eventually, we just said, the hell with it. Building our own parser would be easier than attempting to configure Nutch and Solr. So we did.

The parser takes up less than 100 lines of PHP[1]. All it does is:

  • Opens the .csv file with all the URLs.
  • Reads each line and pulls out its base URL.
  • Appends the API “collections” path to the URL and fetches the resulting HTML as a string with cURL.
  • Checks the HTML string for color swatches (we just threw this in to see how many customers might be able to use our Swatchify app).
  • Checks the string for substrings like “item” or “product” or variations thereof using regex.
  • Counts the items or products or what have you, looping through the string.
  • Saves the total in an SQL database.
  • Moves on to the next URL.

This might not have been the fastest or most elegant way to get the job done, but it got it done.

Once the parser had run through every Shopify site on the list, we looked at the output. The limit for items on a page with a “collections” call is 250. When we plotted the output in a scatter plot:

Shopify store item distribution
Given that the API cuts us off at 250 items per site, this looks an awful lot like a power law distribution. Actually, it looks like a Pareto distribution. Something like the famous 80-20 rule, where 80% of a society’s wealth is owned by 20% of the population, and then of that 80%, a further 80% is owned by 20% of that 20%, and so on.
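As a quick sanity check on the 80-20 claim: for a Pareto distribution with shape parameter alpha, the top fraction p of the population holds p^(1 − 1/alpha) of the total, and the classic 80-20 split corresponds to alpha = log(5)/log(4) ≈ 1.16. A few lines confirm the arithmetic:

```php
<?php
// Under a Pareto distribution with shape $alpha, the top $p fraction
// of stores holds pow($p, 1 - 1/$alpha) of all items.
// alpha = log(5)/log(4) is the value that reproduces the 80-20 rule.
$alpha = log(5) / log(4);          // ≈ 1.16
$p = 0.20;                         // top 20% of stores
$share = pow($p, 1 - 1 / $alpha);  // fraction of items they hold

printf("Top %.0f%% of stores hold %.1f%% of items\n",
    $p * 100, $share * 100);
// → Top 20% of stores hold 80.0% of items
```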

So what can we understand from this? About 30,000 stores clear the threshold above which our app would be minimally useful, i.e., 50 items in the store. About 20,000 have 100 items or more. Approximately 12,000 have 250 items or more. We could extrapolate the rest of the curve if necessary, using MATLAB and similar tools, but there’s no need. We know that there is a significant population of users for our app, and within that population, a significant subpopulation who would find the app extremely useful. We can thus set price points and targets fairly precisely, knowing our serviceable available market quite accurately.
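You don’t actually need MATLAB for the extrapolation, either. A power law is a straight line in log-log space, so an ordinary least-squares fit on the three thresholds above lets you estimate counts past the API’s 250-item cutoff. The 500-item figure this produces is an illustrative estimate from the fit, not a measured number:

```php
<?php
// Fit log(store count) vs. log(item threshold) by least squares,
// then extrapolate beyond the 250-item API cutoff.
// Points taken from the distribution described above.
$points = array(50 => 30000, 100 => 20000, 250 => 12000);

$xs = array_map('log', array_keys($points));
$ys = array_map('log', array_values($points));
$n = count($xs);
$xMean = array_sum($xs) / $n;
$yMean = array_sum($ys) / $n;

$num = 0; $den = 0;
for ($i = 0; $i < $n; $i++) {
    $num += ($xs[$i] - $xMean) * ($ys[$i] - $yMean);
    $den += ($xs[$i] - $xMean) * ($xs[$i] - $xMean);
}
$slope = $num / $den;                    // power-law exponent (negative)
$intercept = $yMean - $slope * $xMean;

// Illustrative estimate: stores with at least 500 items.
$estimate = exp($intercept + $slope * log(500));
printf("Fitted exponent: %.2f, stores with 500+ items: ~%d\n",
    $slope, round($estimate));
```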

All of that before an hour of developer’s or designer’s time has gone into the product.



[1] I’m certain that this code is not optimized and could be made much better. If you can use it, here it is:

// Crawl every Shopify store URL in the .csv, count the items listed on
// its /collections page, and save the totals to a MySQL table.
ini_set('max_execution_time', 36000);

$servername = "XXXX";
$username = "XXXX";
$password = "XXXX";
$conn = new PDO("mysql:host=$servername;dbname=XXXXX", $username, $password);

$file = fopen("XXX.csv", "r");
$i = 0;
while (!feof($file)) {
    $arrayline = fgetcsv($file);
    $i++;
    if ($arrayline === false) {
        continue;
    }
    $domain = $arrayline[0];
    $url = $arrayline[3];

    // Skip the header row and malformed URLs.
    if ($i > 1 && strlen($url) > 5) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, 'http://' . $url . '/collections');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $output = curl_exec($ch);
        curl_close($ch);

        // Flag stores whose pages mention color swatches (for Swatchify).
        $swatch = preg_match("/swatch/", $output) ? 1 : 0;

        // Look for "<number> Product(s)" or "<number> Item(s)" patterns.
        $itemTotal = 0;
        $guts = array();
        if (preg_match("/\d\s[Pp]roduct/", $output)) {
            $guts = explode("roduct", $output);
        } elseif (preg_match("/\d\s[Ii]tem/", $output)) {
            $guts = explode("tem", $output);
        }

        // Each fragment ends with the count that preceded the keyword.
        foreach ($guts as $node) {
            if (preg_match("/(\d+)\s+.$/", $node, $matches)) {
                $itemTotal += (int)$matches[1];
            }
        }
        echo 'The item total is ' . $itemTotal . '<br><br>';

        // Insert a new row, or update the existing one for this URL.
        $stmt = $conn->prepare("SELECT id FROM crawler WHERE url = :url");
        $stmt->execute(array(':url' => $url));
        if ($stmt->rowCount() == 0) {
            $stmt = $conn->prepare("INSERT INTO crawler (url, total_items, swatch)
                VALUES (:url, :total_items, :swatch)");
        } else {
            $stmt = $conn->prepare("UPDATE crawler SET total_items = :total_items,
                swatch = :swatch, update_date = sysdate() WHERE url = :url");
        }
        $stmt->bindParam(':url', $url);
        $stmt->bindParam(':total_items', $itemTotal);
        $stmt->bindParam(':swatch', $swatch);
        $stmt->execute();
    }
}
fclose($file);




Baruch Kogan is the marketing and business development lead for SellerPanda, an Israeli startup building e-commerce apps. If you liked this article, please share it on your social network, and check out SellerPanda’s Twitter feed and blog. If you have thoughts to share or questions to ask, please email him. And don’t forget to check out our next app!

