Setting Up a Cloud Server for Data Analysis
Last updated: January 18, 2021
It is often helpful to use a cloud server for long-running analyses, or for working with datasets larger than the memory on your local system. This can be a daunting process if you haven’t done it before. Fortunately there are some great resources available to walk you through this process, which I have compiled here.
Choosing a cloud computing platform
The big 3 cloud computing platforms are Amazon Web Services, Google Cloud, and Microsoft Azure. These all work fine, but can be quite complex to set up.
You may also want to look at Digital Ocean and Linode (these are referral links). The setup process and management interfaces for both are much simpler than AWS and they all cost about the same. I personally use Linode by default when I need cloud computing, but do your own research and price comparisons to figure out what will work best for you – and note there are lots of alternatives to these cloud platforms that I haven’t mentioned here.
Choosing a server
When you provision a cloud server, you will need to make some up-front decisions that can be difficult to change later:
- Server specs: This depends on your workload, but I typically recommend starting with a cheap server (usually $5 or $10/month) to experiment with before moving to a more expensive server. Note that if you want to upgrade your server specs, you may need to rebuild some or all of your setup from scratch.
- Operating system: I typically use the newest version of Ubuntu, which is a very popular flavor of Linux. Because Ubuntu is popular, most software will install easily on it.
- Region: Just pick somewhere reasonably geographically close to you. I use Linode’s Newark, NJ region by default.
Security
In a typical setup, a cloud server has a publicly accessible IP address (this is a number like 45.56.111.42 that identifies your server on the internet). This is what allows you to connect to your server remotely, but also allows anyone else on the internet to access your server.
Cloud servers come with a default username (root), but you choose the password. Choose a strong, unique password (example: eCEmd9LWRgofT2UfHHLm). Hackers are constantly scanning IP addresses for root accounts with bad passwords like password1234, so if you use a bad password and don’t enable other security features, they may be able to gain full access to your server!
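One way to produce a password of that strength is to draw it from the operating system's random source. This is a minimal sketch, assuming a Linux or macOS terminal with the standard head, base64, and tr utilities; the variable name password and the 20-character length are illustrative choices, not requirements.

```shell
# Read 64 random bytes, encode them as text, keep only letters and digits,
# and truncate the result to 20 characters.
password=$(head -c 64 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | head -c 20)
echo "$password"
```

Saving the result in a password manager means you never have to memorize or retype it.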
Security best practices include disabling login from root, requiring SSH key-based authentication, and setting up a firewall. Details are here, here, and here. I strongly recommend reading through those links to at least understand what’s involved. While this all sounds complex, some of these steps are quite simple and will make a big difference in terms of security.
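The first two practices come down to a few lines of SSH server configuration. The sketch below writes them to a local file named hardening.conf (a hypothetical name) rather than changing a live server; PermitRootLogin, PasswordAuthentication, and PubkeyAuthentication are standard OpenSSH server options. It assumes you have already created a non-root user and confirmed that you can log in as that user with an SSH key, since applying these settings without that step can lock you out.

```shell
# Hypothetical output path; override by setting CONF_FILE before running.
conf_file="${CONF_FILE:-hardening.conf}"

# Write the OpenSSH server settings described above to the file.
cat > "$conf_file" <<'EOF'
# Refuse direct logins as root
PermitRootLogin no
# Refuse password logins, so only SSH keys are accepted
PasswordAuthentication no
PubkeyAuthentication yes
EOF

# Show the result for review.
cat "$conf_file"
```

On recent Ubuntu releases, a file like this is typically placed in /etc/ssh/sshd_config.d/ and takes effect once the SSH service is restarted. Follow your provider's guide for the exact steps, and keep your current session open until a new login succeeds.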
Setting up a server
Instructions for:
As of writing, I prefer Linode’s instructions as they are a more complete walkthrough of the process and include running software updates with apt-get update && apt-get upgrade.
Reference materials
One of the reasons I like Digital Ocean and Linode is that they provide a lot of free, well-maintained tutorials and guides for working with cloud servers. I’ve linked to some of these already; here are some more you should look at:
- Linux basics (Digital Ocean)
- SSH clients on Windows (Linode); note that Macs can use the built-in Terminal application.
- Using tmux (Linode)
- Using the nano text editor (Linode)
tmux is especially important. You will typically be interacting with your server via an SSH session, and if this session closes (e.g., you shut your laptop) anything that was running on your server will by default be canceled. tmux is a tool that, among other things, allows you to reconnect to an SSH session if you lose your connection.
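As a rough sketch of that workflow, the commands below start a job inside a detached tmux session and confirm that the session exists. The session name analysis is arbitrary, and sleep 300 is only a stand-in for a real long-running command; this assumes tmux is already installed on the server.

```shell
# Start a detached session named "analysis" running a placeholder job.
tmux new-session -d -s analysis 'sleep 300'

# List sessions; "analysis" keeps running even if your SSH connection drops.
tmux list-sessions

# To get back to the job after reconnecting over SSH, run:
#   tmux attach -t analysis
# To leave it running again, press Ctrl-b and then d.
```

Because the job belongs to the tmux session rather than to your SSH login, closing your laptop no longer cancels it.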
Installing R and Python
If you are analyzing data, you will want either R or Python:
- Installing R: Digital Ocean / Linode
- Installing Python (Digital Ocean)
Using R
Note that using R on a cloud server is very different from using RStudio [1]. You launch R by running R in an SSH session, and you’ll see something like this:
$ R
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
This interface is identical to the “Console” in RStudio. To run an R file, you can upload some_analysis.R and run source("some_analysis.R") to execute it on your server.
Uploading and storing data
Just like a laptop, each cloud server comes with some amount of hard drive space. You can upload files from your local computer, and download files as well.
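One common way to move files over the same SSH connection is scp, which ships with most SSH clients. This is a minimal sketch, run from your local computer; the helper names upload and download, the username analyst, the address 203.0.113.10 (a reserved documentation address), and the file results.csv are placeholders, not values from this guide.

```shell
# Placeholder login; replace with your own username and server IP address.
SERVER="${SERVER:-analyst@203.0.113.10}"

# Copy a local file into the remote user's home directory.
upload() {
  scp "$1" "$SERVER:~/"
}

# Copy a file from the remote home directory into the current local directory.
download() {
  scp "$SERVER:~/$1" .
}

# Example calls (commented out because they need a reachable server):
#   upload some_analysis.R
#   download results.csv
```

Graphical SFTP clients do the same job if you prefer not to use the command line.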
Note that when you delete a cloud server, its hard drive space is also deleted. If you want persistent storage that can be moved among cloud servers, you will want to use something like Block Storage on Digital Ocean or Volumes on Linode. These services are analogous to an external hard drive that you can virtually “plug in” to any cloud server running on their respective platforms.
Block storage differs from “object storage”, which was pioneered by Amazon’s S3 service. Object storage is more like Dropbox than an external hard drive – files in object storage must be manually downloaded onto a cloud server, rather than simply getting access to them when “plugging in” a block storage volume.
[1] You can install RStudio Server on a cloud server and connect remotely from your local computer, but note this creates additional security issues as you must expose a web server to the public internet to do this.