Working with AWS

Info

If you wish to use LigandScout Remote with a cluster in the AWS cloud, please contact support@inteligand.com in addition to reading this documentation page.

We will streamline the process by providing configuration files and snapshots that are customized for your needs.

Overview

In addition to installing the server application on one of your physical SGE cluster machines, it is possible to automatically deploy a fully configured cluster in the AWS cloud. This option enables the use of LigandScout’s remote execution capabilities without access to a physical on-site SGE cluster. AWS allows you to deploy arbitrarily large clusters that are billed per hour of up-time.

Reference

AWS Landing Page: https://aws.amazon.com/

Furthermore, HPC clusters in the cloud come with an important advantage over traditional clusters: they are elastic, which means that they can automatically scale up and down depending on the current workload. The iserver application fully exploits this benefit by splitting large jobs into multiple smaller sub-jobs.

Consider the example of a small cluster and a large screening job that is split into several sub-jobs, each planned to run on all CPU cores of a single node. The below figure illustrates this scenario using five sub-jobs and an initial cluster size of two compute nodes. Two of the sub-jobs can immediately start to execute while additional nodes are added to the cluster. These can then be used to execute the three jobs that are waiting in the resource manager’s queue. Once the number of sub-jobs becomes less than the number of nodes, the cluster can start to scale down and reduce running costs.

AWS Account Management

In order to deploy a cluster for virtual screening and conformer generation in the AWS cloud, you will need to set up an AWS account.

Creating Your AWS Account

Simply go to https://aws.amazon.com and click on Create a Free Account.

AWS also provides detailed instructions regarding account creation.

Creating an AWS Access Key

Before AWS ParallelCluster can be fully configured, you need to create a so-called AWS Access Key.

This can be done in the My Security Credentials section of your AWS account, located in the top right of the AWS management page.

In the Security Credentials section, you can then simply click Create New Access Key.

Creating a Key-Pair

Finally, in order to be able to log into the created cluster nodes, you need to generate an AWS EC2 Key Pair. You can navigate to the respective page using
Services -> EC2 -> Key Pairs -> Create Key Pair.
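
If you prefer the command line and already have the AWS CLI installed and configured, a key pair can alternatively be created without the web console. The key name below is only an example:

aws ec2 create-key-pair --key-name my-ligandscout-key --query 'KeyMaterial' --output text > my-ligandscout-key.pem
chmod 400 my-ligandscout-key.pem

The resulting .pem file is the private key you will later use to log into the master node via ssh.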

Deployment with AWS ParallelCluster

To facilitate automatic deployment and configuration, we use Amazon’s ParallelCluster tool. It can be installed on your local desktop computer or notebook and is then used to send complex instructions to AWS. This means you will be able to deploy clusters in the AWS cloud directly from your personal computer.

Info

AWS ParallelCluster was previously named CfnCluster. Both versions work almost exactly the same. However, the former CfnCluster tool uses the cfncluster command instead of the newer pcluster.

Reference

A guide for installing AWS ParallelCluster is located at
https://aws-parallelcluster.readthedocs.io/en/latest/getting_started.html.

The official AWS ParallelCluster Documentation can be found at
https://aws-parallelcluster.readthedocs.io/en/latest/.

Setting up AWS ParallelCluster

Once you have created your secret AWS Access Key and downloaded your new AWS EC2 Key Pair, you can proceed to configure AWS ParallelCluster.

After installing AWS ParallelCluster using the instructions provided in the above reference, some basic configuration is required. Instructions for this are also given in the official documentation. After finishing the initial configuration step started by

pcluster configure

a configuration file is created in a hidden directory in your home folder: ~/.parallelcluster. The ~ character represents the path to the home folder on your machine.
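
You can verify the result of this step by inspecting the generated file:

cat ~/.parallelcluster/config

The next section describes the settings in this file that are relevant for LigandScout Remote.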

AWS ParallelCluster Configuration

ParallelCluster offers a large number of configuration parameters. These range from straightforward settings, such as the operating system to be used for the created cluster nodes, to network-related settings.

In our configuration, we use Ubuntu 16.04 as the base operating system (base_os) and SGE as the resource manager (scheduler). Furthermore, an Elastic Block Storage (EBS) snapshot is used that contains a pre-configured server application installation as well as several compound databases. This snapshot is specified in the ParallelCluster configuration (ebs_snapshot_id) and serves as the template for the shared storage volume of all cluster nodes.

Info

Inte:Ligand provides pre-configured snapshots for LigandScout Expert licensees. Please contact support@inteligand.com to discuss your use case and to retrieve a snapshot ID.

In order to share a snapshot with you, we need to know your AWS Account ID (visible under <Account Name> -> My Account -> Account Settings). Furthermore, EBS snapshots have to be created in the same region as the clusters that depend on them. Therefore, we also need to know your preferred AWS region (e.g. eu-central-1).

Another important configuration option is the type of compute instance to use (compute_instance_type). Along with the number of nodes used (initial_queue_size & max_queue_size), this is the main cost factor. Amazon offers a large variety of instance specifications, starting with one virtual CPU per instance and ranging up to 64 cores per virtual machine. The iserver application is intended to work with compute instances of all sizes.

[aws]
aws_region_name = eu-central-1
aws_access_key_id = <AWS_ACCESS_KEY_ID>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>

[cluster default]
vpc_settings = public
key_name = <AWS_KEY_NAME>
ebs_settings = custom
post_install = <POST_INSTALL_SCRIPT_DOWNLOAD_PATH>
base_os = ubuntu1604
maintain_initial_size = true
compute_instance_type = c5.2xlarge
initial_queue_size = 1
max_queue_size = 5

[vpc public]
master_subnet_id = <AWS_MASTER_SUBNET_ID>
vpc_id = <AWS_VPC_ID>

[global]
update_check = true
sanity_check = true
cluster_template = default

[ebs custom]
ebs_snapshot_id = <SNAPSHOT_ID>
volume_type = gp2

Reference

Amazon EC2 Instance Pricing List:
https://aws.amazon.com/ec2/pricing/on-demand/

Post-Install Script

Another prerequisite for a ParallelCluster deployment is the post-install script. The actions defined in this script are executed on every cluster node after the automatic base setup is finished. This base setup includes configuring the network, connecting the shared storage volumes, and initializing the resource manager and job scheduler.

Apart from this, the most important step is, of course, to start iserver.

Below, you can see a typical post-install script for deploying iserver in the cloud. Depending on your exact needs, we will readily provide a variant of this script for you.

#!/bin/bash
# Load the ParallelCluster environment variables (provides $cfn_node_type)
. /opt/parallelcluster/cfnconfig

# Make the shared volume writable for the default ubuntu user
sudo chown ubuntu:ubuntu -R /shared/

# The remaining steps only need to run on the master node
if [ "$cfn_node_type" == "MasterServer" ]; then
  # Install the Java runtime environment and the jemalloc library
  sudo apt-get update -y
  sudo apt-get install default-jre -y
  sudo apt-get install libjemalloc-dev -y
  # Replace the default SGE libraries with the pre-configured ones from the shared volume
  sudo mv /opt/sge/lib/ /opt/sge/libbckup
  sudo cp -r /shared/lib/ /opt/sge/
  sudo chmod 755 /opt/sge/lib -R
  # Start iserver in a detached screen session as the ubuntu user
  su - ubuntu -c "screen -dm -S iserver /shared/apps/iserver_1_1_7_linux64/iserver"
fi

Info

You do not need to execute the post-install script yourself. Instead, you only have to define a location from where the script can be downloaded (post_install) in your ~/.parallelcluster/config file, so that it can be automatically executed on all instances added to your cloud cluster. The download path for the above script is http://www.inteligand.com/download/aws/post-install-1-1-7.sh.
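
If you would like to review the script before deploying a cluster, you can simply download it and inspect it locally:

wget http://www.inteligand.com/download/aws/post-install-1-1-7.sh
less post-install-1-1-7.sh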

The next section covers how to finally create the AWS cloud cluster using the described configuration.

Working with AWS Clusters

Command-Line Interface

AWS ParallelCluster provides a command-line interface for interacting with AWS clusters.

Reference

The different commands are documented in detail in the official AWS ParallelCluster documentation:
https://aws-parallelcluster.readthedocs.io/en/latest/commands.html

The most essential command is create. It initiates the creation of a new cluster:

pcluster create my-cluster-name

Info

If pcluster create fails to start a new cluster, the --norollback option can be used for debugging purposes. Simply replace the above command with pcluster create my-cluster-name --norollback. The cluster will then be created even if parts of the post-install script fail. You can log into the master node of the cluster using the ssh command below and inspect the /var/log/cfn-init.log and /var/log/cfn-init-cmd.log files for possible failure causes.

For inspecting the resulting cluster or performing manual changes, it is possible to log into the master node of a cluster using SSH. The <master-node-ip> is visible in your AWS console and is also printed after successful cluster creation with pcluster create.

ssh -i <path-to-key-file/key-name.pem> ubuntu@<master-node-ip>

This ssh command also shows the credentials needed to access the cluster from within the LigandScout user interface. The username created by AWS ParallelCluster using our configuration is ubuntu. Instead of a password, you need the private key file created earlier and specified in your ~/.parallelcluster/config file via key_name.
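
To get an overview of your existing clusters and their current state, the ParallelCluster CLI also provides list and status commands (documented in the command reference linked above):

pcluster list
pcluster status my-cluster-name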

In order to save running costs, it is recommended to stop the cluster when it is not needed. This can be done using the stop command:

pcluster stop my-cluster-name

Warning

The stop command does not terminate the cluster completely. The master node and the shared storage volume will continue to run. However, the master node is usually a small and cheap instance. The cost for running volumes is around 0.1€ per month per GB. Therefore, a stopped cluster has only very limited cost implications.

If you want to continue working with a currently stopped cluster, use the start command:

pcluster start my-cluster-name

Once you no longer need a cluster, for example because a snapshot with a newer iserver version is now available, you should delete it:

pcluster delete my-cluster-name

Deleting a cluster terminates all of its resources, so no further costs will be incurred.

Warning

Cluster deletion will also delete the shared storage volume and all data on it. Make sure to save all screening results locally before deleting a cluster. You can also create a new snapshot from the volume and use it for future clusters.
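
For example, screening results can be copied from the shared volume to your local machine with scp before deleting the cluster. The results path below is a placeholder; use the directory where your results are actually stored on the shared volume:

scp -i <path-to-key-file/key-name.pem> -r ubuntu@<master-node-ip>:/shared/<path-to-results> .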

Creating Snapshots

If you want to save the state of your cluster, we recommend creating your own EBS snapshot, which can subsequently be used as the basis for creating new cloud clusters.

Snapshots save the state of a running volume. Therefore, to create a new snapshot, head to the Volumes section of the AWS Console under Elastic Block Storage -> Volumes and select the shared volume of your running cluster. If you are using the provided default configuration, this is the volume with 30 GB storage capacity; the root volumes of the compute instances and the root volume of the master node only have 15 GB each. Then, press Create Snapshot and choose a reasonable name and description for easier identification of your new snapshot.

Under Elastic Block Storage -> Snapshots, you will be able to manage your new snapshot and see its Snapshot ID. In order to use the new snapshot for future clusters, copy the Snapshot ID and set it in your ~/.parallelcluster/config file as ebs_snapshot_id. New clusters created with pcluster create my-cluster-name will then have access to all data that was present on the shared volume used for snapshot creation, including virtual screening results and newly created screening databases.
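
If you prefer the command line over the AWS Console, the same snapshot can also be created with the AWS CLI (the volume ID is a placeholder and the description is only an example):

aws ec2 create-snapshot --volume-id <VOLUME_ID> --description "iserver shared volume with screening data"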

Warning

Snapshots consume EBS storage and will result in additional costs. However, a single 30 GB snapshot will only cost about 0.054 * 30 = 1.62 $ per month.

Uploading Additional Screening Databases

LigandScout Remote allows you to create new screening databases from local molecule libraries via conformer generation. If you already have databases that you want to use with your AWS cluster, you need to upload them. This can be done using any file transfer method. Since the remote cloud cluster is already reachable via SSH, we recommend using scp for this task.

On Linux or Mac systems, you can simply use the following command:

scp -i <path-to-key-file/key-name.pem> -r /path/to/database.ldb ubuntu@<master-node-ip>:/shared/data/compound-databases

On Windows, a separate SCP client, such as WinSCP, is needed.

After the upload of the database is complete, refresh the server's database list using the Load Remote Database dialog within the LigandScout GUI.

Starting from version 1.1.7, it is also possible to upload existing databases directly from the LigandScout user interface.

Warning

In any case, the cloud cluster needs enough storage capacity for newly uploaded databases. Please be aware that the configuration provided by Inte:Ligand likely does not cover these additional storage requirements in order to limit AWS charges. However, it is easy to increase the cloud storage afterwards, as described in the next section.

Increasing the Shared Volume Size

If you plan to upload large additional screening databases, or perform extensive conformer generation jobs, you will likely need to increase the shared storage volume of your cloud cluster at some point. This is a two-step process:

  1. First, log into the AWS web interface (Console) and go to Services -> EC2 -> Elastic Block Store -> Volumes. Here, you will see a list of all currently used volumes. There will be one root volume per compute node, one root volume for the master node, and additionally the shared storage volume that is accessible to all cluster nodes. We need to modify the shared volume, which will be the one with the largest storage capacity. Simply mark it in the list, then click Actions -> Modify Volume. A dialog for modifying the volume will appear.

    In this dialog, you can set the new volume size. Depending on the size of the increase, the volume will be in the optimizing state for up to a few hours. During this period, the volume will exhibit less than optimal performance, but you can still use your cluster as normal.
  2. After increasing the volume size, you need to extend the file system in order for the change to be recognized. This can be done by executing a single command on your cloud cluster. Access your master node via the command already introduced above:
    ssh -i <path-to-key-file/key-name.pem> ubuntu@<master-node-ip>
    
    Then, execute df -h to verify the current size of the file system for each volume. Identify the entry that shows the original volume size and is mounted on /shared, usually /dev/xvdb. Finally, execute the following command to extend the file system (the full sequence is also summarized after this list):
    sudo resize2fs /dev/xvdb
    
    You can check whether the operation succeeded by running df -h once more.
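
For reference, here is the complete command sequence from step 2, to be executed on the master node. It assumes the shared volume is attached as /dev/xvdb; verify the device name with df -h first:

df -h                      # identify the device mounted on /shared and note its current size
sudo resize2fs /dev/xvdb   # grow the file system to match the new volume size
df -h                      # confirm that /shared now shows the increased size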

Info

If you want to create additional clusters with your new volume size, create a snapshot as described in Creating Snapshots.

Cost Management

The cost of cloud clusters is surprisingly low. Nevertheless, it is essential to carefully monitor all charges and to consume only resources that are actually needed.

A good starting point for staying informed about your costs is your AWS Billing Dashboard.

Example Cost Calculation

The current default configuration for LigandScout Remote clusters specifies a maximum of five c5.2xlarge instances. Five is the maximum number of c5.2xlarge instances that new AWS customers can run at the same time. A c5.2xlarge instance costs 0.388 $ per hour and comprises 8 compute-optimized CPUs, which means that the full cluster provides 40 cores.

Warning

The prices given in this section depend on the AWS region used, are subject to frequent change, and might be different at the time of reading. However, prices usually go down when newer instance types are released. Nevertheless, be sure to check current prices using the references above.

To calculate the full cost of running this cluster, several factors have to be considered. We start with the running costs of all 5 compute instances. Additionally, there are charges for the master node instance. By default, the instance type for the master node (specified with master_instance_type) is t2.micro, which costs 0.0134 $ per hour.

$$ 5 * 0.388 + 0.0134 = 1.9534 \frac{$}{h} $$

The charges for the storage volumes associated with the cluster are calculated as follows:

Compute instance root volumes:

$$ 5 * 15 = 75 GB $$

Master node root volume and shared storage volume:

$$ 30 + 15 = 45 GB $$

Total storage volume capacity:

$$ 45 GB + 75 GB = 120 GB $$

Total storage volume cost for the cluster running at full capacity (5 compute nodes), assuming 720 hours per month:

$$ 120 * \frac{0.119}{720} = 0.02 \frac{$}{h} $$

The total cost for the cluster running at full capacity is therefore:

$$ 1.9534 + 0.02 = 1.9734 \frac{$}{h} $$

If the cluster is stopped, only the prices for the master node instance, the master node root volume, and the shared volume apply:

$$ 0.0134 + 45 * \frac{0.119}{720} = 0.02 \frac{$}{h} $$

If the complete cluster runs for a full month, every hour of every day, the total cost will be around 1.9734 * 720 = 1420.85 $. This may sound expensive, but keep in mind that the cluster will automatically scale down to only one running compute instance when the queue is empty. A cluster that is stopped via pcluster stop my-cluster-name will cost around 0.02 * 720 = 14.4 $ per month. If the cluster runs for 2 hours every day of a 30-day month and is stopped the remaining time, it will cost around 30 * 2 * 1.9734 + 30 * 22 * 0.02 = 131.6 $.
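
If you want to double-check these figures, the calculations are easy to reproduce in any shell with the bc calculator installed:

echo "5*0.388 + 0.0134" | bc -l             # hourly instance cost at full capacity (compute + master node)
echo "(1.9534 + 0.02) * 720" | bc -l        # full month at full capacity, including storage volumes
echo "30*2*1.9734 + 30*22*0.02" | bc -l     # 2 hours per day at full capacity, stopped for the rest of the month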

Cost Reduction using Spot Instances

AWS does not only offer on-demand instances, which are provided at a fixed price at all times. Additionally, it is possible to use spot instances. The so-called spot price depends on current supply and demand and can be up to 90% lower than the price of on-demand instances.

If you want to use spot instances, simply include the following line in your ~/.parallelcluster/config file, before creating your cluster:

cluster_type = spot

Using the above setting, AWS will charge the current spot market price, capped at the on-demand price. The price is therefore guaranteed to be lower than or equal to that of on-demand instances. If you want to reduce running costs even further, you can also set the maximum price you are willing to pay per compute instance per hour:

spot_price = 0.15

If you specify a spot_price, AWS will terminate your instance once the current spot market price exceeds your limit. Even without specifying an explicit spot_price, it is possible that the instance is terminated when the available supply of spot instances approaches 0.
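
Taken together, the spot-related settings in the [cluster default] section of your ~/.parallelcluster/config could look like this (the spot_price line is optional and its value is only an example):

[cluster default]
...
compute_instance_type = c5.2xlarge
cluster_type = spot
spot_price = 0.15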

However, this does not pose a problem for LigandScout Remote. Both screening and conformer generation jobs are fault-tolerant. In practice, this means that failed sub-jobs can easily be restarted without having to repeat the complete job. This feature is accessible in LigandScout and KNIME by right-clicking on a failed sub-job in the respective job monitoring dialogs.

Warning

Restarting parts of a failed job is only possible with iserver 1.1.7 or newer. If you are using an older version, you need to restart the complete job in case a cluster node is terminated unexpectedly. Therefore, we strongly recommend updating to the most recent version if you want to use AWS spot instances.

Reference

AWS Spot Instances Overview
https://aws.amazon.com/ec2/spot/

Official AWS Spot Instances Documentation
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html

AWS ParallelCluster Spot Instances Settings
https://aws-parallelcluster.readthedocs.io/en/latest/configuration.html?#spot-price