Cluster Setup

According to the Ray documentation, Ray supports the following cloud providers out of the box: AWS, Azure, GCP, Aliyun, and vSphere, as well as Kubernetes via KubeRay. Ray can also run on other cloud providers, such as Oracle Cloud, by implementing the node provider interface, but that requires extra manual work.

Another option is Ray on Vertex AI, a managed service for running Ray on Google Cloud. It lets you set up a Ray cluster without creating and managing a Kubernetes cluster yourself.

Aana on Kubernetes

Step 1: Create a Kubernetes cluster

The first step is to create a Kubernetes cluster on the cloud provider of your choice. Ray's Managed Kubernetes services docs include instructions for AWS, Azure, and GCP.
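
For example, on GKE you could create a cluster with a CPU node pool for the head and a GPU node pool for the workers. This is a sketch; the cluster name, zone, machine types, and GPU type are placeholders to adjust to your needs.

# Create the cluster with a default CPU node pool (names and zone are examples).
gcloud container clusters create aana-cluster \
    --zone europe-west1-b \
    --machine-type e2-standard-8 \
    --num-nodes 1

# Add a GPU node pool for the Ray workers.
gcloud container node-pools create gpu-pool \
    --cluster aana-cluster \
    --zone europe-west1-b \
    --machine-type n1-standard-8 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --num-nodes 1

# Fetch credentials so kubectl points at the new cluster.
gcloud container clusters get-credentials aana-cluster --zone europe-west1-b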

Step 2: Deploy Ray on Kubernetes

Once you have a Kubernetes cluster, you need to install KubeRay on it. KubeRay is a Kubernetes operator that manages Ray clusters on Kubernetes. You can install KubeRay using Helm. Here is an example of how to install KubeRay on a Kubernetes cluster:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.1.1.
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1

# Confirm that the operator is running in the namespace `default`.
kubectl get pods
# NAME                                READY   STATUS    RESTARTS   AGE
# kuberay-operator-7fbdbf8c89-pt8bk   1/1     Running   0          27s

KubeRay offers multiple installation options for the operator, such as Helm, Kustomize, and a single-namespace install. For further information, refer to the installation instructions in the KubeRay documentation.
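
For instance, here is a sketch of installing the operator into its own namespace instead of default (this alone does not restrict which namespaces the operator watches; see the KubeRay docs for the single-namespace mode):

helm install kuberay-operator kuberay/kuberay-operator \
    --version 1.1.1 \
    --namespace ray-system \
    --create-namespace

kubectl get pods -n ray-system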

Step 3: Create a YAML file for your application

Next, you need to create a YAML file that describes your Ray application. See the example below to get an idea of what the YAML file should look like:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: <service-name>
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray dashboard agent. Default value is 900.
  serveConfigV2: |
    <serve config generated by aana build>

  rayClusterConfig:
    rayVersion: '2.20.0' # Should match the Ray version in the image of the containers
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: <base image for the application>
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            resources:
              limits:
                cpu: "3" # CPU limit for the head pod
                memory: "28G" # Memory limit for the head pod
                ephemeral-storage: "95Gi" # Ephemeral storage limit for the head pod
              requests:
                cpu: "3" # CPU request for the head pod
                memory: "28G" # Memory request for the head pod
                ephemeral-storage: "95Gi" # Ephemeral storage request for the head pod
    workerGroupSpecs:
    # Configuration for a group of worker pods.
    - replicas: 1 # Number of worker nodes
      minReplicas: 1
      maxReplicas: 10
      groupName: gpu-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-worker
            image: <base image for the application>
            resources:
              limits:
                cpu: "3" # CPU limit for the worker pod
                memory: "28G" # Memory limit for the worker pod
                ephemeral-storage: "95Gi" # Ephemeral storage limit for the worker pod
              requests:
                cpu: "3" # CPU request for the worker pod
                memory: "28G" # Memory request for the worker pod
                ephemeral-storage: "95Gi" # Ephemeral storage request for the worker pod
          # These tolerations let the worker pods schedule onto GPU nodes tainted with ray.io/node-type=worker:NoSchedule.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"

serveConfigV2 can be generated by the aana build command. It contains the configuration for the Ray Serve applications.
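
A hypothetical invocation, using the app target from the import paths above (the exact CLI syntax may differ between SDK versions, so check aana build --help):

# Generate the serve config for the application (target path is illustrative).
aana build test_project.app_config:whisper_app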

The full file will look like this:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: aana-sdk
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray dashboard agent. Default value is 900.
  serveConfigV2: |
    applications:

    - name: asr_deployment

      route_prefix: /asr_deployment

      import_path: test_project.app_config:asr_deployment

      runtime_env:
        working_dir: "https://mobius-public.s3.eu-west-1.amazonaws.com/test_project.zip"
        env_vars: 
          DB_CONFIG: '{"datastore_type": "sqlite", "datastore_config": {"path": "/tmp/aana_db.sqlite"}}'

      deployments:

      - name: WhisperDeployment
        num_replicas: 1
        max_ongoing_requests: 1000
        user_config:
          model_size: tiny
          compute_type: float32
        ray_actor_options:
          num_cpus: 1.0

    - name: vad_deployment

      route_prefix: /vad_deployment

      import_path: test_project.app_config:vad_deployment

      runtime_env:
        working_dir: "https://mobius-public.s3.eu-west-1.amazonaws.com/test_project.zip"
        env_vars: 
          DB_CONFIG: '{"datastore_type": "sqlite", "datastore_config": {"path": "/tmp/aana_db.sqlite"}}'

      deployments:

      - name: VadDeployment
        num_replicas: 1
        max_ongoing_requests: 1000
        user_config:
          model: https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin
          onset: 0.5
          offset: 0.363
          min_duration_on: 0.1
          min_duration_off: 0.1
          sample_rate: 16000
        ray_actor_options:
          num_cpus: 1.0

    - name: whisper_app

      route_prefix: /

      import_path: test_project.app_config:whisper_app

      runtime_env:
        working_dir: "https://mobius-public.s3.eu-west-1.amazonaws.com/test_project.zip"
        env_vars: 
          DB_CONFIG: '{"datastore_type": "sqlite", "datastore_config": {"path": "/tmp/aana_db.sqlite"}}'

      deployments:

      - name: RequestHandler
        num_replicas: 2
        ray_actor_options:
          num_cpus: 0.1


  rayClusterConfig:
    rayVersion: '2.20.0' # Should match the Ray version in the image of the containers
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: europe-docker.pkg.dev/customised-training-app/eu.gcr.io/aana/aana:0.2-ray-2.20@sha256:8814a3c12c6249a3c2bb216c0cba6eef01267d4c91bb58700f7ffc2311d21a3d
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            resources:
              limits:
                cpu: "3"
                memory: "28G"
                ephemeral-storage: "95Gi"
              requests:
                cpu: "3"
                memory: "28G"
                ephemeral-storage: "95Gi"
    workerGroupSpecs:
    # Configuration for a group of worker pods.
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: gpu-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-worker
            image: europe-docker.pkg.dev/customised-training-app/eu.gcr.io/aana/aana:0.2-ray-2.20@sha256:8814a3c12c6249a3c2bb216c0cba6eef01267d4c91bb58700f7ffc2311d21a3d
            resources:
              limits:
                cpu: "3"
                memory: "28G"
                ephemeral-storage: "95Gi"
              requests:
                cpu: "3"
                memory: "28G"
                ephemeral-storage: "95Gi"
          # These tolerations let the worker pods schedule onto GPU nodes tainted with ray.io/node-type=worker:NoSchedule.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"

Let's take a look at a few critical sections of the YAML file:

runtime_env: This section specifies the runtime environment for the application. It includes the working directory, environment variables, and, optionally, Python packages to install.

The working directory should be a URL pointing to a zip file that contains the application code. It is possible to bake the working directory directly into the Docker image instead, but this is not recommended because it makes updating the application code harder. See the Remote URIs docs for more information.
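
A minimal sketch of producing such a zip and uploading it, assuming an S3 bucket as the host (bucket name and paths are placeholders; any publicly downloadable URL works):

# Package the application code; the zip should contain the top-level package directory.
zip -r test_project.zip test_project/

# Upload it somewhere Ray can download it from (placeholder bucket).
aws s3 cp test_project.zip s3://my-bucket/test_project.zip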

The environment variables are passed to the application as a dictionary. In this example, we are passing a configuration for a SQLite database.

You can also specify additional Python dependencies using keys such as py_modules, pip, and conda. For more information, see the docs on handling dependencies.
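
For example, to install extra packages at runtime, you could extend runtime_env with a pip key (package names below are illustrative):

runtime_env:
  working_dir: "https://mobius-public.s3.eu-west-1.amazonaws.com/test_project.zip"
  pip:
    - "transformers==4.40.0"
    - "soundfile"
  env_vars:
    DB_CONFIG: '{"datastore_type": "sqlite", "datastore_config": {"path": "/tmp/aana_db.sqlite"}}'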

You can also adjust the deployment parameters if needed: for example, the number of replicas for each deployment, or even the model parameters.
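
For instance, here is a sketch of running the Whisper deployment with a larger model on a GPU (whether a given user_config key is honored depends on the deployment's implementation):

- name: WhisperDeployment
  num_replicas: 2
  max_ongoing_requests: 1000
  user_config:
    model_size: medium    # larger model instead of tiny
    compute_type: float16
  ray_actor_options:
    num_cpus: 1.0
    num_gpus: 1.0         # schedules each replica on a GPU worker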

Another important setting is the base image for the application. Usually you can use a pre-built image from the Ray project, but Aana requires some additional system dependencies, and it also makes sense to bake Aana and all other Python dependencies into the image.

Here is an example of a Dockerfile that includes Aana and Ray:

FROM rayproject/ray:2.20.0.0ae93f-py310
RUN sudo apt-get update && sudo apt-get install -y libgl1 libglib2.0-0 ffmpeg
RUN pip install https://test-files.pythonhosted.org/packages/2e/e7/822893595c45f91acec902612c458fec9ed2684567dcd57bd3ba1770f2ed/aana-0.2.0-py3-none-any.whl
RUN pip install "ray[serve]==2.20"

Keep in mind that this image does not have GPU support. If you need GPU support, choose a GPU-enabled base image from the Ray project.
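
A sketch of a GPU variant, assuming the corresponding -gpu tag exists for your Ray version (verify the exact tag on Docker Hub):

# CUDA-enabled Ray base image (tag is an assumption; check Docker Hub).
FROM rayproject/ray:2.20.0-py310-gpu
RUN sudo apt-get update && sudo apt-get install -y libgl1 libglib2.0-0 ffmpeg
RUN pip install https://test-files.pythonhosted.org/packages/2e/e7/822893595c45f91acec902612c458fec9ed2684567dcd57bd3ba1770f2ed/aana-0.2.0-py3-none-any.whl
RUN pip install "ray[serve]==2.20"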

Ideally, we should build a few base images for Aana so they can be used directly in the YAML file without any additional build steps and pushing to the registry.

In the example, we are using Artifact Registry from Google Cloud. You can use any other registry like Docker Hub, GitHub Container Registry, or any other registry that supports Docker images.
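
A sketch of building and pushing the image to Artifact Registry (the repository path and tag are placeholders; authenticate first):

# Let Docker authenticate against the Artifact Registry host.
gcloud auth configure-docker europe-docker.pkg.dev

# Build and push the image (placeholder repository path and tag).
docker build -t europe-docker.pkg.dev/<project>/<repo>/aana:0.2-ray-2.20 .
docker push europe-docker.pkg.dev/<project>/<repo>/aana:0.2-ray-2.20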

The resource limits and requests also need adjustment: tune them to your application's requirements. Keep in mind that ephemeral storage must be set to a reasonably high value, otherwise the application will not deploy.

Step 4: Deploy the application

After creating the YAML file, you can deploy the application to the Kubernetes cluster using the following command:

kubectl apply -f <your-yaml-file>.yaml

This will create the necessary resources in the Kubernetes cluster to run your Ray application.

You can also use the same command to update the application after changing the YAML file. For example, to scale the ASR deployment, set num_replicas: 2 in the WhisperDeployment section and run kubectl apply -f <your-yaml-file>.yaml again; Kubernetes will start another replica of the deployment.
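
The change inside serveConfigV2 looks like this:

- name: WhisperDeployment
  num_replicas: 2   # was 1; kubectl apply rolls out the extra replica
  max_ongoing_requests: 1000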

Step 5: Monitor the application

To access the Ray dashboard, you can use port forwarding to access it locally:

kubectl port-forward service/aana-sdk-head-svc 8265:8265 8000:8000

This forwards ports 8265 and 8000 from the Ray head service to your local machine. You can then open the Ray dashboard in a browser at http://localhost:8265. The application will be available at http://localhost:8000, with API documentation at http://localhost:8000/docs and http://localhost:8000/redoc.
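
You can also inspect the status of the RayService resource itself with kubectl, using the service name from the example above:

# High-level status of the RayService and its underlying RayCluster.
kubectl get rayservice aana-sdk
kubectl describe rayservice aana-sdk

# Pods of the Ray cluster (head and workers).
kubectl get pods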

Things to Consider

Shared storage

The application stores some files on the local disk, and those files are not accessible from other nodes in the cluster. This is a problem when the application is deployed on a multi-node cluster. The solution is shared storage such as NFS, which is also what the Ray documentation recommends. On GKE, Filestore can be used as shared storage.
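
A minimal sketch of mounting such storage into the Ray pods, assuming an NFS server (for example a Filestore instance) is reachable at a known IP; the same volume and volumeMounts would go into both the head and worker pod templates:

template:
  spec:
    containers:
    - name: ray-head
      # ... image, ports, and resources as in the examples above ...
      volumeMounts:
      - name: shared-storage
        mountPath: /mnt/shared
    volumes:
    - name: shared-storage
      nfs:
        server: 10.0.0.2   # NFS/Filestore server IP (placeholder)
        path: /share       # exported path (placeholder)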

Database

By default, Aana SDK uses SQLite as its database. For cluster deployments, it's recommended to use a more robust database such as PostgreSQL. You can use a managed database service like Cloud SQL on GCP or RDS on AWS.
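
For example, DB_CONFIG could point at PostgreSQL instead of SQLite. The field names inside datastore_config are an assumption modeled on the SQLite example above, so verify them against the Aana SDK configuration docs:

env_vars:
  # datastore_config fields are assumed; check the Aana SDK docs.
  DB_CONFIG: '{"datastore_type": "postgresql", "datastore_config": {"host": "<db-host>", "port": 5432, "user": "aana", "password": "<password>", "database": "aana_db"}}'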