How to permanently install a module on Google Colab.

One of the things nobody tells you about data science is that training models can take an eternity. If you don't believe me, try adding 1 million parameters to your grid search. My current system cannot train a CatBoost base model, and don't even get me started on XGBoost or grid search. But will I give up on being a data scientist? Not a chance.

Thanks to Google for giving us Colab, whose interface is like that of a Jupyter notebook. If you haven't used Colab before, you should definitely check it out. If you need a quick intro on how to use the interface, check out this guide: Get started with Google Colaboratory.

Google Colab doesn't just make training models faster; it also lets you use GPUs and TPUs. If you want to know how to enable them, check out How to Use Kaggle and Google Colab with GPU Enabled. Thanks to these features, it's equipped to handle heavy computations such as hyperparameter optimization and computer vision.

If you've been using Google Colab for a while like me, you may have discovered that any module that doesn't come pre-installed has to be installed again every time you need it. Anytime I open a Colab notebook and have to install a module all over again, I think, "Here we go again."

This doesn't just irritate me because of the stress; it also wastes my data, and if you live in Nigeria like me, you'll know that every MB counts. Imagine having to install CatBoost, XGBoost, and PyCaret every time I want to start a new notebook. That's a whole lot of data wasted, but that was what I did until one day I decided enough was enough.

First, we have to understand a problem before solving it, so let's ask the obvious question: "Why do we need to reinstall a module every time we want to use it on Colab?" The answer is fairly simple. We don't need to reinstall a module every time we need it on Anaconda or on our own system because when we 'pip' or 'conda' install a module, the files that make up the module are saved to our system. Hence, anytime we need the module again, we simply import it from that directory. Unfortunately, Colab doesn't work the same way. Each session runs on a temporary virtual machine, so a package we install is available for that session, but once the session ends, the environment resets and the module will not be found if we try to import it again.
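To see where Python actually looks for a module, here is a small sketch that uses only the standard library. The helper name `module_location` is my own, for illustration, and `json` is used only because it ships with Python.

```python
import importlib.util
import sys

def module_location(name):
    """Return the file a module would be loaded from, or None if it can't be found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin

# Directories Python searches when you write `import something`
print(sys.path)

# A module that ships with Python is found somewhere on that search path...
print(module_location("json"))

# ...while a name that isn't installed anywhere on it is not found.
print(module_location("a_module_that_is_not_installed"))
```

On a fresh Colab session, a module you installed in an earlier session behaves like the last case: it simply isn't in any of the folders on `sys.path` anymore.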

How do we solve this, then? We replicate the mechanism that works on our machine: we save the module as files in a folder on our Google Drive, and simply import it from there whenever we need it again.
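You can try the idea out without Colab at all. In this sketch, a temporary folder stands in for the Google Drive folder, and `hello_persist` is a made-up throwaway module; both are my assumptions for the demo, not part of the actual setup.

```python
import importlib
import os
import sys
import tempfile

# A temporary folder stands in for the Google Drive folder (demo assumption).
package_dir = tempfile.mkdtemp()

# Write a tiny module into it; `hello_persist` is a made-up name.
with open(os.path.join(package_dir, "hello_persist.py"), "w") as f:
    f.write("def greet():\n    return 'imported from a custom folder'\n")

# Tell Python to also look in that folder, then import as usual.
sys.path.insert(0, package_dir)
importlib.invalidate_caches()
hello_persist = importlib.import_module("hello_persist")

print(hello_persist.greet())
print(hello_persist.__file__)  # the file lives inside package_dir
```

The rest of the article does the same thing, except that the folder lives on Google Drive and the module is put there by pip.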

Let's do some coding to implement this solution. We'll use spaCy as an example. spaCy is a Python library used for Natural Language Processing; you can check the documentation at spacy.io. I'm using this module because I've never installed it on my Colab.

  1. Open Google Colab
  2. Mount Google Drive
    # os and sys let us work with file paths and Python's import path
    import os, sys
    # drive is the Colab module that lets us mount Google Drive from Python
    from google.colab import drive
    # mounting Google Drive makes its contents available under /content/gdrive
    drive.mount('/content/gdrive')
    # link the Drive folder to a path without spaces, then put it on the import path
    nb_path = '/content/notebooks'
    if not os.path.lexists(nb_path):  # avoids FileExistsError when the cell is re-run
        os.symlink('/content/gdrive/My Drive/Colab Notebooks', nb_path)
    sys.path.insert(0, nb_path)  # or append(nb_path)
    
  3. Install the module in the notebook folder permanently

    !pip install --target=$nb_path spacy
    

    In your case, just change 'spacy' to the name of the module you wish to install.

  4. Import the module to confirm that it has been installed

    import spacy
    

    The bulk of our work is done.
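Since the whole point is to save data, it can help to skip the install when the module is already sitting in the folder. Here is a minimal sketch of that check; the helper names `is_installed_in` and `install_if_missing` are hypothetical, and the demo uses a temporary folder with a fake module in place of the Drive folder so that nothing is downloaded.

```python
import importlib.machinery
import os
import subprocess
import sys
import tempfile

def is_installed_in(package_dir, module_name):
    """Return True if `module_name` can be found directly inside `package_dir`."""
    spec = importlib.machinery.PathFinder.find_spec(module_name, [package_dir])
    return spec is not None

def install_if_missing(package, package_dir, module_name=None):
    """Run `pip install --target` only when the module is not already in the folder.

    Returns True if pip was run, False if the install was skipped.
    """
    # The import name can differ from the pip name, so it can be passed separately.
    module_name = module_name or package
    if is_installed_in(package_dir, module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--target", package_dir, package]
    )
    return True

# Offline demo: a temp folder with a fake module stands in for the Drive folder.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "fake_module.py"), "w") as f:
    f.write("VALUE = 1\n")

print(is_installed_in(demo_dir, "fake_module"))
print(is_installed_in(demo_dir, "not_there"))
print(install_if_missing("fake_module", demo_dir))  # skipped, pip is never called
```

In Colab, after the mount cell, something like `install_if_missing("spacy", nb_path)` should only download spaCy the first time, assuming the Drive folder still holds the earlier install.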

One rule of programming is that you should write tests to validate that your code does what it's supposed to. Of course, you could skip them, but I'd advise against that. Now let's test whether the module can still be imported when we come back next time. To test this:

  1. Open a new Colab notebook
  2. Try to import the module
    import spacy

You should get an error. Why? Because you haven't added the folder where the module is located to Python's path. Let's import it the right way:

  1. Mount your Google Drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    
  2. Add the directory where the module is located to the path
    import sys
    sys.path.append('/content/gdrive/My Drive/Colab Notebooks')
    
  3. Import the module
    import spacy
    

This should work like a charm. If it doesn't work for you, go over the article again; you must have missed something. Correct it and you should be fine.

Anytime you need to import the module, just repeat the last three steps and you'll be fine.
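If you get tired of typing those three steps, they can be wrapped in one small function. This is only a sketch: `import_from_dir` is a name I made up, the Colab paths in the comment are the ones used earlier in this article, and the demo at the bottom uses a temporary folder with a throwaway module so it can run outside Colab.

```python
import importlib
import os
import sys
import tempfile

def import_from_dir(module_name, package_dir, mount_point=None):
    """Optionally mount Drive, add `package_dir` to sys.path, and import the module."""
    if mount_point is not None:
        # google.colab only exists inside Colab, so it is imported lazily here.
        from google.colab import drive
        drive.mount(mount_point)
    if package_dir not in sys.path:
        sys.path.append(package_dir)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)

# In Colab (paths taken from the steps above):
# spacy = import_from_dir(
#     "spacy",
#     "/content/gdrive/My Drive/Colab Notebooks",
#     mount_point="/content/gdrive",
# )

# Offline check with a throwaway module in a temp folder (names are made up):
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "persist_demo.py"), "w") as f:
    f.write("ANSWER = 'found it'\n")

persist_demo = import_from_dir("persist_demo", demo_dir)
print(persist_demo.ANSWER)
```

Leaving `mount_point` as `None` skips the Drive step, which is handy if Drive is already mounted in the current session.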

Thanks for reading. Let me know in the comments what you think I should write about next.

Merry Christmas and a Happy New Year in advance.