Saturday, 25 April 2020

Anaconda vs Miniconda vs Vanilla Python - Which one to use?

Have you heard about Anaconda or Miniconda?


If you are a Python developer or someone interested in learning Python, you should definitely know about them.

Here I explain them from an individual user's perspective.

Anaconda - This is one of the world's most popular open source Python distributions, and it ships with a lot of machine learning and data science packages. It is the best option for someone who wants to learn data science and machine learning. It is like a complete ecosystem of libraries and packages for a data scientist or an advanced Python programmer.

Miniconda - This is a minimal version of Anaconda. It is a small, bootstrap version that includes only conda, Python, the packages they depend on, and a small number of other useful packages such as pip and zlib.

Both Anaconda and Miniconda come with the conda package manager. Using conda, we can install any of the several thousand packages hosted in cloud repositories, and conda takes care of finding and installing the packages they depend on.
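
If you are not sure whether the Python you are running came from Anaconda or Miniconda, here is a small sketch that checks it. It relies on the fact that conda keeps a conda-meta directory inside every environment it manages. The function name is only an illustration.

```python
import os
import sys


def is_conda_environment(prefix: str = sys.prefix) -> bool:
    """Return True if the given prefix looks like a conda environment.

    Conda stores its package metadata in a 'conda-meta' directory inside
    every environment it manages, so its presence is a reasonable hint.
    """
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if __name__ == "__main__":
    kind = "conda-managed" if is_conda_environment() else "not conda-managed"
    print(f"{sys.executable} is {kind}")
```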

Official Python - This is the standard distribution of Python from python.org. It uses the pip package manager.

For beginner Python developers on Windows, I would recommend Miniconda as the development environment, because it makes dependencies easy to handle. Some packages are troublesome to install with pip on Windows.

Tuesday, 26 March 2019

How to run Python applications as a system service in CentOS 7 or RHEL 7?

I started developing Python applications in 2012. During my learning period, I used screen in Linux to run my Python applications in the background. Most of my colleagues followed the same approach. The advantage of this approach is its simplicity. But the screen sessions are lost when the system reboots, and the applications do not start again on boot.

Systemd is the system and service manager for CentOS 7 and RHEL 7. It is the first process that gets started after the system boots. It is backward compatible with the SysV init scripts used by CentOS 6 and RHEL 6.

It is very easy to run any custom application as a system service. Once it is a service, it can be started, stopped and monitored with the standard systemctl commands.

Suppose I have a Python Flask application app.py and I want to run it as a system service. I want to run this application under my unix user account amal.

The contents of app.py are given below.
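
The original listing did not survive on this page. As a stand-in, here is a minimal sketch of an app.py. Note that it is not the Flask code from the post: it is a plain WSGI application written with the standard library only, so that it runs without installing anything. It still exposes an object named app, which is what the gunicorn command app:app expects. The greeting text and the port 8080 used for direct runs are assumptions.

```python
# app.py - stand-in sketch, NOT the original Flask listing (which is missing).
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Minimal WSGI application: answer every request with a plain-text greeting."""
    body = b"Hello from app.py\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    # Port 8080 is an assumption, chosen to match the gunicorn example.
    with make_server("0.0.0.0", 8080, app) as server:
        server.serve_forever()
```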




My code is located in /home/amal. In a production deployment, the code is usually not kept in a user's home directory. I use it here only to show how to run the application as a non-root user.

The command to run my application is


python app.py 

For running in production mode with a WSGI server such as gunicorn, the command will be
gunicorn app:app --bind 0.0.0.0:8080 -w 2

To run this application as a system service, we just need to do the following steps.

Create a file myapp.service in the location /etc/systemd/system/. The content of myapp.service is given below.
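
The original unit file is missing from this page, so the following is only a sketch of what a unit file for this scenario could look like. It is written as a small Python script that prints the file content. The description text, the After and Restart settings and the gunicorn path /usr/bin/gunicorn are my assumptions. Check the real path with "which gunicorn", because ExecStart needs an absolute path.

```python
# Sketch: render a systemd unit file for the application described above.
# Every value passed in below is an assumption, not the original content.
UNIT_TEMPLATE = """\
[Unit]
Description={description}
After=network.target

[Service]
User={user}
WorkingDirectory={workdir}
ExecStart={exec_start}
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""


def render_unit(description, user, workdir, exec_start):
    """Return the text of a unit file that runs exec_start as the given user."""
    return UNIT_TEMPLATE.format(
        description=description,
        user=user,
        workdir=workdir,
        exec_start=exec_start,
    )


if __name__ == "__main__":
    print(
        render_unit(
            description="My Python application",
            user="amal",
            workdir="/home/amal",
            exec_start="/usr/bin/gunicorn app:app --bind 0.0.0.0:8080 -w 2",
        ),
        end="",
    )
```

The printed text can be saved as /etc/systemd/system/myapp.service by a user with root privileges.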


Now save the file and execute the following command. It makes systemd reload its unit files so that the new service is picked up.
systemctl daemon-reload

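Now we can start the application with the following command
systemctl start myapp

To make the application start automatically on boot, also enable the service. This works only if the unit file has an [Install] section.
systemctl enable myapp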

Now check the status of the application with the following command
systemctl status myapp

We can also invoke the web service using the curl command.
curl -X GET http://localhost:8080/


How to check the CPU and memory utilization of a Linux system?

One of the best and easiest ways to monitor the real-time CPU and memory utilization of a Linux system is the htop command.

The command provides an interactive user interface with a summary of the load on the system and the details of the individual processes running on it. This is very helpful for identifying the resource utilization of the applications running on a system.
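
htop is interactive. If you need the same kind of numbers inside a script, here is a minimal sketch that reads the load average and the memory usage on Linux. It assumes that /proc/meminfo has the MemTotal and MemAvailable fields, which is the case on CentOS 7 and RHEL 7. The function names are my own.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: number}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # The first token after the colon is the value (in kB for sizes).
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_used_percent(info):
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    load1, load5, load15 = os.getloadavg()
    print(f"load average: {load1:.2f} {load5:.2f} {load15:.2f}")
    print(f"memory used: {memory_used_percent(info):.1f}%")
```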



Monday, 25 March 2019

How to list all the indices in Elasticsearch?

To list all the indices from the unix command line, type the following command.

curl -X GET http://eshost:9200/_cat/indices


To check this from the browser, simply type the following URL in the address bar.

http://eshost:9200/_cat/indices



Here eshost is the hostname of the Elasticsearch server. You can also replace it with the IP address of the server. 9200 is the default port used by Elasticsearch.
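
The same request can be made from Python. The sketch below uses only the standard library and asks for JSON output by adding format=json to the URL. It assumes an unsecured server reachable over plain HTTP, as in the curl example. A cluster with authentication or HTTPS would need more than this. The function names are my own.

```python
import json
from urllib.request import urlopen


def cat_indices_url(host, port=9200):
    """Build the _cat/indices URL, asking Elasticsearch for JSON output."""
    return f"http://{host}:{port}/_cat/indices?format=json"


def list_indices(host, port=9200, timeout=10):
    """Return the sorted names of all indices on the given server."""
    with urlopen(cat_indices_url(host, port), timeout=timeout) as response:
        rows = json.load(response)
    return sorted(row["index"] for row in rows)


if __name__ == "__main__":
    # "eshost" is the placeholder hostname used in this post.
    for name in list_indices("eshost"):
        print(name)
```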

What is an inode in Linux? How to check the inode usage in Linux?

The inode is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block location(s) of the object's data. File-system object attributes may include metadata (times of last change, access, modification), as well as owner and permission data. Inode is also known as Index node.

Directories are lists of names assigned to inodes. A directory contains an entry for itself, its parent, and each of its children. Sometimes we get a "No space left on device" error on a disk that still has free storage space. In most cases this happens because the file system has run out of inodes. The storage space is available, but the inode table has no free entries.

The command to check the storage space is given below.

df -h

The command to check the inode usage is given below.

df -i

If all the inodes of a file system are used, we cannot create any new files on it. This usually happens when we keep a very large number of small files.
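
The inode numbers are also available from Python through os.statvfs. The sketch below computes roughly what df -i shows for one mount point. The function names are my own. Some file systems report zero total inodes, so that case is handled separately.

```python
import os


def used_percent(total, free):
    """Percentage of inodes in use; 0.0 if the file system reports no inodes."""
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def inode_usage(path="/"):
    """Return (total inodes, free inodes, percent used) for the file system at path."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree, used_percent(st.f_files, st.f_ffree)


if __name__ == "__main__":
    total, free, percent = inode_usage("/")
    print(f"inodes: {total} total, {free} free, {percent:.1f}% used")
```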

Tuesday, 22 August 2017

How to list all the local users in a Linux system?

The following command can be used to list all the local users in Linux.

cut -d: -f1 /etc/passwd

It prints the first field (the user name) of every entry in /etc/passwd. If the system is integrated with LDAP or a similar directory service, the users defined there will not be listed.
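
For use inside a script, here is a minimal Python sketch of the same idea. It takes the first colon-separated field of every line, as the cut command does, but it also skips empty lines. The function name is my own.

```python
def parse_usernames(passwd_text):
    """Return the user names (first ':'-separated field) from passwd-style text."""
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        names.append(line.split(":", 1)[0])
    return names


if __name__ == "__main__":
    with open("/etc/passwd") as f:
        for name in parse_usernames(f.read()):
            print(name)
```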

Sunday, 4 January 2015

How to import data from RDBMS to Hadoop and vice versa?

Hadoop became very popular within a few years because of its robust design, its open source license and its ability to handle large data. Nowadays a lot of RDBMS to Hadoop migration projects are happening. Hadoop is not a replacement for the RDBMS, but for certain use cases Hadoop can perform better than an RDBMS. Some projects need data from an RDBMS along with data from multiple other sources to find insights. In these scenarios, we need to transfer data from the RDBMS to the Hadoop environment. This task sounds simple, but it is difficult in practice because it involves a lot of risk. The possible solutions for importing data from RDBMS to Hadoop are explained below.

1) Using SQOOP
Sqoop is a Hadoop ecosystem component developed for importing data from RDBMS to Hadoop and for exporting data from Hadoop to RDBMS. A Sqoop job runs as a MapReduce job, so it uses Hadoop's parallelism to import and export in parallel. Internally it is a map-only job that talks to the database over JDBC. To use Sqoop, we need good network connectivity between the RDBMS environment and the Hadoop environment.

2) By dumping the data from database and transferring via portable secondary storage devices
Many companies do not allow direct network connectivity from Hadoop to the RDBMS environment. Another reason for not allowing it is that a Sqoop job can push a very high volume of data through the network, which affects the performance of other systems connected to it. In such cases, the data is dumped from the database, copied to a portable secondary storage device or to cloud storage (if allowed), and then transferred to the Hadoop environment.
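
For the first option, a Sqoop import is started from the command line. The sketch below only builds and prints such a command, so that the main arguments are visible. The JDBC URL, user name, table and target directory are made-up example values. The -P option makes Sqoop prompt for the password instead of taking it on the command line.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for a basic 'sqoop import' of one table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password on the console
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are hypothetical examples.
    cmd = build_sqoop_import(
        jdbc_url="jdbc:mysql://dbhost/sales",
        username="etl",
        table="orders",
        target_dir="/data/sales/orders",
    )
    print(shlex.join(cmd))
```

The number of mappers controls how many parallel connections are opened to the database, so it is worth keeping it low when the database or the network is shared with other systems.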

How to check the memory utilization of cluster nodes in a Kubernetes cluster?

 The memory and CPU utilization of a Kubernetes cluster can be checked by using the following command. kubectl top nodes The above command...