Hadoop Hive-With Practical
Hive is a data warehouse system which is used for querying and analysing large datasets stored in HDFS. It process structured and semi-structured data in Hadoop.
Metastore: stores metadata for Hive tables (like their schema and location) and partitions in a relational database(traditional RDBMS format).
Driver: acts like a controller which receives the HiveQL statements. It monitors the life cycle and the progress of the execution of the HiveQL statement. it stores the necessary metadata generated during the execution of a HiveQL statement. it does the compiling and optimizing and executing of the HiveQL statements.
Compiler: It performs the compilation of the HiveQL query. it converts the query to an execution plan which contains the tasks(of mapreduce).
Optimizer: It performs various transformations on the execution plan to provide optimized plan. It aggregates the transformations together, such as converting a pipeline of joins to a single join.
Executor: Once compilation and optimization complete, the executor executes the tasks.
Beeline: The Beeline is a command shell supported by HiveServer2, where the user can submit its queries and command to the system.
Hive Server 2: enhanced version of Hive Server 1 wich allows multiple clients to submit requests to Hive and retrieve the final results. It is basically designed to provide the best support for open API clients like JDBC and ODBC and Thrift.
There are so many features of Apache Hive. Let’s discuss them one by one-
- Hive provides data summarization, query, and analysis in much easier manner.
- Hive supports external tables which make it possible to process data without actually storing in HDFS.
- Apache Hive fits the low-level interface requirement of Hadoop perfectly.
- It also supports partitioning of data at the level of tables to improve performance.
- Hive has a rule based optimizer for optimizing logical plans.
- It is scalable, familiar, and extensible.
Lets do with some practical .. !!
One of the Interview Question , Which I Had completed along my Learning way
PROJECT DESCRIPTION :
The project aims to display Beverages-to-Multiple Branches of one Coffee Shop using Hive. In other words, each Beverage might be available on many Branches, and each Branch of the Coffee shop might distribute many Beverages. Assuming each branch send their sales report as a csv file. The project aims to stage them to HDFS and further analysis to be performed using Hive for the given problem statement below.
The description of the data is as below :
- Beverages Name does not have spaces.
- Coffee shop Branches have mentioned as Branch1, Branch2 and etc.
- Beverages can be ordered many times, with different counts
- A Beverage and Branch combination might appear multiple times
- Beverages could be available on multiple Branches
- The output should have no commas or punctuation, only 1 space between the Beverages and Count of Consumed people.
- What is the total number of consumers for Branch1?
- What is the number of consumers for the Branch2?
- What is the most consumed beverage on Branch1?
- What are the beverages available on Branch10, Branch8, and Branch1?
Placing the given dataset in HDFS Create directory in HDFS
Step 1: Before creating directories in HDFS, ensure all the daemons in hadoop are started. The below code is for creating directory called “project1” as follows:
$ hadoop fs -mkdir /user/hive/project1
Placing the input files in the HDFS directory
Step 1: Copying all the given dataset files from local to HDFS directory in a separate directory. The code as follows:
$ hadoop fs -copyFromLocal Bev_BranchA.txt /user/hive/project1/Bev_BranchA.txt
$ hadoop fs -copyFromLocal Bev_BranchB.txt /user/hive/project1/Bev_BranchB.txt
$ hadoop fs -copyFromLocal Bev_BranchC.txt /user/hive/project1/Bev_BranchC.txt
$ hadoop fs -copyFromLocal Bev_ConscountA.txt /user/hive/project1/Bev_ConscountA.txt
$ hadoop fs -copyFromLocal Bev_ConscountB.txt /user/hive/project1/Bev_ConscountB.txt
$ hadoop fs -copyFromLocal Bev_ConscountC.txt /user/hive/project1/Bev_ConscountC.txt
Step 2: After the Step 1, check whether the files got placed in the HDFS in browser. Implementation in HIVE
Creating HIVE DB:
Step 1: Create a database in the name “hadoophiveeproj” in HIVE.
The code as follows:
Hive> create database if not exits hadoophiveproj with dbproperties(‘creator’=’Abhay’,’date’=’2/1/2022’);
Creating & Loading the HIVE tables with the given datasets
Step 1: Create separate raw tables for the Beverages-Counsumercount different datasets each in “hadoophiveproj” database
The codes for creating tables are as follows,
Hive> use hadoophiveproj;
Hive> create table if not exists BevcountA (beverage string,count int) row format delimited fields terminated by ‘,’;
Hive> create table if not exists BevcountB(beverage string,count int) row format delimited fields terminated by ‘,’;
Hive> create table if not exists BevcountC (beverage string,count int) row format delimited fields terminated by ‘,’;
Step 2: Loading the Beverage -Number of consumers’ raw tables from the given text files individually. The code as follows,
Hive> load data inpath ‘/user/hive/projinput/Bev_ConscountA.txt’ into table BevcountA;
Hive> load data inpath ‘/user/hive/projinput/Bev_ConscountB.txt’ into table BevcountB;
Hive> load data inpath ‘/user/hive/projinput/Bev_ConscountC.txt’ into table BevcountC;
Step 3: Create separate raw tables for the Beverages-Branches different datasets each in “hadoophiveproj” database.
The codes for creating tables are as follows,
Hive> create table if not exists BevbranchA(beverage string,branch string) row format delimited fields terminated by ‘,’;
Hive> create table if not exists BevbranchB(beverage string, branch string) row format delimited fields terminated by ‘,’;
Hive> create table if not exists BevbranchC(beverage string, branch string) row format delimited fields terminated by ‘,’;
Step 4: Loading the Beverage type-Branch raw tables from the given text files individually. The code as follows:
hive> load data inpath ‘/user/hive/projinput/Bev_BranchA.txt’ into table BevbranchA;
hive> load data inpath ‘/user/hive/projinput/Bev_BranchB.txt’ into table BevbranchB;
hive> load data inpath ‘/user/hive/projinput/Bev_BranchC.txt’ into table BevbranchC;
These is the process for setting up the environment with creating the tables and loading the data in the table
If you want to execute the Query and try yourself ,Give it a try
or else I have updated the answer with the dataset in the below github link you can follow that:
GitHub - abhaydesai01/hadoop-hive-project
You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…
I Hope you like it