Spark natively supports the ORC data source: you can read and write ORC files using the orc() method on DataFrameReader and DataFrameWriter. In summary, ORC is a highly efficient, compressed columnar format capable of storing petabytes of data without compromising read speed.

Columnar storage is great when your input side is large and your output is a filtered subset: going from big to little is where it shines. It is not as beneficial when the input and output are about the same size. Columnar is best for big data use cases because the majority of analytical queries rely on aggregation; applying MIN, MAX, SUM, or any other aggregate to a column is faster in a columnar format because the engine acts directly on that column. An aggregation over a particular set of columns is therefore many times faster than the same aggregation over row-based storage. Typically, a table in a warehouse database has 50+ columns because the data is kept in normalized form, and the more columns a table has, the more advantageous columnar storage becomes.

As for the use case comparison between ORC and Parquet: ACID transactions are only possible when using ORC as the file format. Although ORC supports ACID transactions, it is not designed to support OLTP requirements; if a record is deleted or updated, the change is not immediately reflected in applications accessing the data. ORC does support streaming ingest into Hive tables, where streaming applications like Flume or Storm write data into Hive, transactions commit once a minute, and queries see either all of a transaction or none of it.

To follow along, create a SparkSession and a sample DataFrame (the master, app name, and data rows here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local[1]").appName("SparkORC").getOrCreate()
val columns = Seq("firstname","middlename","lastname","dob","gender","salary")
val data = Seq(("James","","Smith","1991-04-01","M",3000), ("Maria","Anne","Jones","1992-12-01","F",4000))
val df = spark.createDataFrame(data).toDF(columns:_*)
```

Writing this DataFrame to an ORC file and reading it back is a one-line call in each direction; see the first sketch at the end of this section.

In Spark, we can improve query execution in an optimized way by partitioning the data using the partitionBy() method; an example of partitionBy() also appears in the sketches below. When you check the resulting people.orc file, it has two levels of partitions inside, "gender" followed by "salary". A further sketch below shows reading the partitioned ORC file into a DataFrame with gender=M only.

Writing ORC files without compression results in larger disk usage and slower performance, so it is advisable to use compression. Here is a basic comparison of ZLIB and SNAPPY and when to use which (a sketch of setting the codec also follows below):

- ZLIB: when you need faster reads, ZLIB is without doubt the go-to option, and it also takes less storage on disk than SNAPPY; it is, however, slightly slower to write. For smaller datasets, ZLIB is still the better choice.
- SNAPPY: faster to write; if you have a large dataset to write, use SNAPPY.

Now let's walk through executing SQL queries on the ORC file without creating a DataFrame first. In order to execute SQL queries, create a temporary view or table directly on the ORC file instead of creating it from a DataFrame:

```scala
spark.sql("CREATE TEMPORARY VIEW PERSON USING orc OPTIONS (path \"/tmp/orc/data.orc\")")
spark.sql("SELECT * FROM PERSON").show()
```

Here, we created a temporary view PERSON from the ORC file data.orc. When we execute a query such as this on the PERSON view, it scans through all the rows and returns the selected columns back, with output headed by:

|firstname|middlename|lastname| dob|gender|salary|

We can also create a temporary view on a Spark DataFrame that was created from an ORC file and run SQL queries on it. These views are available until your program exits:

```scala
spark.read.orc("/tmp/orc/data.orc").createOrReplaceTempView("ORCTable")
val orcSQL = spark.sql("select firstname,dob from ORCTable where salary >= 4000")
```

In this example, the physical table scan loads only the columns firstname, dob, and salary at runtime, without reading all columns from the file system.
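Below are the sketches referenced earlier. First, the basic write-and-read round trip through orc() on DataFrameWriter and DataFrameReader; a minimal sketch assuming the df created above, with an illustrative path and save mode:

```scala
// Write the DataFrame created above to an ORC file; the path and
// "overwrite" save mode are illustrative, not the post's exact choices.
df.write.mode("overwrite").orc("/tmp/orc/data.orc")

// Read the ORC file back into a DataFrame.
val orcDF = spark.read.orc("/tmp/orc/data.orc")
orcDF.show()
```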
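Next, the partitionBy() example; a sketch assuming the same df, with the people.orc path chosen to match the layout described above:

```scala
// Partition by gender, then salary; the output directory gains a
// two-level tree such as /tmp/orc/people.orc/gender=M/salary=4000/.
df.write.partitionBy("gender","salary").mode("overwrite").orc("/tmp/orc/people.orc")
```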
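Reading the partitioned output back with gender=M only; one minimal approach is to point the reader at the partition directory itself (a filter on the partition column would be pruned to the same effect):

```scala
// Point the reader at the gender=M partition directory; Spark reads
// only that branch of the tree and skips all other gender partitions.
val maleDF = spark.read.orc("/tmp/orc/people.orc/gender=M")
maleDF.show()
```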
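Finally, selecting a compression codec at write time; a sketch using Spark's standard ORC option values snappy and zlib (the paths are illustrative):

```scala
// SNAPPY: faster writes, a good fit for large datasets.
df.write.mode("overwrite").option("compression","snappy").orc("/tmp/orc/data_snappy.orc")

// ZLIB: smaller files and faster reads, a good fit for smaller datasets.
df.write.mode("overwrite").option("compression","zlib").orc("/tmp/orc/data_zlib.orc")
```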
In order to read ORC files from Amazon S3, prefix the path with the S3 connector scheme (typically s3a://) and make sure the required third-party dependencies and credentials are in place.
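A minimal sketch, assuming the s3a connector, which needs the hadoop-aws dependency (with its matching AWS SDK) on the classpath; the bucket name and credential wiring are illustrative:

```scala
// Supply S3 credentials through the Hadoop configuration; in practice
// these often come from instance roles or environment variables instead.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Read ORC from S3 using the s3a:// scheme; bucket and key are illustrative.
val s3DF = spark.read.orc("s3a://my-bucket/orc/data.orc")
s3DF.show()
```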