Part 2: Query CSV & JSON files
What is this article about?
In this article, we will start querying CSVs and JSONs in Athena and creating new tables from existing tables (ETL Process)
This article is mainly for BI developers who want to extend their capabilities to handle Big Data, and finished successfully part one.
Understanding the “Create Table” Statement
In this case, I needed to create 2 tables that holds you tube data from S3.
I will present two examples - one over CSV Files and another over JSON Files.
After that, we will create tables for those files, and join both tables.
The whole process is as follows:
Query CSV Files
- Download the attached CSV Files. Because the data is structured - this use case is simpler.
The Table is for the Ingestion Level (MRR) and should be named – YouTubeVideosShorten.
- Create the Folder in which you save the Files and upload both CSV Files.
- Follow the instructions from the first Post and create a table in Athena
- After creating your table – make sure You see your table in the table list.
The Table for the Ingestion Level (MRR) YouTubeVideosShorten:
Explanations:
- Create External Table -> external means the Data does not reside in Athena but remains in S3.
- Row Format Serde -> state the SerDe (Serializer/Deserializer) that should be used in the reading of the Data
- Location -> the folder the files should be read from
- TBLProperties -> additional relevant properties
- The skip.header.line.count specify how many rows to skip when reading each file
At this point, you can query this “Table”:
** If you have a lot of Data in your Table, and you don’t want your boss to receive a large bill at the end of the month, write the ‘LIMIT’ keyword at the end of your query.
When you are ready to run the data query, omit the LIMIT clause.
Query JSONs
Because the data is semi-structured - this use case is a little more difficult
The Table for the Ingestion Level (MRR) YouTubeStatisctics.
- Download the attached JSON Files. Because the data is semi-structured - this use case is a little more difficult.
- The Table is for the Ingestion Level (MRR) and should be named – YouTubeStatisctics.
- Create the Folder in which you save the Files and upload both JSON Files.
- Follow the instructions from the first Post and create a table in Athena.
- After creating your table – make sure you see your table in the table list.
The Columns Items looks terrifying, but really it is not.
This JSON consists of one row.
However, the value of ‘item’ includes more complex types.
For example, one value of ‘item’ could be:
Notice:
- JSON are used as STRUCT
- List are used as ARRAY
Our example describes a STRUCT within a STRUCT.
That is exactly the schema written in items.
- Query the JSON:
Explanations:
- Here, we query the youtubestatistics table (which we defined earlier) and add the UNNEST(items) t(inr). This is like saying, take all my values in items and make each of them a row in a new table named inr. For example, to get the value ‘kind’ within the ‘items’, the syntax should be: inr.kind.
- the “$path” value is the path to our file. One of the values that returns to me is:
s3://rainbowdash/YouTubeStatistics/US_category_idTest.json
The url_extract_path("$path") value is the path inside our bucket. One of the values that returns to me is:
/YouTubeStatistics/US_category_idTest.json
The ‘substr’ function returns the ‘US’ string that I want.
And that’s it! You now know how to query JSON in Athena. Isn’t that amazing?
Creating a new table
- The new table we create will be named - YouTubeCategories.
- We will extract categories from the Json file.
- Run the following query:
We are going to use only the country, id and title of the categories. All other values are irrelevant for us.
- Insert rows to this Table using this Query:
Joining the Tables
Now we are going to JOIN both our youtubeCategories and the YouTubeVideoShorten and insert the unified Table into YouTubeFact, first, let’s create this table.
The Table for the DWH Level:
** I added the column ‘duration’ – the time in hours between publish_time and trending_date.
** I added the column ‘country’ because we have 2 files: one for the US and one for CA
and each of them belongs to another country (written in the File Name).
** I added the column ‘Category’ which is derived from JSON files.
Insert rows to the Fact Table
- In this step we will join the tables: YouTubeCategories and YouTubeVideosShorten.
- We need to Join these Tables by Category_Id and by Country.
- Here is an SQL that will answer it:
You should now be able to understand all of the intricate parts here.
Bonus:
- The Process I developed here is not best practice (the YouTubeShorten should be transformed before I joined it with YouTubeCategoies – can you tell why?).
- Try to develop an STG table for youtubeshorten before joining it to categories.
- Note that the Fact table schema shouldn’t be changed.
- Next sessions:
- Using Partitions
- Save data as columnar
- Why do we do these two things? Because they save money.