Here is a short article gathering the webcast, the slides and the various HiveQL scripts, PowerShell commands, etc. presented during our session "La démystification du Big Data" at TechDays 2014.
The Slides
The Feedback
We received good feedback on our presentation, and the various comments will help us improve a few points.
In total, 48 attendees rated the session; the average score is 3.9/5, which places us in the middle of the ranking 🙂
The Scripts
HiveQL: script for aggregating, transforming and loading the tweets + sentiment analysis + word cloud
-----------------------------------------------------------------------------------------------------------
-- Author : Romain Casteres
-- Blog   : https://pulsweb.azurewebsites.net/
-- Date   : 12/02/2014
-- This HiveQL script aggregates, transforms and loads the tweets (sentiment analysis + word cloud)
-- TechDays 2014
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------
-- HIVE - SQL-like language called HiveQL
-----------------------------------------------------------------------------------------------------------
SET hive.exec.dynamic.partition=true;            -- Allow dynamic partitions
SET hive.exec.dynamic.partition.mode=nonstrict;  -- Allow dynamic partitions

-- Schema on Read / Schema on Write
CREATE EXTERNAL TABLE IF NOT EXISTS KeywordList (
  PKID int, keywords string, SoftwareID int, PlatformID int, SocialNetwork string, Target string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://keywords@apvxdcube.blob.core.windows.net/';

CREATE VIEW Twitterkeyword AS
SELECT PKID AS IdKeywords, Keywords, SoftwareID, PlatformID, Target
FROM Keywordlist
WHERE SocialNetwork = 'TWITTER';

-- Create Sentiment Dictionary (GZip file)
CREATE EXTERNAL TABLE dictionary (word string, polarity float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'wasb://hivescript@apvxdcube.blob.core.windows.net/dictionary';

-- Create Time Zone Map to aggregate the data by country
CREATE EXTERNAL TABLE time_zone_map (time_zone string, country string, notes string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'wasb://hivescript@apvxdcube.blob.core.windows.net/time_zone_map';

-- Create the tweets_raw table containing the records from Twitter: 3 October 2013
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_raw (json_response string)
PARTITIONED BY (dt string)
STORED AS TEXTFILE;

ALTER TABLE tweets_raw ADD IF NOT EXISTS PARTITION (dt='2013-10-03')
LOCATION 'wasb://twitterxbox@apvxdcube.blob.core.windows.net/2013-10-03';

CREATE VIEW tweets_simple AS
SELECT
  get_json_object(json_response, '$.id_str') AS id,
  dt,
  substring(get_json_object(json_response, '$.created_at'), 12, 2) AS hour,
  regexp_replace(regexp_replace(regexp_replace(get_json_object(json_response, '$.text'), '\\t', ''), '\\n', ''), '\\r', '') AS text,  -- regular expression
  cast(get_json_object(json_response, '$.retweet_count') AS int) AS retweet_count,
  get_json_object(json_response, '$.user.screen_name') AS screen_name,
  cast(get_json_object(json_response, '$.user.followers_count') AS int) AS followers_count,
  get_json_object(json_response, '$.user.verified') AS verified,
  get_json_object(json_response, '$.user.profile_image_url') AS profile_image_url,
  get_json_object(json_response, '$.user.time_zone') AS time_zone
FROM tweets_raw
WHERE (LENGTH(get_json_object(json_response, '$.text')) > 1);

-- Create Clean view and add country
CREATE VIEW tweets_clean AS
SELECT
  t.id, t.dt, t.hour, t.text, t.retweet_count, t.screen_name,
  t.followers_count, t.verified, t.profile_image_url,
  CASE WHEN LENGTH(m.country) [...]
-- [...] part of the script is truncated here (see the sentiment-scoring sketch below) [...]
  [...] 5 THEN 'positive' WHEN AVG(polarity) [...] tw.word)
  AND (LENGTH(tw.word) > 3)
  AND (word RLIKE '[a-zA-Z]')
  AND NOT (word RLIKE '[0-9]')   -- Word without numeric character
HAVING (COUNT(tw.id) > 5);

-- Output : External Table ext_tweets
CREATE EXTERNAL TABLE IF NOT EXISTS ext_tweets (
  Keywords string, SoftwareID int, PlatformID int, Target string, Hour string, dt string,
  country string, verified boolean, sentiment string, AvgPolarity float, TweetCount bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'wasb://outputdata@apvxdcube.blob.core.windows.net/Twitter/ext_tweets';

INSERT OVERWRITE TABLE ext_tweets
SELECT
  k.Keywords, k.SoftwareID, k.PlatformID, k.Target,
  td.Hour, td.dt, td.country, td.verified, td.sentiment,
  AVG(td.AvgPolarity) AS AvgPolarity,
  COUNT(td.id) AS TweetCount
FROM Twitterkeyword k
INNER JOIN ext_tweets_details td ON lower(k.Keywords) = lower(td.Keywords)
GROUP BY k.Keywords, k.SoftwareID, k.PlatformID, k.Target, td.Hour, td.dt, td.country, td.verified, td.sentiment;

-- UDF : Add Java Functions
-- Execute on cluster : hive -i C:\apps\dist\_MyScript\Twitter_tech2014.hql

-----------------------------------------------------------------------------------------------------------
-- SQOOP : SQL To Hadoop
-----------------------------------------------------------------------------------------------------------
sqoop export -connect "jdbc:sqlserver://dcubesql2.cloudapp.net:1433;username=###;password=###;database=APVX" -table stg_words -export-dir "wasb://outputdata@apvxdcube.blob.core.windows.net/Twitter/ext_keyword_word" -fields-terminated-by "\t"
sqoop export -connect "jdbc:sqlserver://dcubesql2.cloudapp.net:1433;username=###;password=###;database=APVX" -table stg_tweets -export-dir "wasb://outputdata@apvxdcube.blob.core.windows.net/Twitter/ext_tweets" -fields-terminated-by "\t"
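A part of the script above was lost when it was published (the end of the tweets_clean view, the sentiment-scoring view and the word-cloud query that feeds ext_keyword_word). For readers who only need the general idea, here is a minimal, hedged sketch of the sentiment-scoring step: tokenize each tweet with sentences()/explode(), join every word against the dictionary table and bucket the average polarity per tweet. The view names (words_l1, words_l2, words_scored, tweets_sentiment) and the polarity thresholds are illustrative assumptions, not the ones used during the session (the surviving fragment suggests a threshold of 5 was used there).

-- Hedged sketch (not the original session script): tokenise tweets, score each word
-- against the 'dictionary' table, then bucket the average polarity per tweet.
CREATE VIEW words_l1 AS
SELECT id, words
FROM tweets_clean
LATERAL VIEW explode(sentences(lower(text))) dummy AS words;

CREATE VIEW words_l2 AS
SELECT id, word
FROM words_l1
LATERAL VIEW explode(words) dummy AS word;

CREATE VIEW words_scored AS
SELECT w.id, w.word, COALESCE(d.polarity, 0) AS polarity
FROM words_l2 w
LEFT OUTER JOIN dictionary d ON (w.word = d.word);

-- One sentiment per tweet, derived from the average polarity of its words
CREATE VIEW tweets_sentiment AS
SELECT id,
       AVG(polarity) AS AvgPolarity,
       CASE WHEN AVG(polarity) > 0 THEN 'positive'
            WHEN AVG(polarity) < 0 THEN 'negative'
            ELSE 'neutral'
       END AS sentiment
FROM words_scored
GROUP BY id;

A view of this kind is presumably what populates ext_tweets_details, the table joined in the final INSERT OVERWRITE above.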
PowerShell: copying the HiveQL script to an Azure Blob Storage container
Set-AzureSubscription -SubscriptionName '###' -CurrentStorageAccount '###'
$context = New-AzureStorageContext -StorageAccountName '###' -StorageAccountKey '###'
Set-AzureStorageBlobContent -Container '###' -File 'C:\Users\Administrateur\Desktop\test.txt' -Context $context -Force
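To check that the file really landed in the container, a quick listing of its blobs does the job. This snippet is not part of the original demo; it simply reuses the $context and the '###' placeholders from above:

# List the blobs of the target container to verify the upload
Get-AzureStorageBlob -Container '###' -Context $context | Select-Object Name, Length, LastModified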
PowerShell: executing the HiveQL script
Set-AzureSubscription -SubscriptionName '###' Write-Host "Executing HiveQL Jobs" -f yellow $clustername = "###" $clusterSubscriptionId = (Get-AzureSubscription -Current).SubscriptionId $clusterAdmin = '###' $clusterPassword = '###' $passwd = ConvertTo-SecureString $clusterPassword -AsPlainText -Force $clusterCredentials = New-Object System.Management.Automation.PSCredential ($clusterAdmin, $passwd) $ScriptFolder = "wasb://###/test.hql" $hiveJobVT = New-AzureHDInsightHiveJobDefinition -JobName "MyJobHQL" -File $ScriptFolder $startedHiveJob = $hiveJobVT | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster $clusterName $startedHiveJob | Wait-AzureHDInsightJob -Credential $clusterCredentials
PowerShell: executing the Sqoop commands
Set-AzureSubscription -SubscriptionName '###' Write-Host "Executing Sqoop Jobs" -f yellow $clustername = "###" $clusterSubscriptionId = (Get-AzureSubscription -Current).SubscriptionId $clusterAdmin = '###' $clusterPassword = '###' $passwd = ConvertTo-SecureString $clusterPassword -AsPlainText -Force $clusterCredentials = New-Object System.Management.Automation.PSCredential ($clusterAdmin, $passwd) $sqoop = New-AzureHDInsightSqoopJobDefinition -Command 'sqoop export -connect "jdbc:sqlserver://###.cloudapp.net:1433;username=###;password=###;database=###" -table test -export-dir "wasb://outputdata@###.blob.core.windows.net/Twitter/ext_keyword_trend" -fields-terminated-by "\t" ' $sqoopJob = $sqoop | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster $clusterName $sqoopJob | Wait-AzureHDInsightJob -Credential $clusterCredentials $sqoop = New-AzureHDInsightSqoopJobDefinition -Command 'sqoop export -connect "jdbc:sqlserver://###.cloudapp.net:1433;username=###;password=###;database=###" -table stg_tweets -export-dir "wasb://outputdata@###.blob.core.windows.net/Twitter/ext_tweets" -fields-terminated-by "\t"' $sqoopJob = $sqoop | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster $clusterName $sqoopJob | Wait-AzureHDInsightJob -Credential $clusterCredentials
SSIS: architecture of the package