Pandemic, Hot Indian Summers, Web Scraping and Refrigerators

Summers and Pandemic

Just imagine: the hot Indian summer was about to start, the city was under a pandemic lockdown with no free movement and almost all retail businesses closed, and your wife breaks the news that your 10-year-old refrigerator has stopped working. A gripping situation nobody would want to be in.

Now the tough decision: get it repaired or buy a new refrigerator. Why tough? Because of the slowdown and the pandemic there had been a salary cut, so anyone can imagine my situation. But finally the tough decision was taken jointly (by WIFE ONLY) that we had to buy a new refrigerator.

By now it was clear that I had to make a really smart and informed decision, so I decided to test my newly learned skill: WEB SCRAPING using Python.

So I scraped the prices of all refrigerators listed on Amazon.

Here is a glimpse of the data scraped from Amazon:

## # A tibble: 536 x 14
##       sn   MRP discription                      offer_price Rating1               Brand  Type   litres Capcity  Star_Rating Technology   Doors  final_rating views
##    <int> <int> <chr>                                  <int> <chr>                 <chr>  <chr>   <dbl> <chr>    <chr>       <chr>        <chr>         <dbl> <int>
##  1     1 69999 AmazonBasics 468 L Frost Free S~       38999 "4.1 out of 5 stars\~ Amazo~ Frost~    468 More th~ No Star     ""           Side ~          4.1   893
##  2     2 18400 Haier 195 L 4 Star Direct-Cool ~       13440 "\n\n\n\n\n\n\n4.2 o~ Haier  Direc~    195 121L to~ 4 Star      ""           Singl~          4.2  1433
##  3     3 14990 Samsung 192 L 2 Star Direct Coo~       12290 "\n\n\n\n\n\n\n4.3 o~ Samsu~ Direc~    192 121L to~ 2 Star      ""           Singl~          4.3   634
##  4     4    NA Whirlpool 190 L 3 Star Direct-C~          NA "\n\n\n\n\n\n\n4.2 o~ Whirl~ Direc~    190 121L to~ 3 Star      ""           Singl~          4.2  3673
##  5     5 11500 Haier 53 L 2 Star Direct-Cool S~        8900 "\n\n\n\n\n\n\n4.0 o~ Haier  Direc~     53 Less th~ 2 Star      ""           Singl~          4    1960
##  6     6 28250 Whirlpool 265 L 3 Star Inverter~       24990 "\n\n\n\n\n\n\n4.1 o~ Whirl~ Frost~    265 231L to~ 3 Star      "Inverter T~ Doubl~          4.1  1107
##  7     7 28990 Samsung 253 L 3 Star with Inver~       23090 "\n\n\n\n\n\n\n4.2 o~ Samsu~ Frost~    253 231L to~ 3 Star      "Inverter T~ Doubl~          4.2  1203
##  8     8 18990 Samsung 230 L 3 Star Inverter D~       15890 "\n\n\n\n\n\n\n4.1 o~ Samsu~ Direc~    230 201L to~ 3 Star      "Inverter T~ Singl~          4.1   414
##  9     9 30690 LG 260 L 3 Star Frost Free Doub~       25290 "\n\n\n\n\n\n\n4.4 o~ LG     Frost~    260 231L to~ 3 Star      "Inverter T~ Doubl~          4.4  2809
## 10    10 20990 Samsung 198 L 5 Star Inverter D~       17790 "\n\n\n\n\n\n\n4.4 o~ Samsu~ Direc~    198 121L to~ 5 Star      "Inverter T~ Singl~          4.4   952
## # ... with 526 more rows

In total 536 SKUs were scraped, but I found that some retailers had wrongly latched their products (like refrigerator mats and covers) onto this category; Amazon needs to take care of this. The data definitely required some cleaning, but before that I thought I would look at the SKUs listed for each brand.

For each SKU we have the MRP (INR), offer price, rating and a short description from which we can derive the brand name, capacity in litres, type of technology (Direct Cool vs Frost Free), number of doors, star rating, etc.
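For reference, here is a minimal sketch (not necessarily the exact cleaning code used for this post) of how two of these fields can be derived from the description, assuming the data frame is called data with the discription column shown above:

library(dplyr)
library(stringr)

data <- data %>%
  mutate(
    Brand  = word(discription, 1),                                     # first word of the description, e.g. "Samsung"
    litres = as.numeric(str_extract(discription, "\\d+(?=\\s*L\\b)"))  # the number just before " L"
  )

With the brand field in place, here is the SKU count by brand: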

data %>% drop_na(Brand) %>% filter(Brand != "" & Brand != "NA") %>% group_by(Brand) %>% tally() %>% arrange(desc(n))
## # A tibble: 20 x 2
##    Brand              n
##    <chr>          <int>
##  1 "Samsung"        129
##  2 "LG"             116
##  3 "Whirlpool"       93
##  4 "Haier"           64
##  5 "Godrej"          55
##  6 "Panasonic "      20
##  7 "Voltas"          11
##  8 "LEONARD"          7
##  9 "Bosch"            5
## 10 "Blue Star"        4
## 11 "Rockwell"         4
## 12 "Gem"              2
## 13 "Onida "           2
## 14 "AmazonBasics"     1
## 15 "Hisense"          1
## 16 "Koryo"            1
## 17 "Llody "           1
## 18 "LLOYD"            1
## 19 "Siemens"          1
## 20 "Videocon"         1

Some key observations:

  • As expected, the top 3 brands listed on Amazon were Samsung (25%), LG (22%) and Whirlpool (17%).
  • Interestingly, Haier also had a good presence online (accounting for 12%).
  • For some of the older Indian players like Godrej and Voltas, either this sales channel is not that important or they genuinely have very few SKUs in their product portfolio. Either way, these two brands were not part of my consideration set, as their brand salience is not that great (I haven't seen any of their ads lately – TV, print or even digital).

Now the question is: Direct Cool vs Frost Free.

Type              Samsung    LG   Whirlpool   Haier   Godrej   Panasonic   Voltas   LEONARD
Direct Cool            57    50          48      32       39           4        3         0
Freezer                 0     1           0       0        1           0        3         0
Frost Free             72    64          45      32       15          16        5         0
Mini Freeze             0     1           0       0        0           0        0         5
Water Dispenser         0     0           0       0        0           0        0         2
Total                 129   116          93      64       55          20       11         7
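For the curious, a crosstab like the one above can be built from the same data frame along these lines (a sketch; column names as in the glimpse shown earlier):

library(dplyr)
library(tidyr)

data %>%
  filter(!is.na(Brand), Brand != "") %>%      # keep rows with a usable brand
  count(Type, Brand) %>%                      # SKUs per type and brand
  pivot_wider(names_from = Brand, values_from = n, values_fill = 0)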

Both Samsung and LG had slightly more Frost Free SKUs, whereas Whirlpool had slightly more Direct Cool refrigerators listed on Amazon.

One of the inputs I got from my Home Ministry was that this time we should buy a premium refrigerator. So I decided to look for SKUs at or above the 80th percentile of price.

So it is important to know what price represents the 80th percentile, which is:

##   80% 
## 34518

But it makes more sense to know the 80th percentile by type of refrigerator. So let's look first at Direct Cool refrigerators:

##     80% 
## 18493.6

Now for Frost Free refrigerators:

##   80% 
## 49426
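Both numbers come from a quantile calculation along these lines (a minimal sketch on the same data frame, not necessarily the exact code used):

library(dplyr)

quantile(data$offer_price, probs = 0.80, na.rm = TRUE)   # overall 80th-percentile offer price

data %>%
  filter(!is.na(offer_price)) %>%
  group_by(Type) %>%
  summarise(p80 = quantile(offer_price, probs = 0.80))   # 80th percentile by refrigerator type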

Now let us see what options are available in both types if I want to buy a premium refrigerator.

Direct Cool

data %>% drop_na(Brand) %>% filter(Brand != "" & Brand != "NA") %>% drop_na(offer_price) %>% filter(Type == "Direct Cool") %>% filter(offer_price > 18500) %>% group_by(Brand) %>% tally() %>% arrange(desc(n))
## # A tibble: 5 x 2
##   Brand         n
##   <chr>     <int>
## 1 LG           12
## 2 Samsung      11
## 3 Whirlpool     5
## 4 Haier         2
## 5 Godrej        1

Very interesting – in the premium range of Direct Cool refrigerators, LG (12 SKUs) and Samsung (11 SKUs) together account for roughly three quarters of the listings.

Frost Free

data %>% drop_na(Brand) %>% filter(Brand != "" & Brand != "NA") %>% drop_na(offer_price) %>% filter(Type == "Frost Free") %>% filter(offer_price > 49000) %>% group_by(Brand) %>% tally() %>% arrange(desc(n))
## # A tibble: 6 x 2
##   Brand            n
##   <chr>        <int>
## 1 "LG"            17
## 2 "Samsung"       15
## 3 "Haier"          4
## 4 "Panasonic "     4
## 5 "Bosch"          3
## 6 "Siemens"        1

44 SKUs in total – mostly LG and Samsung… Whirlpool is missing.

But when I looked at all these SKUs, I found that many of them were side-by-side refrigerators, which are difficult to fit into metro flats because of space constraints. I was clear that we had to buy either a two-door or three-door refrigerator, but not a side-by-side.

So now I had to filter the shortlisted SKUs down to those that are not side by side.

library(kableExtra)
d1 <- data %>% drop_na(Brand) %>% filter(Brand != "" & Brand != "NA") %>% drop_na(offer_price) %>% filter(Type == "Frost Free") %>% filter(offer_price > 49500) %>% select("sn", "MRP", "discription", "offer_price", "litres", "Technology", "Doors", "final_rating", "views") %>% filter(Doors != "Side by Side")

kable(d1) %>%
  kable_styling()
sn MRP discription offer_price litres Technology Doors final_rating views
77 89000 Samsung 670 L 2 Star ( 2019 ) Frost Free Double Door Refrigerator(RT65K7058BS/TL, Black inox, Convertible, Inverter Compressor) 67900 670 Inverter Technology Double Door 3.8 20
102 73000 Samsung 523 L 2 Star ( 2019 ) Frost Free Double Door Refrigerator(RT54K6558SL/TL, Silver) 57390 523 Double Door 4.0 27
163 79500 Bosch 559 L 2 Star Inverter Frost Free Double Door Refrigerator (Series 6 KGN56LB41I, Black, Bottom Freezer) 65890 559 Inverter Technology Double Door 4.3 26
184 87990 LG 630 L 3 Star Inverter Frost-Free Double Door Refrigerator (GR-H812HLHQ, Shiny Steel) 72580 630 Inverter Technology Double Door 5.0 1
204 69090 LG 471 L 2 Star LG ThinQ(Wi-Fi) Inverter Linear Frost-Free Double-Door Refrigerator (GL-T502FPZU, Shiny Steel, Convertible) 56900 471 Inverter Technology Double Door 3.9 18
222 112000 Panasonic Econavi 601 L 6-Stage Inverter Frost-Free Multi-Door Refrigerator (NR-DZ600GKXZ, Black Glass, Powered by Artificial Intelligence) 88980 601 Inverter Technology 4.5 8
256 63990 Samsung 407 L 3 Star Inverter Frost-Free Double Door Refrigerator (RT42T5C5EDX/TL, Luxe Brown) 49690 407 Inverter Technology Double Door NA NA
266 105000 Panasonic Econavi 551 L 6-Stage Inverter Frost-Free Multi-Door Refrigerator (NR-CY550GKXZ, Black Glass, Powered by Artificial Intelligence) 88000 551 Inverter Technology Double Door 4.7 8
273 59200 Samsung 415 L 3 Star Frost Free Double Door Refrigerator(RT42M553ESL/TL, Real Stainless, Convertible, Inverter Compressor) 53999 415 Inverter Technology Double Door 3.3 10
274 76990 Samsung 551 L 2 Star Inverter Frost-Free Double Door Refrigerator (RT56T6378BS/TL, Black Inox) 65490 551 Inverter Technology Double Door 3.6 4
277 72990 LG 471 L 3 Star LG ThinQ(Wi-Fi) Inverter Linear Frost-Free Double-Door Refrigerator (GL-T502FASN, Amber Steel, Convertible) 60656 471 Inverter Technology Double Door 4.3 41
321 67690 LG 437 L 3 Star LG ThinQ(Wi-Fi) Inverter Linear Frost-Free Double-Door Refrigerator (GL-T432FASN, Amber Steel, Convertible) 55041 437 Inverter Technology Double Door 4.2 50
329 67750 Samsung 478 L 2 Star (2019) Frost Free Double Door Refrigerator(RT49K6338BS/TL, Black inox, Convertible, Inverter Compressor) 53800 478 Inverter Technology Double Door 3.6 21
364 71890 LG 471 L 2 Star Inverter Linear Frost-Free Double Door Refrigerator (GL-T502FRS2, Russet Sheen, Wi-Fi | Convertible) 60335 471 Inverter Technology Double Door 5.0 1
415 73190 LG 471 L 3 Star Inverter Frost-Free Double Door Refrigerator (GL-T502FES3, Ebony Sheen) 58320 471 Inverter Technology Double Door 5.0 2

From these 15 SKUs, I shortlisted 3: one Samsung SKU and two LG SKUs.

  • “Samsung 523 L 2 Star ( 2019 ) Frost Free Double Door Refrigerator (RT54K6558SL/TL, Silver)” with a 4.0 rating, at an offer price of 57390 and a capacity of 523 litres
  • “LG 471 L 3 Star LG ThinQ(Wi-Fi) Inverter Linear Frost-Free Double-Door Refrigerator (GL-T502FASN, Amber Steel, Convertible)” with a 4.3 rating, at an offer price of 60656 and a capacity of 471 litres
  • “LG 437 L 3 Star LG ThinQ(Wi-Fi) Inverter Linear Frost-Free Double-Door Refrigerator (GL-T432FASN, Amber Steel, Convertible)” with a 4.2 rating, at an offer price of 55041 and a capacity of 437 litres

When I showed all 3 refrigerators to my wife, the first thing she asked about was the online reviews: how had consumers reacted?

So I decided to download all the reviews of these 3 SKUs and then try to take a more informed decision (which I will cover in my next blog).

Anyone who wants to play with this data can download it from here: Data

Petrol & Politics

I am a strong believer that “any information without the data is just an opinion”, which may or may not be true.

Recently we have all been hearing and reading various leaders' opinions about the increase in fuel prices and the present government's failure to control it, with lots of comparisons made with the previous governments (especially UPA 1 and UPA 2).

Although I have very little information about how dynamic petrol pricing works, these debates made me look for more data and information to form my own opinion, especially about petrol prices and how they have moved over time.

Petrol Price Build up

IOCL also provides historical data for petrol and diesel prices.

I took this data, plotted it on a Power BI dashboard, and observed that for Delhi, over UPA 1's term petrol prices increased by 25%, over UPA 2's term the increase was a whopping 60%, and under the current government, till date (11/09/2018), the increase has been only 13% in the last 4 years, even though international crude prices per barrel have gone up and the dollar has appreciated.

Here is the Power BI dashboard – use this data to form your own opinion.

 

Visualization & Data Exploration

In any data science project, once data cleaning is over, the most important step is forming hypotheses and spending time with the data.

Presently a lot of drag-and-drop software is available that helps with data exploration and hypothesis generation.

Fortunately, I have worked on two of the most powerful such tools available in the market – Tableau and Power BI.

Here is my first attempt to publish my learning in Power BI.

Below is the link to the first report I created in Power BI (again, the data source is RBI transaction data):

Digital Money

Digitalization, India, New Products & Forecasting

Digitalization and Demonetization

India is the land of culture, the land of opportunities and the land of 130 crore consumers.
Over the years, India has not been able to grow to its actual potential; the major roadblock was, and is, corruption.

The present Indian government's very unpopular move of demonetization aimed to curb corruption and push India towards digitalization.

Although demonetization created a lot of havoc and problems, it also created opportunities for new products.

I don't know whether it curbed corruption or not, but I can see a clear push towards a digital banking ecosystem and new products.


[Plot: debit card usage at point of sale (POS) around demonetization]


We can clearly observe an exponential jump in the usage of plastic money at the point of sale (POS) during and after demonetization. In December 2016, debit card usage at POS was almost 4 times that of October.
We all expected this increase to ease off once the situation became normal and new currency was easily available; even today, many traders in India don't encourage payments through plastic money, as these transactions come under income tax.

Though the situation has become normal now, usage of plastic money at POS is still almost twice what it was before demonetization.


Mobile Money

Demonetization also created an environment where mobile money could grow many fold, and mobile wallets and the newly launched Unified Payments Interface (UPI) got the perfect opportunity to grow.

In October 2016, before demonetization, mobile wallets and UPI together accounted for nearly 100 mn transactions a month; this had grown to 469 mn transactions in April 2018, and as per the latest figures from RBI, UPI alone accounted for a little more than 400 mn transactions in June 2018.

[Plot: mobile wallet and UPI transaction volumes]

The trend is clearly visible – India is slowly moving towards a cashless economy.
But the bigger question of the hour is: do we have the infrastructure ready for this kind of transition?

Forecast for the POS machines

Here is a small attempt to forecast the number of POS machines required over the next year.
The data source is again RBI.

Before modeling the dataset, I follow the guidelines below (courtesy: learnings from the Time Series Analysis course on Coursera).

Guidelines for modeling the time series

  • Trend suggests differencing
  • Variation in variance suggests transformation
  • Common transformation: log, then differencing – also known as the log-return (see the small illustration after this list)
  • ACF suggests order of moving average process (q)
  • PACF suggests order of autoregressive process (p)
  • Akaike Information Criterion (AIC)
  • Sum of squared errors (SSE)
  • Ljung-Box Q-statistics
  • Estimation!
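A quick illustration of the log-return transformation mentioned in the list above, on a toy series growing at a steady 10% per period:

x <- c(100, 110, 121, 133.1)
diff(log(x))   # each value is ~0.0953 = log(1.10), i.e. the constant growth rate

Now, loading the libraries and the POS data: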
library(tidyverse)
library(forecast)
library(astsa)
library(knitr)

load(file="pos_forecasting.rda")
data$Date <- as.Date(data$Date, format= "%d-%m-%Y") 

#top 5 rows of the data

kable (head(data,5))
Bank_Name             Bank_Type            Date         Number_of_POS
Allahabad Bank        Nationalised banks   2011-04-01               0
Andhra Bank           Nationalised banks   2011-04-01            2113
Bank of Baroda        Nationalised banks   2011-04-01            4537
Bank of India         Nationalised banks   2011-04-01            2581
Bank of Maharashtra   Nationalised banks   2011-04-01             481
pos_data<-aggregate(data$Number_of_POS, by=list(data$Date), sum)

names (pos_data)<- c("Date","POS")

#converting the data in ts format

ts_pos<- ts(pos_data[,-1],start = c(2011,4),end = c(2018,5), frequency = 12)

Plotting the data to figure out the trend:

plot (ts_pos, main="Monthly POS machines deployed", ylab="Number of POS Machines deployed")

[Plot: Monthly POS machines deployed]

Clearly we can see there is a trend in the data and the series is not stationary.

We need to remove the trend.

Let's first check whether there is any autocorrelation between different lags of the time series.
For this we will use the Box test.

Box.test (ts_pos,lag=log(length(ts_pos)))
Box-Pierce test

data: ts_pos
X-squared = 273.29, df = 4.4543, p-value < 2.2e-16
The p-value is very small, which indicates there is definitely autocorrelation.

The next step is to remove the trend; I will use the difference operator (diff).

plot(diff(ts_pos), main="Differenced data of 1st order", ylab="Number of POS Machines deployed")

[Plot: differenced data of 1st order]

Box.test (diff(ts_pos),lag=log(length(diff(ts_pos))))
Box-Pierce test

data: diff(ts_pos)
X-squared = 81.76, df = 4.4427, p-value < 2.2e-16

We can still see a trend in the data, so we apply the difference operator again.

plot(diff(diff(ts_pos)), main="Differenced data of order 2", ylab="Number of POS Machines deployed")

[Plot: differenced data of order 2]

Box.test (diff(diff(ts_pos)),lag=log(length(diff(diff(ts_pos)))))
Box-Pierce test

data: diff(diff(ts_pos))
X-squared = 15.703, df = 4.4308, p-value = 0.004967

Now let's look at the ACF and PACF, which also suggest autocorrelation in the data.
acf (diff(diff(ts_pos)), main="ACF of Differenced data")

[Plot: ACF of differenced data]

pacf(diff(diff(ts_pos)), main='PACF of differenced data')

[Plot: PACF of differenced data]

  • ACF – there is a significant spike at lag 1 and one spike at lag 14 (which we can probably ignore and attribute to randomness), suggesting the order of the moving average process.

  • PACF – there is a significant spike at lag 1, suggesting the order of the autoregressive process.

Now let's try different models.

It is clear that d = 2 (we have differenced twice).
Now let's figure out the other parameters of the ARIMA model.

diff.model <- c()
mod1 <- c()
d <- 2

# fit ARIMA(p,2,q) for p, q in 0..3 and record AIC, SSE and the Box test p-value of the residuals
for (p in 1:4) {
  for (q in 1:4) {
    if (p + d + q <= 10) {
      model <- arima(x = ts_pos, order = c((p - 1), d, (q - 1)))
      pval  <- Box.test(model$residuals, lag = log(length(model$residuals)))
      sse   <- sum(model$residuals^2)
      mod1  <- c(model$aic, sse, pval$p.value)
      diff.model <- cbind(diff.model, mod1)
    }
  }
}
colnames(diff.model)<- c ("Arima 0,2,0","Arima 0,2,1","Arima 0,2,2","Arima 0,2,3","Arima 1,2,0","Arima 1,2,1","Arima 1,2,2","Arima 1,2,3","Arima 2,2,0","Arima 2,2,1","Arima 2,2,2","Arima 2,2,3","Arima 3,2,0","Arima 3,2,1","Arima 3,2,2","Arima 3,2,3")
rownames (diff.model)=c('AIC', 'SSE', 'p-value')

kable (diff.model)
Model          AIC       SSE          p-value
ARIMA 0,2,0    2043.90   1.7733e+11   0.0043
ARIMA 0,2,1    2032.64   1.5114e+11   0.8042
ARIMA 0,2,2    2034.15   1.5028e+11   0.9097
ARIMA 0,2,3    2034.00   1.4602e+11   0.9588
ARIMA 1,2,0    2031.58   1.4922e+11   0.9737
ARIMA 1,2,1    2033.57   1.4922e+11   0.9732
ARIMA 1,2,2    2034.89   1.4787e+11   0.9171
ARIMA 1,2,3    2030.73   1.3609e+11   0.9993
ARIMA 2,2,0    2033.57   1.4922e+11   0.9733
ARIMA 2,2,1    2029.86   1.3795e+11   0.9898
ARIMA 2,2,2    2031.23   1.3692e+11   0.9964
ARIMA 2,2,3    2032.56   1.3579e+11   0.9749
ARIMA 3,2,0    2035.33   1.4877e+11   0.9757
ARIMA 3,2,1    2037.35   1.4881e+11   0.9679
ARIMA 3,2,2    2032.91   1.3630e+11   0.9880
ARIMA 3,2,3    2034.75   1.3606e+11   0.9920
From the above we can conclude that:
  • The model with the least AIC is ARIMA(2,2,1).
  • The second best model is ARIMA(1,2,3).
  • The third best model is ARIMA(2,2,2).

Now the choice is between these 3 models.

There is also a function in the forecast library, auto.arima(), which automatically selects the parameters with the least error.

library(forecast)
auto.arima(ts_pos)

Series: ts_pos
ARIMA(2,2,1)

Coefficients:
         ar1     ar2      ma1
      0.4638  0.2614  -0.9687
s.e.  0.1141  0.1113   0.0465

sigma2 estimated as 1.703e+09: log likelihood=-1010.93
AIC=2029.86 AICc=2030.37 BIC=2039.58
According to this function too, the best model is ARIMA(2,2,1).

Now we will use these parameters in our final model.

library(astsa)

sarima(ts_pos,2,2,1,0,0,0)

initial value 10.747222
iter 2 value 10.726566
iter 3 value 10.663160
iter 4 value 10.660967
iter 5 value 10.660754
iter 6 value 10.660753
iter 7 value 10.660729
iter 8 value 10.660648
iter 9 value 10.660585
iter 10 value 10.660574
iter 11 value 10.660567
iter 12 value 10.660528
iter 13 value 10.660371
iter 14 value 10.660275
iter 15 value 10.660237
iter 16 value 10.660214
iter 17 value 10.659064
iter 18 value 10.658786
iter 19 value 10.658386
iter 20 value 10.657945
iter 21 value 10.652212
iter 22 value 10.649269
iter 23 value 10.638113
iter 24 value 10.633268
iter 25 value 10.627282
iter 26 value 10.625633
iter 27 value 10.623322
iter 28 value 10.621746
iter 29 value 10.621737
iter 30 value 10.621721
iter 30 value 10.621721
final value 10.621721
converged
initial value 10.616198
iter 2 value 10.616101
iter 3 value 10.615955
iter 4 value 10.615953
iter 5 value 10.615951
iter 6 value 10.615951
iter 6 value 10.615950
iter 6 value 10.615950
final value 10.615950
converged
[Plot: sarima residual diagnostics]

$fit

Call:
stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc,
REPORT = 1, reltol = tol))

Coefficients:
         ar1     ar2      ma1
      0.4638  0.2614  -0.9687
s.e.  0.1141  0.1113   0.0465

sigma2 estimated as 1.642e+09: log likelihood = -1010.93, aic = 2029.86

$degrees_of_freedom
[1] 81

$ttable
     Estimate      SE    t.value  p.value
ar1    0.4638  0.1141     4.0652   0.0001
ar2    0.2614  0.1113     2.3494   0.0212
ma1   -0.9687  0.0465   -20.8369   0.0000

$AIC
[1] 22.28911

$AICc
[1] 22.31811

$BIC
[1] 21.37472

Looking at the residual plots, we can conclude:

  • There is no significant autocorrelation (ACF of residuals).
  • There are no small p-values, which tells us there is no significant autocorrelation.

We can assume at this point that the residuals are white noise.

So the final model will be

model<-arima(x=ts_pos,order = c(1,2,1))

plot(forecast(model))

[Plot: forecast from the final ARIMA model]

Forecast for the next 12 months

pos_predict<- forecast(model,12)
kable (pos_predict)
          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
Jun 2018         3322859   3268845   3376873   3240252   3405467
Jul 2018         3390114   3288152   3492077   3234177   3546052
Aug 2018         3460657   3297580   3623734   3211252   3710061
Sep 2018         3529929   3298050   3761808   3175300   3884558
Oct 2018         3599692   3291072   3908312   3127698   4071685
Nov 2018         3669265   3277079   4061451   3069469   4269061
Dec 2018         3738912   3256727   4221096   3001474   4476349
Jan 2019         3808530   3230390   4386669   2924342   4692718
Feb 2019         3878159   3198441   4557877   2838621   4917697
Mar 2019         3947784   3161166   4734401   2744756   5150812
Apr 2019         4017410   3118822   4915999   2643137   5391684
May 2019         4087036   3071626   5102447   2534100   5639973

Around 4 million POS machines will need to be deployed by May 2019.
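That figure can be read straight off the forecast object created above:

round(tail(pos_predict$mean, 1) / 1e6, 2)   # point forecast for May 2019, in millions (~4.09)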

In May 2016, the big 4 banks (SBI, HDFC, ICICI and Axis Bank) accounted for 75% of the POS machines deployed. The same 4 banks accounted for 67% in May 2017 and 57% in May 2018. New players like RBL and Corporation Bank are getting very aggressive in deploying POS machines.

Word Cloud by Sentiment

The basic aim of this post is to chart word clouds by positive and negative sentiment.
Here I will be using the well-known sentiment-scoring approach from Jeffrey Breen's case study (which benchmarked airline tweet sentiment against the American Customer Satisfaction Index, https://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/) to figure out the sentiment scores.

The text data was fetched from Twitter using the twitteR package for the keyword "Dish TV".
The tweets were collected from 2014/04/25 to 2014/05/01.
I will not describe the tweet-collection step in detail here, but a minimal sketch follows.
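For completeness, a minimal sketch of how such tweets could be collected with twitteR (the OAuth credentials below are placeholders; the object name data.df3 matches what is loaded from dish.Rdata next):

library(twitteR)

# placeholder credentials - replace with your own app keys
setup_twitter_oauth("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

tweets   <- searchTwitter("Dish TV", n = 2000,
                          since = "2014-04-25", until = "2014-05-01")
data.df3 <- twListToDF(tweets)   # flatten the list of status objects into a data frame

In this post, though, I simply load the tweets already saved in dish.Rdata: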

load("dish.Rdata")                 # loads the tweet data frame data.df3
dish <- as.data.frame(data.df3)
head (dish$text)
## [1] "#OTeri what a film showing now @DishTV_India"                                                                                                    
## [2] "@DishTV_India @StarSportsIndia  Dish list bothered about their customers.  We have beeen requesting to ss2 ss hd1 &amp; sshd2 todish tv subscrib"
## [3] "Appeal to @DishTV_India to block @timesnow dont want this rotten apple for us."                                                                  
## [4] "Kings XI Punjab team reaches\nIndia for the Second Leg of #IPL\n\nALL PLYEAR SWIMING \n\n#A_A_J http://t.co/TdnsVwcRCK"                          
## [5] "Today No Match In IPL 7 2014\nRest Day.\nTomorrow IPL 7 2014:\n2nd Round Will Start in India:\nCSK _vs_ KK\nWill In action 07:3OPM \nA.s"        
## [6] "RT @HBOHITS: All men must die...except the one who shall rule the realm! Express your desire and your favourite #GOTCharcter. #TakeTheThrone"

Now writing a cleaning function to clean the tweets:

clean.text = function(x)
{
   # tolower
   x = tolower(x)
   # remove "rt" (retweet marker); note this simple pattern also strips "rt" inside words
   x = gsub("rt", "", x)
   # remove at
   x = gsub("@\\w+", "", x)
   # remove punctuation
   x = gsub("[[:punct:]]", "", x)
   # remove numbers
   x = gsub("[[:digit:]]", "", x)
   # remove links http
   x = gsub("http\\w+", "", x)
   # remove tabs
   x = gsub("[ |\t]{2,}", "", x)
   # remove blank spaces at the beginning
   x = gsub("^ ", "", x)
   # remove blank spaces at the end
   x = gsub(" $", "", x)
   return(x)
}

Cleaning the tweets first

dish$text<- clean.text (dish$text)
head (dish$text)
## [1] "oteri what a film showing now"                                                                                           
## [2] "dish list bothered about their customerswe have beeen requesting to ss ss hd amp sshd todish tv subscrib"                
## [3] "appeal toto blockdont want this rotten apple for us"                                                                     
## [4] "kings xi punjab team reaches\nindia for the second leg of ipl\n\nall plyear swiming \n\naaj"                             
## [5] "today no match in ipl\nrest day\ntomorrow ipl\nnd round will sta in india\ncsk vs kk\nwill in action opm \nas"           
## [6] "all men must dieexcept the one who shall rule the realm express your desire and your favourite gotcharcter takethethrone"

Using the sentiment function as described by Jeffrey Breen:

score.sentiment = function(sentences, pos.words, neg.words)
{
  require(plyr)
  require(stringr)

  # we got a vector of sentences. plyr will handle a list
  # or a vector as an "l" for us
  # we want a simple array ("a") of scores back, so we use 
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {

    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)

    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)

    return(score)
  }, pos.words, neg.words)

  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

Now providing the input to the function

hu.liu.pos=scan('positive-words.txt',what='character',comment.char =";")
hu.liu.neg=scan('negative-words.txt',what='character',comment.char =";")
dish$text <- as.factor (dish$text)
# adding few more terms to both positive and negative dictionary
pos.words = c(hu.liu.pos, 'upgrade','thanks','thnx')
neg.words = c(hu.liu.neg, 'wtf', 'wait','waiting', 'epicfail', 'mechanical', 'problem','noresponse','nothing', 'cheated','robbery','fake','drops','awful','disconnected','dropped', 'looting','network issues',"issue", "never","refund", "none","lazy","difficulties","inconvenience", "not working", "wrong", "cheated", "not satisfied", "none", "not reachable","irregularities", "does not have","Not happy","does not","improve", "should'nt","disconnect", "communication gap", "didnot", "did not", "not clear","problem", "expensive","unfair","not working","delay", "not very responsive", "unused", "lousy experience","dupped","lost","sorry","cheated","insanely expensive","expensive","not explain", "poor","not activated","bad experience","not at all working","waste","did not work","no one helping","chasing","never","horrible","hidden","cheaing","ripped","nt wrked","non","expensive","irritating","didn't appreciate","PATHETIC","pathetic","stopped working","Loss","nor","no support","refunded","Not worked","barred","Overcharged","Idiots" )
dish.score = score.sentiment(dish$text,pos.words,neg.words)

head (dish.score)
##   score
## 1     0
## 2    -1
## 3     0
## 4     0
## 5     0
## 6     0
##                                                                                                                       text
## 1                                                                                            oteri what a film showing now
## 2                 dish list bothered about their customerswe have beeen requesting to ss ss hd amp sshd todish tv subscrib
## 3                                                                      appeal toto blockdont want this rotten apple for us
## 4                              kings xi punjab team reaches\nindia for the second leg of ipl\n\nall plyear swiming \n\naaj
## 5            today no match in ipl\nrest day\ntomorrow ipl\nnd round will sta in india\ncsk vs kk\nwill in action opm \nas
## 6 all men must dieexcept the one who shall rule the realm express your desire and your favourite gotcharcter takethethrone

Now we have a sentiment score against each tweet. Next, divide the data into positive and negative sentiments.

df.negaitive<-subset(dish.score, score < 0)
df.positive<-subset(dish.score, score > 0)
df.negaitive.text<- df.negaitive$text
df.positive.text<- df.positive$text

Removing stop-words from both positive and negative tweets

library(tm)
tdm_positive = Corpus(VectorSource(df.positive.text))
tdm_negative = Corpus(VectorSource(df.negaitive.text))
tdm_positive = TermDocumentMatrix(tdm_positive, control = list(removePunctuation = TRUE,stopwords = c("dish", "dishtwicket","dishtvindia", "dishtv", "na", "amp", "mivscsk", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))

tdm_negative = TermDocumentMatrix(tdm_negative, control = list(removePunctuation = TRUE,stopwords = c("dish", "dishtvindia", "dishtwicket","dishtv", "na", "amp", "mivscsk", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))

Creating Positive Word-Cloud

library(wordcloud)
library(RColorBrewer)
#we define tdm as matrix
m = as.matrix(tdm_positive) 

#now we get the word orders in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE) 

#we create our data set
dm = data.frame(word=names(word_freqs), freq=word_freqs)

wordcloud(dm$word, dm$freq, min.freq=2,max.words=80,random.order=T, colors=brewer.pal(8, "Dark2")) 

[Plot: positive word cloud]

Creating Negative Word-Cloud

library(wordcloud)
library(RColorBrewer)
#we define tdm as matrix
m = as.matrix(tdm_negative) 

#now we get the word orders in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE) 

#we create our data set
dm = data.frame(word=names(word_freqs), freq=word_freqs)

wordcloud(dm$word, dm$freq, min.freq=2,max.words=80,random.order=T, colors=brewer.pal(8, "Dark2")) 

[Plot: negative word cloud]

First Shiny App – Correspondence Map App

As a market researcher, I have to deal with brand perception data frequently, and one way of analyzing this data is through a "correspondence map" (perceptual map). Analyzing this data in SPSS means following a tedious process and formatting the data in a particular way. I always wanted a tool where I could simply upload the data and have the map drawn. Now, with the help of Shiny, I have created an app where the user can upload the data (in CSV format, with brands as columns and imagery statements as rows – basically brand association data for various imagery statements) and the output is a correspondence map.

https://kapoorabhishek.shinyapps.io/PerceptualMap/
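A correspondence map of this kind is typically produced with a simple correspondence analysis; a minimal non-Shiny sketch of the same idea, assuming a file brand_data.csv with imagery statements as rows and brands as columns, would be:

library(ca)

assoc <- read.csv("brand_data.csv", row.names = 1)   # statements x brands association counts
fit   <- ca(as.matrix(assoc))                        # simple correspondence analysis
plot(fit)                                            # biplot: brands and statements on one map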

 

Analysis of Tweets Collected for Twitter Accounts – @Nissan_India & @Renault_India during Auto Expo 2014

In my earlier post I analysed tweets collected for the keyword #autoexpo2014; in this post we are going to analyse tweets collected for two automotive manufacturers – @Nissan_India and @Renault_India.

Tweets collected during 7 days of Auto Expo 2014

Compared to Renault, the Japanese multinational automotive manufacturer was highly active on Twitter during the Auto Expo. Total tweets collected: Nissan_India = 1726 and Renault_India = 302.

[Image: tweets per day – Nissan_India]

[Image: tweets per day – Renault_India]

Devices used for tweets/RT/replies

For Renault_India, most tweets were posted from a static device – the web accounts for nearly 70% of the devices used to tweet, whereas for @Nissan_India the web accounts for only 56%, and Apple products (iPhone/iPad) account for 17%, which is surprisingly more than Android devices (8.4%).

[Image: devices used for tweets – Nissan_India]

[Image: devices used for tweets – Renault_India]

From where people are tweeting – Location

For Nissan_India, more than a quarter of the tweets come from its headquarters town (Chennai).

[Image: locations from where people tweeted – Nissan_India]

[Image: locations – Renault_India]

If you have any questions, feel free to ask, or follow me on Twitter to get the newest updates about analytics with R and analytics of social data.

Analysis of devices used to post on Twitter – R code

Hi Everybody,

After collecting Twitter data (JULIANHI), interesting analysis can be done on the devices used to tweet.

df <- twListToDF(searchTwitter("@Nissan_India"))   # flatten the list of status objects into a data frame

Now analyzing the source data.

First, cleaning up the source field:

d <- df$source
d <- gsub("  ", " ", d)
d <- gsub("Twitter for", "", d)
d <- gsub("</a>", "", d)
d <- strsplit(d, ">")
d <- sapply(d, function(x) ifelse(length(x) > 1, x[2], x[1]))
t <- sort(table(d), decreasing = TRUE)

top <- 10   # keep only the top 10 devices
if (top < length(t))
  t <- t[1:top]

library (arrayhelpers)

b1 <- array2df (t)
colnames (b1)<- c("Value", "Devices")

rr1<- sum(b1$Value) ###sum of devices mentioned 

Now plotting them using ggplot

library (ggplot2)
library (plyr)   # for ddply()
ce <- ddply(b1, "Devices", transform, percent_weight = Value / sum(b1$Value) * 100)
##plotting the devices
p1 <- ggplot (ce, aes (x=Devices, y=percent_weight, fill =Devices))+geom_bar(stat="identity")+ geom_text(aes(label=sprintf("%1.1f",percent_weight)), hjust=-0.05, colour="black", size=3)+coord_flip()+ ggtitle(paste("@Nissan_India: Sum of total devices=",rr1))

print (p1)

And an interesting output is what we get:

[Image: devices used for tweets – @Nissan_India]

Social media analysis – Tweets collected during Auto Expo 2014

1. Number of tweets collected per day

[Image: tweets collected per day]

In total, 4245 tweets were collected over a span of a week (6 days).
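A minimal sketch of how such per-day counts can be computed, assuming autoexpo.df is the data frame of collected tweets (as returned by twListToDF, which includes a created timestamp column):

library(dplyr)

autoexpo.df %>%
  mutate(day = as.Date(created)) %>%   # date part of the tweet timestamp
  count(day)                           # tweets per day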

2. What devices were used for tweets/RT/Reply

[Image: devices used for tweets]

People in India are shifting more to mobile – the web accounts for only 41%, and interestingly the iPhone is in 3rd position with 12%.

Windows Phone is still not among the top 10 devices.

3. Who re-tweeted most

[Image: top retweeters]

As expected, media accounts retweeted the most – httweets and IndiaToday were the top retweeters.

4. Location from where people have retweeted, tweeted and replied 

[Image: locations of tweets, retweets and replies]

A third of the people mentioned India as their location, and close to 30% mentioned their location as NCR.

5. Top users who tweeted most

[Image: top users by tweet count]

meg_sharma007 with 101 tweets, AutoExpo2014 with 81 tweets and jaidayal with 60 tweets were the top 3 users who tweeted the most.

6. What was tweeted – word cloud

[Image: word cloud of tweets]