From StatWiki

Partial Least Squares (PLS) is another method for testing causal models (in addition to the Covariance Based Methods used in AMOS and LISREL). This section on PLS is intended to demystify the process of conducting an analysis, start to finish, using PLS-graph. Trust me, it needs demystification! I am not going to get into the deep logic and math behind the methods I outline here. This wiki is simply intended to be used as a "How To" for PLS-graph. For more references and technical explanations, please refer to Wynne Chin's website. I have also listed several videos for SmartPLS 2.0 at the bottom of this page, and an article about when to choose PLS and how to use it.

Do you know of some citations that could be used to support the topics and procedures discussed in this section? Please email them to me with the name of the section, procedure, or subsection that they support. Thanks!

Installing PLS-graph

Simply installing PLS-graph is a complex process. Instructions for getting the full version can be found here, and there is also a link for the demo version. The demo version is rather limiting, and only allows you to test models with ten variables or fewer. In order to obtain a more useful license, you will have to contact Wynne Chin directly. Sadly, I am not allowed to distribute it freely, and requests directed toward me for the full license must be rejected.

  • IMPORTANT Once you have PLS-graph installed (by clicking on that link and running the file), make sure you have a "license.dat" or "license" file in your plsgraph folder, typically found at C:\Program Files\plsgraph, but sometimes at C:\Program Files (x86)\plsgraph if you are running a 64-bit machine.


Opening PLS-graph

If you get one of the following errors, then you either don't have a valid license from Wynne Chin, or you have not placed the license in the proper directory. See the installation section above for more details.


Linking Data

To "open" or "link" a dataset in PLS-graph, click on File --> Links (NOT File --> Open), then browse for your dataset.

  • Your data must be in .raw format or else it will not show up in the browse window.

To get your data in .raw format, you need to follow the guidelines in this quick tutorial: Creating .raw files

Basically, you need to:

  • Save your dataset as "tab delimited" (.dat) from SPSS or Excel, or whatever program you are using to view your data.
  • Then change the file extension from .dat to .raw (say yes if an error pops up)
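The two steps above can also be scripted; here is a minimal Python sketch (the helper name and file names are hypothetical, and any tool that writes plain tab-delimited text works just as well):

```python
import csv

def save_as_raw(rows, out_path):
    """Write rows (lists of values) as tab-delimited text with a .raw
    extension, so the file shows up in PLS-graph's Links browser."""
    with open(out_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

# Hypothetical example: a header row of variable names, then numeric rows.
save_as_raw([["q1", "q2", "q3"], [5, 4, 3], [2, 7, 1]], "survey.raw")
```

Writing the file with a .raw name directly skips the rename-and-say-yes step entirely.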

If you get the following error when linking data, then there is a problem in the dataset:


The problem is most likely one of the following:

  • [1] You have blank or missing values that have not been recoded.
  • [2] You have non-numeric values (other than variable names in the first row).
  • [3] You have excessively large numbers (e.g., 0.978687677664826355281).
  • [4] You have scientific notation (e.g., 3.23E-08 instead of 0.0000000323).

Fixes for these issues:

  • [1] Replace all missing values in your dataset with a constant that is otherwise unused in the dataset (something like -1). You can do this in Excel or SPSS with a quick Find and Replace (Control+H). Or, you can impute those missing values (if appropriate) in SPSS using the Replace missing values function in the Transform menu.
  • [2] If you have non-numeric data, you need to convert it to numbers (if appropriate). For example, if you have values like "Low" "Medium" "High", instead you need to use something like "1" "2" "3", where 1=Low, etc. This can also be done with a find and replace. You may also need to simply remove some columns from your dataset because they cannot be used in PLS. For example, if you have email addresses or usernames in your dataset, those can simply be removed because they cannot be meaningfully converted into numeric data.
  • [3] If your numbers are too large in Excel, then simply decrease the number of decimals using the Decrease Decimal button. If you are using SPSS then you need to do some fancy copy and paste work. Copy the offending columns into Excel, reduce the number of decimals, then copy and paste the new values into those same columns in SPSS.
  • [4] If you have scientific notation, this is probably because you were using Excel at some point and the numbers were either formatted explicitly as "Scientific" or were inferred to be Scientific but formatted by default as "General". To fix this, simply change the formatting to "Number". See the picture below for how to access these formats in Excel 2010.
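Fixes [1], [3], and [4] can also be applied in one scripted pass. Here is a minimal Python sketch; the helper name and the 10-decimal cutoff are my own choices, not anything required by PLS-graph:

```python
def clean_cell(value, missing_code="-1", decimals=10):
    """Clean one cell for PLS-graph: recode blanks (fix 1), cap decimal
    length (fix 3), and expand scientific notation (fix 4). Non-numeric
    text (fix 2) is passed through for you to recode by hand."""
    value = value.strip()
    if value == "":
        return missing_code
    try:
        number = float(value)          # parses 3.23E-08 style input too
    except ValueError:
        return value
    return f"{number:.{decimals}f}".rstrip("0").rstrip(".")

row = ["", "0.978687677664826355281", "3.23E-08", "5", "Low"]
print([clean_cell(v) for v in row])
# ['-1', '0.9786876777', '0.0000000323', '5', 'Low']
```

Run this over every cell before writing the .raw file and the linking error should disappear (after you have recoded any remaining text by hand).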



Avoiding Crashes

PLS-graph tends to crash quite frequently if you are testing a complex model (over 30 variables) and/or have a large sample size (over 500). To fix this, you need to manually increase the amount of memory allocated for running the PLS algorithms. Go to Options --> Memory and then add a couple of zeros to each row. In the picture below, I've added two zeros to each row.


You may also want to just wait for a few seconds after the program runs before hitting the Okay button. This will give it time to settle, and will result in fewer crashes.

  • Above all SAVE OFTEN!

Sample Size Rule

PLS has a great advantage over Covariance Based Methods (CBM): it requires fewer datapoints to accurately estimate loadings. The rule for CBM is 10 times the number of parameters or variables in the model. So if you have 20 variables, then you need 200 usable rows in your dataset. In PLS, the rule is much looser: you need 10 times the number of indicators for the most predicted construct. So, for example, if you have a latent construct that is predicted by 6 indicators, another predicted by 3, and another predicted by 4, then you would only need 6 times 10, or 60 usable rows. If a construct is also being predicted in a causal model by other latent constructs, then those need to be counted as well. So, for example, in the model below, the required sample size would be 90: 70 for the seven measured indicators and 20 for the two latent predictors.
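As a worked example of this rule of thumb, here is a sketch of the wiki's additive calculation (my own helper, not an official formula):

```python
def pls_minimum_sample(indicators_on_target, latent_predictors_of_target):
    """This wiki's rule of thumb: 10 rows per indicator on the most
    predicted construct, plus 10 per latent construct that also predicts it."""
    return 10 * (indicators_on_target + latent_predictors_of_target)

print(pls_minimum_sample(6, 0))  # the simple 6-indicator case above: 60
print(pls_minimum_sample(7, 2))  # the model below: 70 + 20 = 90
```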


Factor Analysis

By factor analysis I mean the measurement and estimation of latent constructs, excluding any causal relationships between latent constructs. In PLS, latent constructs can be estimated formatively or reflectively, whereas in CBM all constructs are measured reflectively. The difference between these two types of models is important and should not be disregarded. For more information on reflective versus formative measures and models, please refer to the section on Formative vs. Reflective models. As for how to conduct a factor analysis in PLS-graph, it would be much simpler just to show you, so please see the video above.

Testing Causal Models

The first video listed above demonstrates the entire process of testing a causal model. The second video explains how to fix your bootstrapping options so that you can increase your t-statistic (and as a result, decrease your p-value).

The basic steps for testing causal models are as follows:

  • After doing a factor analysis in PLS-graph, connect the constructs using the connector tool
  • Change the inner weightings to Path (instead of Factor) - this is in the Options --> Run menu.
  • Run the model
  • Trim weak paths in the model
  • Run a bootstrap in order to obtain t-statistics, composite reliabilities, and AVEs.
  • Trim indicators based on the t-statistics
  • Compute p-values from t-statistic (using Excel's function: =T.DIST.2T(x,deg_freedom))
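For the last step, if you prefer scripting to Excel's =T.DIST.2T, a normal approximation works well for the large degrees of freedom a bootstrap typically yields (this is an approximation, not Excel's exact t-distribution):

```python
import math

def two_tailed_p(t_stat):
    """Normal approximation to Excel's =T.DIST.2T(x, deg_freedom); close
    for large degrees of freedom, such as bootstrapped t-statistics."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

print(round(two_tailed_p(1.96), 3))  # ≈ 0.05
```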

The basic steps for increasing your t-statistic are as follows:

  • Go to Options --> Resampling
  • Change the Number of Samples to be a number greater than your sample size
  • Change the Cases per Sample to be a number that is a majority of your sample size. Or, just put a zero there and the bootstrap will use the entire sample size, but include replacements (or estimated values) for removed cases. The latter option here (using zero) will usually give you the highest t-statistic.
  • Run the bootstrap now as usual.
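To see why these options matter, here is a minimal Python sketch of what a bootstrap does. It is illustrative only: PLS-graph bootstraps path coefficients, not a simple mean, and the numbers are made up.

```python
import random
import statistics

def bootstrap_t(values, n_resamples=500, seed=1):
    """Resample with replacement, collect the statistic (here just a mean),
    and form t = estimate / bootstrap standard error. Setting Cases per
    Sample to 0 in PLS-graph corresponds to using len(values) here."""
    random.seed(seed)
    means = [statistics.mean(random.choices(values, k=len(values)))
             for _ in range(n_resamples)]
    return statistics.mean(values) / statistics.stdev(means)

print(bootstrap_t([2, 3, 4, 5, 3, 4, 2, 5, 3, 4]) > 2)  # a clearly nonzero mean
```

More resamples stabilize the standard error, and resampling at the full sample size tends to shrink it, which is why the zero setting usually gives the highest t-statistic.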

Effect Strength (f-squared)

The t-statistic produced in pls-graph and used to calculate p-values is easily inflated when using a large sample size (something greater than 300). So you can run a model and have a path coefficient of 0.048 and yet the t-statistic will be significant. But a path coefficient of 0.048 is not practically significant, only statistically significant. In cases like these, the best thing to do is to calculate an f-squared to demonstrate the actual strength of the effect. The f-squared relies on the change in the r-squared, rather than on the size or significance of the path coefficient. The f-squared is calculated as follows:
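The formula is Cohen's f-squared: f² = (R²_included − R²_excluded) / (1 − R²_included), where "included" and "excluded" refer to running the model with and without the predictor of interest. In Python:

```python
def f_squared(r2_included, r2_excluded):
    """Cohen's f-squared: the change in R-squared when the predictor is
    included, scaled by the unexplained variance of the full model."""
    return (r2_included - r2_excluded) / (1 - r2_included)

print(round(f_squared(0.30, 0.25), 3))  # 0.05 / 0.70 ≈ 0.071
```

By convention, f² values around 0.02, 0.15, and 0.35 are read as small, medium, and large effects.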


I have made a quick tool for calculating this in Excel. It is in the EffectSize tab of the Stats Tools Package. It looks like this:


Testing Group Differences

Testing for differences between groups for a given causal model is a big pain in PLS... just as it is in CBM software like AMOS. Hopefully the video tutorial I've made will demystify the process. You will also need the Stats Tools Package Excel workbook referenced on the StatWiki homepage. The basic steps are as follows:

  • Use case selection to run the model for one group at a time. This can be found in the Options --> Run menu. For example, you might use gender as the grouping variable. In your dataset, gender should be indicated by a 1 or 2, where 1 = male and 2 = female. Then in PLS-graph you can select gender as the selection variable and specify a value (such as 1) to test for just one gender at a time. (see the picture below)
  • Then obtain the regression weight from running the model
  • To obtain the standard errors, you need to run a bootstrap using a separate dataset for each group, because bootstrapping in PLS-graph ignores the case selection setting (shown in the picture below).
  • Plug these values, along with the sample size for each group, into the Stats Tools Package X2 Threshold tab.
  • This will calculate a t-statistic and p-value for you. The larger the sample sizes, the smaller (more significant) the p-value.
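For reference, the calculation behind that tab is the parametric multigroup comparison commonly attributed to Chin. The sketch below is my reconstruction of that formula, so verify it against the Stats Tools Package before relying on it:

```python
import math

def group_diff_t(b1, se1, n1, b2, se2, n2):
    """Parametric multigroup test (reconstruction of the formula commonly
    attributed to Chin): pool the bootstrap standard errors, then test the
    difference in path coefficients with df = n1 + n2 - 2."""
    pooled_se = math.sqrt(((n1 - 1) ** 2 * se1 ** 2 + (n2 - 1) ** 2 * se2 ** 2)
                          / (n1 + n2 - 2))
    t_stat = (b1 - b2) / (pooled_se * math.sqrt(1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2

# Hypothetical numbers: paths of 0.45 (n=120) vs 0.20 (n=110).
t_stat, df = group_diff_t(0.45, 0.08, 120, 0.20, 0.10, 110)
print(round(t_stat, 2), df)  # ≈ 1.98, 228
```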


Handling Missing Data

PLS-graph cannot handle missing values that are left blank. If there are blank portions of your dataset, the data simply will not load. To avoid having to remove or impute all these missing values, you can just recode them using some constant number that is never used elsewhere in the dataset. For example, if your data comes from surveys that used 7-point Likert scales, then you could use the number 8 as the proxy for missing values, or the number -1, or 1,111,111,001, or whatever you want, as long as it isn't a number 1 through 7. Common practice is to use -1. In SPSS or Excel, just hit Control+H and replace all blanks with a -1. WARNING: in SPSS, this will only replace missing values within a specified column, whereas in Excel it will replace missing values for the entire dataset. In SPSS it looks like this, with the Find value blank, and the Replace value set to -1:


Then, in PLS-graph, you need to specify the value for missing data. This is done in the Options --> Run menu.
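The blank-to-constant recode described above can also be scripted. A minimal Python sketch for a tab-delimited file (the file names are hypothetical):

```python
def recode_missing(in_path, out_path, missing_code="-1"):
    """Replace every blank cell in a tab-delimited file with a constant
    that never occurs in the data, since PLS-graph cannot load blanks."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            cells = line.rstrip("\n").split("\t")
            dst.write("\t".join(c if c.strip() else missing_code
                                for c in cells) + "\n")

# Demo with a tiny made-up file: two blank cells become -1.
with open("demo.raw", "w") as f:
    f.write("5\t\t3\n\t2\t7\n")
recode_missing("demo.raw", "demo_recoded.raw")
print(open("demo_recoded.raw").read())
```

Whichever constant you choose here is the value you then declare as missing in the Options --> Run menu.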

Reliability and Validity

So how do you test for reliability and validity in PLS-graph? And what do you do with formative measures? There are different schools of thought, and different approaches. In the video above, I will show you one of these that is usually acceptable (depending on reviewers). The basic guidelines are as follows:

  • Reliability: This is demonstrated by Composite Reliability greater than 0.700.
  • Convergent Validity: This is demonstrated by loadings greater than 0.700, AVE greater than 0.500, and Communalities greater than 0.500.
  • Discriminant validity: This is demonstrated by the square root of the AVE being greater than any of the inter-construct correlations.
  • Formative Measures: Like I said, different schools of thought. Some say that Reliability and Convergent validity are actually flawed metrics when evaluating formative measures because formative measures do not necessarily have highly correlated indicators. However, the formative measure should have some common theme. Thus, I argue that for formative measures, high loadings and communalities should still be present in order to have a strong construct. Nevertheless, if you don't achieve the recommended thresholds, you can probably argue your case.

In the end, you want a table that looks something like this:


Common Method Bias

There are several different methods for testing whether the use of a common method introduced a bias into your data. My preferred method, which is probably the most accurate but also the most stringent, is to use a marker variable to draw out the common variance with theoretically unrelated constructs, which would point to some systematic variance explained by an external factor (such as a common method of data collection). To employ a marker variable in PLS-graph, you need to create a latent construct that is theoretically dissimilar to the other constructs in the model. For example, if I am doing a factor analysis with the following variables: Satisfaction, Burnout, Rejection, and Ethical Concerns, I can choose a marker variable like Apathy, and then look at the correlations between the other constructs and this construct. The correlations should be low - like less than 0.300. Squaring the highest correlation between the Marker and another construct will give you the maximum percentage of shared variance. Additionally, you can look at the correlations between the other factors. None of those correlations should be greater than 0.700 (for discriminant validity) and definitely no greater than 0.900 for common method bias.

So, given the correlation matrix below (from the .lst output), we can say that the maximum shared variance with the Marker variable is less than 1% (.075 squared), and none of the other correlations begin to approach the 0.900 threshold. Thus, there is no evidence that a common method bias exists.
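The arithmetic here is simple enough to script (a hypothetical helper; the correlations are made-up values in the spirit of the matrix above):

```python
def max_shared_variance(marker_correlations):
    """Square the largest absolute correlation between the marker variable
    and the substantive constructs."""
    return max(abs(r) for r in marker_correlations) ** 2

print(round(max_shared_variance([0.075, -0.041, 0.019]), 6))  # 0.005625, under 1%
```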



Testing Interactions

To perform an interaction in PLS-graph, you need to create an Interaction Construct that is composed of the products of the indicators for the IV and the moderating variable. The picture below on the left is the conceptual model we are testing. The picture below on the right is the way we measure it in PLS-graph.


  • Standardizing variables before multiplying them for interactions is no longer considered necessary, as the assumed benefit of reducing multicollinearity has been debunked in several recent articles.

To test the significance of the effect, just do a bootstrap like you would for any other effect, then calculate the p-value from the t-statistic as discussed in the Testing Causal Models section.
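Building the product indicators by hand is tedious, so here is a minimal Python sketch (the column names and values are hypothetical):

```python
from itertools import product

def product_indicators(iv_items, moderator_items):
    """Build the interaction construct's indicators: one new column for
    every (IV indicator x moderator indicator) pair."""
    return {f"{a}x{b}": [x * y for x, y in zip(iv_items[a], moderator_items[b])]
            for a, b in product(iv_items, moderator_items)}

# Hypothetical two-item IV and one-item moderator, two respondents each.
iv = {"iv1": [1, 2], "iv2": [3, 4]}
mod = {"mod1": [2, 2]}
print(product_indicators(iv, mod))  # {'iv1xmod1': [2, 4], 'iv2xmod1': [6, 8]}
```

Each resulting column becomes one indicator on the Interaction Construct in the measured model.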


Here are video demonstrations using SmartPLS for most of the analyses above:

And here is a pretty good set of slides made by Joseph Hair (as in Hair et al. 2010) about SEM and PLS, and he uses SmartPLS:

To cite any of the YouTube videos, refer to our IEEE TPC PLS article:

  • Paul Benjamin Lowry and James Gaskin (2014). “Partial least squares (PLS) structural equation modeling (SEM) for building and testing behavioral causal theory: When to choose it and how to use it,” IEEE Transactions on Professional Communication (accepted 04-Mar-2014).