An Evaluation of Web Mining Application in
Social Networks

Sreedhar Appalabatla; Dr. Naveen Kumar; Dr. Mungamuru Nirmala

Research Scholar, Dept. of Computer Science and Engineering, University of Allahabad, India
Professor, Dept. of Computer Science and Engineering, University of Allahabad, India
Assistant Professor, Dept. of Computing, Adama Science and Technology University, Adama, Ethiopia

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

Social Networks have grown rapidly in recent years. They provide functionality such as listings of friends or business contacts, content sharing, profile browsing, discussion, and messaging tools. Interoperability among Social Networks is a key challenge; the Google-powered Open Social alliance has partly solved it and opened up new strategies for gathering data from Social Network users. In this paper, we discuss the Open Social functionality and combine it with a filtering and ranking algorithm to enhance email management. We also analyze the traffic-weighted Web host graph obtained from a large sample of real Web users. This complex dynamic network reveals a number of interesting structural properties, some in line with the well-studied Boolean link host graph and others pointing to important differences.

Keywords

Open Social Network Dataset (OSND), Latent Semantic Analysis (LSA), Web traffic, Web host graph.

I. INTRODUCTION

Social Networks provide the means to explicitly create and manage connections based on information gathered and stored in user profiles. Social Networks and Semantic Social Networks [7] have emerged as a second generation of online communities, succeeding mailing lists, Usenet, and bulletin boards, and provide services such as listings of friends or business contacts, content sharing, profile browsing, discussion, and messaging tools. They are also part of the recent breed of technologies built around user-generated content, grouped under the “Web 2.0” umbrella, which provide a large amount of metadata and information about each user as an individual entity.

However, these applications do not address fundamental problems of information overload, such as email hoarding or lack of management; instead, they add to the burden. On the other hand, efforts such as [5] and [6] are under way to examine email filtering and ranking based on social networks. In addition, semantic technologies are maturing: ontologies [1], their backbone technology, provide a formal representation of a domain. The shift enabled by machine-understandable ontologies can outperform current approaches, which must find data spread across the Web or draw inferences dynamically and are continually hampered by their reliance on ad hoc data frameworks. This paper presents a Google-powered, Open Social based strategy in the context of Social Network user information.

Open Social is an application programming interface for building social applications across the Web; in other words, it is a common set of APIs for social applications across multiple websites. Open Social is currently being developed by Google in conjunction with members of the web community. The ultimate goal is for any social website to be able to implement the APIs and host third-party social applications. Many websites implement Open Social, including Engage.com, Friendster, hi5, Hyves, imeem, LinkedIn, MySpace, Bebo, Ning, Oracle, orkut, Plaxo, Salesforce.com, Six Apart, Tianji, Viadeo, and XING [3]. Open Social is not a social network itself; rather, it is a set of three common APIs that allow developers to access the following core functions and information on social networks:

• People and Friends data API

• Activities data API

• Persistence data API

The Open Social Network Dataset (OSND) is a lightweight ontology used for collaborative data filtering and rating. We follow an integrated approach that combines three techniques to improve its construction from the tag sets gathered from the aforementioned Web 2.0 social networks, such as Engage.com, Friendster, and hi5. The three techniques are as follows:

• Applying the Vector Space Model

• Using Latent Semantic Analysis (LSA)

• Validating the set of terms pertaining to the OSND with online lexical resources, such as WordNet
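As a rough illustration of the first technique, the sketch below represents each profile's tag set as a sparse TF-IDF vector and compares profiles by cosine similarity. The tag lists, the function names, and the smoothed IDF weighting are assumptions made for this example; the paper does not specify them. The LSA step (a truncated singular value decomposition of the resulting term-document matrix) is not shown.

```python
import math
from collections import Counter

def tfidf_vectors(tag_sets):
    """Map each profile's tag list to a sparse TF-IDF vector (dict: tag -> weight)."""
    n_docs = len(tag_sets)
    # Document frequency: number of profiles in which each tag occurs.
    doc_freq = Counter(tag for tags in tag_sets for tag in set(tags))
    vectors = []
    for tags in tag_sets:
        counts = Counter(tags)
        # Smoothed IDF (assumed weighting) so that a tag present in every
        # profile keeps a small positive weight instead of vanishing.
        vectors.append({
            tag: count * (math.log(n_docs / doc_freq[tag]) + 1.0)
            for tag, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[tag] for tag, weight in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tag sets standing in for tags gathered from user profiles.
tag_sets = [
    ["web", "mining", "python"],
    ["web", "mining", "graph"],
    ["music", "guitar"],
]
vectors = tfidf_vectors(tag_sets)
related = cosine(vectors[0], vectors[1])    # profiles sharing two tags
unrelated = cosine(vectors[0], vectors[2])  # profiles sharing no tags
```

Profiles with overlapping tags score above zero, while profiles with disjoint tag sets score exactly zero. That zero score for disjoint vocabularies is the limitation LSA is meant to soften, by relating tags that co-occur across profiles.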

II. RELATED WORK

Many studies have used Web crawlers to reveal important insights into the large-scale structure of the Web graph, such as the “bow-tie” model, the presence of self-similar structures and scale-free distributions, and its small-world topology [2, 4, 1, 6, 5]. While these insights have informed the design of a variety of applications such as crawlers and caching proxy servers, structural analysis has seen its greatest application in ranking pages returned by search engines. In particular, the well-known PageRank and HITS algorithms use the pattern of links connecting pages to rank them without needing to process their contents; these algorithms have inspired a vast amount of research into ranking algorithms based on link structure [8, 9].
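To make the idea of ranking from link structure alone concrete, the following is a minimal sketch of PageRank by power iteration. It is the textbook formulation, not the variant used by any particular search engine; the damping factor 0.85, the fixed iteration count, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links: dict mapping each node to the list of nodes it links to.
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            # Each out-link passes on an equal share of the source's rank.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph (hypothetical): page contents are never inspected.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
top_page = max(scores, key=scores.get)
```

In this toy graph the page with the most incoming links, `c`, receives the highest score, and `d`, which nothing links to, receives only the uniform baseline.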

The structural properties of the link graph extend to the host graph, which considers the connectivity of entire Web servers rather than individual pages [10]. The earliest behavioral studies used browser logs to characterize user navigation patterns, time spent on pages, bookmark usage, page revisit frequencies, and overlap among user paths [11, 12]. The most direct source of behavioral data is the logs of Web servers, which have been used for applications such as personalization and improving caching behavior [13]. Because search engines serve a central role in users’ navigation, their log data is particularly useful in improving results based on user behavior [14, 15].

Ranking Web pages and sites is one of the most critical tasks of any search engine. The last decade brought major advances in Web search technology, owing in large part to the development of sophisticated ranking techniques. While modern search engines have likely refined and improved on PageRank, in addition to combining it with many other criteria, it remains a reference tool for the study of the Web as a complex dynamic network, as well as for the engineering of improved ranking functions.

III. EVALUATION

In principle, our experimental setup can capture the entire URLs of the referring and requested pages and build a weighted link graph with pages as nodes. This is indeed our goal. In this paper, however, we report on an initial stage in which we focus on the host graph. One reason is that this is more feasible with our current storage and computing resources, and indeed necessary to tune our collection and analysis algorithms; another is that the host graph already reveals several interesting insights about Web traffic. The Web host graphs are stored as sparse connectivity matrices for analysis in Matlab. In the visualizations, node size is proportional to the log of the traffic to each site, and edge thickness is proportional to the log of the number of clicks on links between two sites.
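The construction described above can be sketched as follows. This is a minimal Python illustration rather than the Matlab pipeline used in the study: it reduces (referrer URL, requested URL) pairs to hosts and accumulates click counts in a sparse nested dictionary. The record format, the function names, and the example URLs are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def host_of(url):
    """Return the lower-cased host of a URL, or None if there is none."""
    return urlsplit(url).hostname if url else None

def build_host_graph(clicks):
    """Build a sparse weighted host graph from click records.

    clicks: iterable of (referrer_url, requested_url) pairs; the referrer
    may be None or empty when the request carried no referrer.
    Returns weights such that weights[src_host][dst_host] is the number of
    clicks observed from src_host to dst_host.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for referrer, requested in clicks:
        src, dst = host_of(referrer), host_of(requested)
        if src is None or dst is None:
            # No usable referrer (typed URL, bookmark, email link): no edge.
            continue
        weights[src][dst] += 1
    return weights

# Hypothetical click records.
clicks = [
    ("http://a.example/x", "http://b.example/y"),
    ("http://a.example/z", "http://b.example/"),
    (None, "http://b.example/"),
    ("http://b.example/", "http://c.example/"),
]
host_graph = build_host_graph(clicks)
```

Only host pairs that were actually observed are stored, which plays the same role as the sparse matrix representation. Whether clicks within a single host should count as self-loops is left open here; this sketch keeps them.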

A. Structural Properties

The click data was collected over a period. Table 1 offers a view of a small portion of the resulting weighted host graph, consisting of the most popular destination sites and the most clicked links between them.

We first report on general properties of this data and on the structure of the weighted host graph. Each human page click involves an average of 14.2 HTTP requests for embedded media files, style sheets, script files, and so on. One notable observation is that a majority of human-generated clicks do not have a referrer page, meaning that users type the URL directly, click on a bookmark, or click on a link in an email.

The first question about the host graph reconstructed from our sample of traffic is whether it recovers the well-known topological features of the link graphs built from large-scale crawls [2, 4, 6]. The most stable signature of the Web graph is its scale-free in-degree distribution, which many studies consistently report as being well fitted by a power law with exponent $\gamma$. As shown in Figure 3, this behavior is recovered from the FULL host graph ($\gamma = 2.2 \pm 0.1$); although Web traffic may not follow every link, it produces a picture of the Web that is topologically consistent with those obtained from large-scale crawls. The power-law in-degree distribution in the HUMAN host graph has a slightly larger exponent, $\gamma = 2.3 \pm 0.1$. This hints at an important caveat: while the structure of the traffic-induced and crawler-induced networks may be similar, they are based on very different sampling procedures, each with its own biases.

One cannot compare the two networks directly on a node-by-node basis. To illustrate this point, we sampled nodes from the HUMAN graph and compared their in-degree with that given by a search engine (via the Yahoo API). As is evident from the scatter plot in Figure 3, the correlation is weak (Pearson’s R = 0.26 on the log-values), and we cannot assume proportionality. If one conjectures a power-law scaling

$$k_{in} \sim \hat{k}_{in}^{\,\eta},$$

where $\hat{k}_{in}$ is the in-degree obtained from crawl data, we see that a sublinear bias $\eta < 1$ fits the data better than proportionality $\eta = 1$. While we cannot say that such a power-law scaling is the most appropriate model of the relationship, this does highlight a sample bias whereby the in-degree of popular nodes is underestimated by a greater amount than that of low-degree nodes. The lack of proportionality explains the higher exponent in the power-law distribution of in-degree. Assuming again that $k_{in}$ and $\hat{k}_{in}$ are deterministically related by the power formula conjectured above, and writing the crawl-based distribution as $\Pr(\hat{k}_{in}) \sim \hat{k}_{in}^{-\hat{\gamma}}$, it follows immediately that

$$\Pr(k_{in})\, dk_{in} = \Pr(\hat{k}_{in})\, d\hat{k}_{in}.$$

Therefore

$$\Pr(k_{in}) \sim k_{in}^{-\hat{\gamma}/\eta}\, k_{in}^{1/\eta - 1} = k_{in}^{-\left[1 + (\hat{\gamma} - 1)/\eta\right]}$$

and thus the $k_{in}$ exponent changes to

$$\gamma = 1 + \frac{\hat{\gamma} - 1}{\eta} > \hat{\gamma} \quad \text{for } \eta < 1 \text{ and } \hat{\gamma} > 1.$$
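The effect of a sublinear bias on the exponent can be checked numerically. Under the conjectured scaling $k_{in} \sim \hat{k}_{in}^{\,\eta}$, a crawl exponent $\hat{\gamma}$ maps to $1 + (\hat{\gamma} - 1)/\eta$. The sketch below draws synthetic "crawl" degrees from a continuous power law, applies the scaling, and re-estimates the exponent by maximum likelihood. The values $\hat{\gamma} = 2.1$ and $\eta = 0.8$ are illustrative assumptions, not values reported in this paper, and the continuous approximation ignores the fact that real degrees are integers.

```python
import math
import random

def transformed_exponent(gamma_hat, eta):
    """Exponent of k = k_hat**eta when k_hat follows a power law with exponent gamma_hat."""
    return 1.0 + (gamma_hat - 1.0) / eta

def sample_power_law(gamma, n, rng, x_min=1.0):
    """Draw n values from a continuous power law p(x) ~ x**(-gamma), x >= x_min."""
    # Inverse-transform sampling; 1 - rng.random() lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]

def mle_exponent(values, x_min=1.0):
    """Maximum-likelihood estimate of a continuous power-law exponent."""
    return 1.0 + len(values) / sum(math.log(v / x_min) for v in values)

rng = random.Random(0)       # fixed seed so the run is repeatable
gamma_hat, eta = 2.1, 0.8    # illustrative values only (assumptions)

crawl_degrees = sample_power_law(gamma_hat, 100_000, rng)
traffic_degrees = [k ** eta for k in crawl_degrees]

predicted = transformed_exponent(gamma_hat, eta)
estimated = mle_exponent(traffic_degrees)
```

With these inputs the predicted exponent is 1 + 1.1/0.8 = 2.375, and the estimate from the transformed sample should agree with it to within sampling error. This is the direction of the shift described above: a sublinear bias steepens the observed in-degree distribution.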

The difference between our network representation of the Web host graph and that obtained from crawls, of course, is that we have weighted edges telling how many times links between hosts are clicked. For weighted networks, the notion of degree is generalized to that of strength, defined as the sum of the weights over incoming or outgoing links:

$$s_{in}(j) = \sum_{i} w_{ij}, \qquad s_{out}(i) = \sum_{j} w_{ij},$$

where $w_{ij}$ is the weight of edge (i, j), i.e., the number of clicks on the link from host i to host j. Note that because $s_{in}(j)$ represents the total number of times that site j is visited, this is what we refer to by the less formal term traffic.
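The strength definition translates directly into code. The sketch below assumes the weighted host graph is available as a dictionary keyed by (i, j) host pairs; that storage format, the function name, and the host names are assumptions for illustration.

```python
from collections import defaultdict

def strengths(edge_weights):
    """Compute in-strength and out-strength for every host.

    edge_weights: dict mapping (i, j) -> w_ij, the number of clicks on
    links from host i to host j.
    Returns (s_in, s_out), two dicts mapping host -> strength.
    """
    s_in = defaultdict(int)
    s_out = defaultdict(int)
    for (i, j), w in edge_weights.items():
        s_out[i] += w   # sum over outgoing links of i
        s_in[j] += w    # sum over incoming links of j (the "traffic" to j)
    return dict(s_in), dict(s_out)

# Hypothetical weighted edges between three hosts.
edge_weights = {
    ("a.example", "b.example"): 5,
    ("c.example", "b.example"): 2,
    ("b.example", "a.example"): 1,
}
s_in, s_out = strengths(edge_weights)
```

Here `b.example` has an in-degree of 2 but an in-strength of 7, which is the distinction between counting links and counting the clicks that travel over them.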

IV. RESULTS AND DISCUSSIONS

All comment sections have been left as they were entered on the web, including most spelling errors, so that the responses can be seen as they were received; only a few irrelevant results have been removed. In total, 232 people responded to the questionnaire. Students formed the majority, making up 60% of respondents; 32% were staff, and the remaining 8% fell into the "other" category. The "other" category contained the following responses: Govt Agent, PhD student, Research student, university alumni, site visitor, university Graduate applicant, Graduate, Parent of student, Prospective pg student, etc.

The majority of respondents (75%) were aged between 19 and 35; the second largest group was aged between 35 and 50.

Internet Explorer is the most popular browser software used to navigate our site.

E. Search Facility

The search function is not used at all by 22% of respondents; 43% use it sometimes, but only 14% use it frequently or always.

V. CONCLUSIONS AND FUTURE WORK

An effort has been made to analyze Open Social networks. The initial stage of this research is presented in brief in this paper. Researchers have been quick to recognize that structural analysis of the Web can become far more useful when combined with behavioral data. Some paths through the Web are used far more heavily than others, and a variety of behavioral data sources exist that can allow researchers to identify these paths and improve Web models accordingly. The earliest efforts used browser logs to characterize user navigation patterns, time spent on pages, bookmark usage, page revisit frequencies, and overlap among user paths. Because search engines serve a central role in users’ navigation, their log data is particularly useful in improving results based on user behavior. However, social network applications do not address fundamental problems of information overload, such as email hoarding or lack of management; instead, they add to the burden.

References

[1] Fensel, D., Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce, Springer-Verlag, 2002.

[2] University of California Police Department, You've Got Spam: How to Avoid Unwanted Email.

[3] OpenSocial - Google Code official web site. http://code.google.com/apis/opensocial/

[4] Joseph S. Kong, Behnam A. Rezaei, Nima Sarshar, and Vwani P. Roychowdhury, Let Your Cyber Alter Ego Share Information and Manage Spam, 2005.

[5] Golbeck, J. and Hendler, J., Reputation Network Analysis for Email Filtering, Proceedings of the Conference on Email and Anti-Spam, Mountain View, California, USA, 2004.

[6] Ankolekar, A., Krötzsch, M., and Vrandecic, D., The two cultures: Mashing up web 2.0 and the semantic web, Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 08-12, 2007, ACM Press, New York, NY, pp. 825-834.

[7] Gomez, J.M., Colomo, R., Ruiz, B. Garcia, A Semantics-based Social Network for Software Project, International Journal of Information Technology and Management, Special issue: Work Change in the Era of ICTs, 2007.

[8] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41(6), pp. 391-407, 1990.

[9] Marsh, S., Formalizing Trust as a Computational Concept, PhD thesis, Department of Mathematics and Computer Science, University of Stirling, 1994.

[10] Heymann, P. and Garcia-Molina, H., Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems, Technical Report, Stanford University, 2006.

[11] Gomez, J.M., Colomo, R., Alor-Hernandez, G., Posada-Gomez, R., Garcia, A Search in the Eye of the Beholder: Using the Personal Social Dataset and Ontology-guided Input to Improve Web Search Efficiency, Proceedings of the 5th IEEE Latin-American Web Conference (LA-WEB07), Santiago de Chile, Chile, October 31 - November 2, 2007.

[12] Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, and Anand Swaminathan, Mining Email Social Networks, MSR'06, May 22-23, 2006, Shanghai, China.

[13] Ziegler, Cai-Nicolas and Georg Lausen, Spreading Activation Models for Trust Propagation, Proceedings of the IEEE International Conference on E-Technology, 2004.

[14] www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx

[15] Richardson, Matthew, Rakesh Agrawal, and Pedro Domingos, Trust Management for the Semantic Web, Proceedings of the Second International Semantic Web Conference, Sanibel Island, Florida, 2003.