
Feature/knowledge connector sharepoint #870

Open

tmccaughey wants to merge 7 commits into master from feature/knowledge-connector-sharepoint

Conversation

@tmccaughey
Contributor

Added the Knowledge Connector for Sharepoint

Copilot AI review requested due to automatic review settings January 29, 2026 10:35
@graymalkin77

graymalkin77 commented Jan 29, 2026

Snyk checks have passed. No issues have been found so far.

Scanner               Critical  High  Medium  Low  Total
Open Source Security  0         0     0       0    0 issues
Licenses              0         0     0       0    0 issues


@tmccaughey requested a review from falak-asad January 29, 2026 10:35
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a Knowledge Connector for Microsoft SharePoint that enables extracting and indexing documents from SharePoint sites into Cognigy's knowledge base system.

Changes:

  • Added SharePoint Knowledge Connector with support for multiple file types (PDF, DOCX, TXT, CSV, JSON, JSONL, MD, PPTX)
  • Implemented file extraction, text chunking, and knowledge source creation
  • Updated dependencies to include LangChain libraries for document processing

Reviewed changes

Copilot reviewed 11 out of 14 changed files in this pull request and generated 2 comments.

File Description
package.json Added base dependencies for extension tools and axios
extensions/microsoft-sharepoint/tsconfig.json Added skipLibCheck configuration
extensions/microsoft-sharepoint/src/module.ts Registered new connector and connection type
extensions/microsoft-sharepoint/src/knowledge-connectors/sharepoint-connector.ts Main connector implementation with file processing logic
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/utils/config.ts Configuration utilities for chunk size management
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/text_extractor.ts Document text extraction with LangChain loaders
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/text_chunker.ts Text splitting functionality for document chunking
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/list_files.ts SharePoint file listing with filtering
extensions/microsoft-sharepoint/src/knowledge-connectors/helpers/chunk_extractor.ts File download and chunk extraction orchestration
extensions/microsoft-sharepoint/src/connections/connectorConnection.ts Connection schema for SharePoint authentication
extensions/microsoft-sharepoint/package.json Updated dependencies including LangChain packages


fields: [
    { fieldName: "tenantId" },
    { fieldName: "clientId" },
    { fieldName: "clientSecret"},

Copilot AI Jan 29, 2026


Inconsistent spacing: missing space before the closing brace. Should be `{ fieldName: "clientSecret" },` to match the formatting of the other fields.

Contributor Author


done

…rs/chunk_extractor.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 29, 2026 10:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.



    }
  } catch (error) {
    // Continue with next file even if this one fails
    console.error(`Failed to process file ${file.name}:`, error);

Copilot AI Jan 29, 2026


The error message doesn't provide actionable context. Consider including the drive name and file path, and suggesting next steps.

Suggested change
-  console.error(`Failed to process file ${file.name}:`, error);
+  console.error(
+    `Failed to process file "${file.name}" (id: ${file.id}) in drive "${drive.name}" (id: ${driveId}). ` +
+    `Check that the file exists in SharePoint, that the connector has access permissions, and that the file type is supported.`,
+    error
+  );

Contributor Author


@falak-asad I might just remove the logs again?

Comment on lines 13 to 14
export const logger = {
  log: (level: string, context: any, message: string) => {

Copilot AI Jan 29, 2026


The custom logger object should have a more specific name like 'customLogger' or 'textExtractorLogger' to avoid confusion with standard console logging.
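As a sketch, the rename could look like this (the name `textExtractorLogger` follows the reviewer's suggestion and is illustrative, not code from the PR):

```typescript
// A more specific name avoids confusion with standard console logging.
export const textExtractorLogger = {
  log: (level: string, context: any, message: string) => {
    console.log(`[${level}]`, message);
  },
};
```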

Copilot AI review requested due to automatic review settings February 16, 2026 09:21
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 11 comments.



Comment on lines +153 to +157
if (!pptx)
  return [];
return [
  new Document({
    pageContent: pptx,

Copilot AI Feb 16, 2026


The parseOfficeAsync function from the officeparser package is called without any error handling beyond the try-catch wrapper. According to the library documentation, if the parsing fails, it may return an empty string or undefined. The code checks for !pptx but should also handle the case where pptx is an empty string to avoid creating empty knowledge chunks.

Suggested change
-  if (!pptx)
-    return [];
-  return [
-    new Document({
-      pageContent: pptx,
+  // Handle cases where parsing fails or returns empty/meaningless content
+  if (!pptx) {
+    return [];
+  }
+  const cleanedContent = removeUnnecessaryChars(pptx);
+  if (!cleanedContent) {
+    return [];
+  }
+  return [
+    new Document({
+      pageContent: cleanedContent,

  tags: sourceTags as string[],
  chunkCount: chunks.length,
  contentHashOrTimestamp: contentHash,
  externalIdentifier: file.name,

Copilot AI Feb 16, 2026


Potential issue with external identifier uniqueness. Using file.name as the externalIdentifier may cause collisions if files with the same name exist in different drives or folders. Consider using a more unique identifier such as a combination of driveId and file.id, or the full file path, to prevent potential conflicts and ensure proper cleanup logic.
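A minimal sketch of the composite identifier this comment suggests (the helper name and separator are illustrative, not part of the PR):

```typescript
// Hypothetical helper: build a collision-free external identifier from the
// drive ID and file ID rather than the (possibly duplicated) file name.
function makeExternalIdentifier(driveId: string, fileId: string): string {
  return `${driveId}:${fileId}`;
}

// Example: two files named "report.pdf" in different drives stay distinct,
// because the identifier is keyed on IDs rather than display names.
const idA = makeExternalIdentifier("driveA", "01ABC");
const idB = makeExternalIdentifier("driveB", "01ABC");
```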

} finally {
  // Clean up temp file
  if (fs.existsSync(tempFilePath)) {
    try { fs.unlinkSync(tempFilePath); } catch (e) { /* ignore cleanup errors */ }

Copilot AI Feb 16, 2026


The temp file cleanup may fail silently in the catch block, potentially leaving temporary files on disk if the unlink operation fails. While the error is caught to prevent crashing, accumulated temp files could eventually consume disk space. Consider logging a warning when cleanup fails to help diagnose issues in production.

Suggested change
-  try { fs.unlinkSync(tempFilePath); } catch (e) { /* ignore cleanup errors */ }
+  try {
+    fs.unlinkSync(tempFilePath);
+  } catch (e) {
+    console.warn(`Warning: Failed to remove temp file at ${tempFilePath}:`, e);
+  }

Comment on lines +66 to +76
// Convert split documents to ChunkContent format
const chunks: ChunkContent[] = splitDocuments
  .slice(0, MAX_CHUNKS_PER_FILE)
  .map((doc, index) => ({
    text: doc.pageContent,
    data: {
      title: `${fileName} - Part ${index + 1}`,
      source: fileName,
      fileType: fileExtension,
    },
  }));

Copilot AI Feb 16, 2026


Hard-coded chunk limit of 500 chunks per file may not be appropriate for all use cases. Large documents could be truncated without warning to the user. Consider either making this configurable through the connector fields or logging a warning when a file exceeds this limit so users are aware that content is being truncated.

Suggested change
-  // Convert split documents to ChunkContent format
-  const chunks: ChunkContent[] = splitDocuments
-    .slice(0, MAX_CHUNKS_PER_FILE)
-    .map((doc, index) => ({
-      text: doc.pageContent,
-      data: {
-        title: `${fileName} - Part ${index + 1}`,
-        source: fileName,
-        fileType: fileExtension,
-      },
-    }));
+  const totalChunks = splitDocuments.length;
+  if (totalChunks > MAX_CHUNKS_PER_FILE) {
+    console.warn(
+      `Truncating SharePoint file "${fileName}": ` +
+      `${totalChunks} chunks generated, but only the first ` +
+      `${MAX_CHUNKS_PER_FILE} will be kept. ` +
+      `${totalChunks - MAX_CHUNKS_PER_FILE} chunks will be discarded.`
+    );
+  }
+  // Convert split documents to ChunkContent format
+  const limitedDocuments = splitDocuments.slice(0, MAX_CHUNKS_PER_FILE);
+  const chunks: ChunkContent[] = limitedDocuments.map((doc, index) => ({
+    text: doc.pageContent,
+    data: {
+      title: `${fileName} - Part ${index + 1}`,
+      source: fileName,
+      fileType: fileExtension,
+    },
+  }));

Comment on lines 105 to 151
// Process each file
for (const file of files) {
  try {
    const fileExtension = path.extname(file.name).slice(1).toLowerCase();

    // Get chunks for this file
    const chunks = await getSharePointFileChunks(
      accessToken,
      driveId,
      file.id,
      file.name,
      fileExtension
    );

    if (chunks.length === 0) continue;

    // Compute content hash to support safe upserts
    const contentHash = createContentHash(chunks);

    // Upsert knowledge source so we can skip re-ingestion if unchanged
    const knowledgeSource = await api.upsertKnowledgeSource({
      name: file.name,
      description: `Data from ${file.name} in SharePoint library ${drive.name}`,
      tags: sourceTags as string[],
      chunkCount: chunks.length,
      contentHashOrTimestamp: contentHash,
      externalIdentifier: file.name,
    });

    if (knowledgeSource) {
      // Create all chunks (the runtime may optimise if content unchanged)
      for (const chunk of chunks) {
        await api.createKnowledgeChunk({
          knowledgeSourceId: knowledgeSource.knowledgeSourceId,
          text: chunk.text,
          data: chunk.data,
        });
      }
    }

    // Track processed external identifiers for cleanup
    newSources.push(file.name);
  } catch (error) {
    // Continue with next file even if this one fails
    console.error(`Failed to process file ${file.name}:`, error);
    continue;
  }

Copilot AI Feb 16, 2026


Sequential processing of files and chunks with await in loops will cause significant performance issues for SharePoint libraries with many files. Each file is processed one at a time, and within each file, chunks are created one at a time. For a library with hundreds of files, this could take hours. Consider using Promise.all() or batch processing to process multiple files in parallel, and batch-create chunks to improve performance.

Suggested change
-  // Process each file
-  for (const file of files) {
-    try {
-      const fileExtension = path.extname(file.name).slice(1).toLowerCase();
-      // Get chunks for this file
-      const chunks = await getSharePointFileChunks(
-        accessToken,
-        driveId,
-        file.id,
-        file.name,
-        fileExtension
-      );
-      if (chunks.length === 0) continue;
-      // Compute content hash to support safe upserts
-      const contentHash = createContentHash(chunks);
-      // Upsert knowledge source so we can skip re-ingestion if unchanged
-      const knowledgeSource = await api.upsertKnowledgeSource({
-        name: file.name,
-        description: `Data from ${file.name} in SharePoint library ${drive.name}`,
-        tags: sourceTags as string[],
-        chunkCount: chunks.length,
-        contentHashOrTimestamp: contentHash,
-        externalIdentifier: file.name,
-      });
-      if (knowledgeSource) {
-        // Create all chunks (the runtime may optimise if content unchanged)
-        for (const chunk of chunks) {
-          await api.createKnowledgeChunk({
-            knowledgeSourceId: knowledgeSource.knowledgeSourceId,
-            text: chunk.text,
-            data: chunk.data,
-          });
-        }
-      }
-      // Track processed external identifiers for cleanup
-      newSources.push(file.name);
-    } catch (error) {
-      // Continue with next file even if this one fails
-      console.error(`Failed to process file ${file.name}:`, error);
-      continue;
-    }
+  // Process each file in small concurrent batches to avoid sequential bottlenecks
+  const FILE_BATCH_SIZE = 5;
+  for (let i = 0; i < files.length; i += FILE_BATCH_SIZE) {
+    const batch = files.slice(i, i + FILE_BATCH_SIZE);
+    await Promise.all(
+      batch.map(async (file) => {
+        try {
+          const fileExtension = path.extname(file.name).slice(1).toLowerCase();
+          // Get chunks for this file
+          const chunks = await getSharePointFileChunks(
+            accessToken,
+            driveId,
+            file.id,
+            file.name,
+            fileExtension
+          );
+          if (chunks.length === 0) {
+            return;
+          }
+          // Compute content hash to support safe upserts
+          const contentHash = createContentHash(chunks);
+          // Upsert knowledge source so we can skip re-ingestion if unchanged
+          const knowledgeSource = await api.upsertKnowledgeSource({
+            name: file.name,
+            description: `Data from ${file.name} in SharePoint library ${drive.name}`,
+            tags: sourceTags as string[],
+            chunkCount: chunks.length,
+            contentHashOrTimestamp: contentHash,
+            externalIdentifier: file.name,
+          });
+          if (knowledgeSource) {
+            // Create all chunks concurrently (the runtime may optimise if content unchanged)
+            await Promise.all(
+              chunks.map((chunk) =>
+                api.createKnowledgeChunk({
+                  knowledgeSourceId: knowledgeSource.knowledgeSourceId,
+                  text: chunk.text,
+                  data: chunk.data,
+                })
+              )
+            );
+          }
+          // Track processed external identifiers for cleanup
+          newSources.push(file.name);
+        } catch (error) {
+          // Continue with next file even if this one fails
+          console.error(`Failed to process file ${file.name}:`, error);
+          return;
+        }
+      })
+    );

{ fieldName: "tenantId" },
{ fieldName: "clientId" },
{ fieldName: "clientSecret" },


Copilot AI Feb 16, 2026


Extra blank line before the closing bracket. Remove the blank line at line 10 to maintain consistent formatting with other connection schemas in the codebase.

Suggested change: remove the trailing blank line.
import * as path from 'path';
import * as crypto from "crypto";



Copilot AI Feb 16, 2026


Extra blank line after imports. Remove the extra blank line at line 8 to follow consistent formatting patterns seen in other knowledge connectors in the codebase.

Suggested change: remove the extra blank line.
"sourceMap": true,
"moduleResolution": "node",
"watch": false
"skipLibCheck": true

Copilot AI Feb 16, 2026


The 'watch' property was removed from tsconfig.json. While 'skipLibCheck' was added (which is good for build performance), removing 'watch: false' changes the default TypeScript behavior. If watch mode was explicitly disabled before, this change may cause unintended watch mode activation. Verify that this change is intentional and won't affect the build process.

Comment on lines 147 to 150
} catch (error) {
  // Continue with next file even if this one fails
  console.error(`Failed to process file ${file.name}:`, error);
  continue;

Copilot AI Feb 16, 2026


The error handling silently continues on failure but error messages may contain sensitive information or stack traces that could expose internal details. While the continuation logic is correct to keep processing other files, ensure that error logging is controlled in production environments to avoid exposing sensitive data.
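One way to keep the continue-on-error behavior while limiting what reaches production logs — a sketch only; the helper name and the `SHAREPOINT_CONNECTOR_DEBUG` flag are illustrative assumptions, not part of the PR:

```typescript
// Hypothetical helper: emit full error details (including stack traces)
// only when debugging is explicitly enabled; otherwise log just the message.
function logFileError(fileName: string, error: unknown): void {
  const message = error instanceof Error ? error.message : String(error);
  if (process.env.SHAREPOINT_CONNECTOR_DEBUG === "true") {
    console.error(`Failed to process file ${fileName}:`, error);
  } else {
    console.error(`Failed to process file ${fileName}: ${message}`);
  }
}
```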

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 18, 2026 09:49
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 7 comments.



@@ -0,0 +1 @@
{}

Copilot AI Feb 18, 2026


This root-level package.json is empty, but the root package-lock.json declares dependencies. This mismatch will break reproducible installs (npm will consider the lock out of sync) and is inconsistent with the repo’s per-extension package.json structure. Either remove the root package.json/package-lock.json entirely, or define the intended root dependencies and regenerate the lockfile from that package.json.

Suggested change
- {}
+ {
+   "name": "root",
+   "version": "1.0.0",
+   "private": true
+ }

Comment on lines +35 to +49
const url = `${baseUrl}?$top=1000`;

const response = await axios.get(url, {
  headers: {
    'Authorization': `Bearer ${accessToken}`,
    'Content-Type': 'application/json'
  }
});

if (!response.data.value) {
  return [];
}

// Filter for files only (not folders), non-empty, and supported types
const sharePointFiles: SharePointFile[] = response.data.value

Copilot AI Feb 18, 2026


The header comment claims pagination support, but the implementation only fetches the first page (?$top=1000) and ignores @odata.nextLink. Either implement pagination by following @odata.nextLink until exhausted, or update the comment to avoid promising behavior that doesn’t exist.

Suggested change
-  const url = `${baseUrl}?$top=1000`;
-  const response = await axios.get(url, {
-    headers: {
-      'Authorization': `Bearer ${accessToken}`,
-      'Content-Type': 'application/json'
-    }
-  });
-  if (!response.data.value) {
-    return [];
-  }
-  // Filter for files only (not folders), non-empty, and supported types
-  const sharePointFiles: SharePointFile[] = response.data.value
+  let url = `${baseUrl}?$top=1000`;
+  const allItems: any[] = [];
+  while (url) {
+    const response = await axios.get(url, {
+      headers: {
+        'Authorization': `Bearer ${accessToken}`,
+        'Content-Type': 'application/json'
+      }
+    });
+    if (!response.data || !response.data.value) {
+      break;
+    }
+    allItems.push(...response.data.value);
+    const nextLink = response.data['@odata.nextLink'];
+    url = nextLink && typeof nextLink === 'string' ? nextLink : '';
+  }
+  if (allItems.length === 0) {
+    return [];
+  }
+  // Filter for files only (not folders), non-empty, and supported types
+  const sharePointFiles: SharePointFile[] = allItems

Comment on lines 65 to 68
// Log first few files for debugging
sharePointFiles.slice(0, 3).forEach((file, index) => {
  console.log(`File ${index + 1}: ${file.name} (${file.size} bytes)`);
});

Copilot AI Feb 18, 2026


This debug logging will run on every connector sync and can create noisy logs (and potentially leak document names). Consider removing it or gating it behind an explicit debug flag (e.g., env var) so production runs stay quiet by default.

Suggested change
-  // Log first few files for debugging
-  sharePointFiles.slice(0, 3).forEach((file, index) => {
-    console.log(`File ${index + 1}: ${file.name} (${file.size} bytes)`);
-  });
+  // Log first few files for debugging (enabled only when explicitly requested)
+  if (process.env.SHAREPOINT_CONNECTOR_DEBUG === 'true') {
+    sharePointFiles.slice(0, 3).forEach((file, index) => {
+      console.log(`File ${index + 1}: ${file.name} (${file.size} bytes)`);
+    });
+  }

Comment on lines 125 to 135
// split document into paragraphs according to specified or default splitter
const splitDocuments = (
  await splitDocs(docs)
).map((doc) => doc.pageContent);

// join the paragraphs into the format we want
const textParagraphs = splitDocuments.join('\n\n');

logger.log("info", null, "Successfully used langchain to extract content");

return textParagraphs;

Copilot AI Feb 18, 2026


lsExtractor already splits documents into chunks (via splitDocs) and then rejoins them into a single string, but getSharePointFileChunks immediately calls splitDocs again on that returned string. This duplicates work and makes it unclear where chunking is supposed to happen. Consider having lsExtractor return extracted text only (no chunking), or return split Document[] and let chunk_extractor be the single place that chunks.

Suggested change
-  // split document into paragraphs according to specified or default splitter
-  const splitDocuments = (
-    await splitDocs(docs)
-  ).map((doc) => doc.pageContent);
-  // join the paragraphs into the format we want
-  const textParagraphs = splitDocuments.join('\n\n');
-  logger.log("info", null, "Successfully used langchain to extract content");
-  return textParagraphs;
+  // Join cleaned document contents into a single string; downstream code is responsible for chunking
+  const fullText = docs.map((doc) => doc.pageContent).join('\n\n');
+  logger.log("info", null, "Successfully used langchain to extract content");
+  return fullText;

{ fieldName: "tenantId" },
{ fieldName: "clientId" },
{ fieldName: "clientSecret" },


Copilot AI Feb 18, 2026


Minor: there’s an extra blank line in the fields array, and indentation differs from the other connection schema files in this extension. Cleaning this up will keep formatting consistent and reduce diffs over time.

Suggested change: remove the blank line and fix the indentation.

// Save to temp file (text_extractor needs file path)
const tempDir = os.tmpdir();
const tempFileName = `${Date.now()}_${fileName}`;

Copilot AI Feb 18, 2026


The temp file name is derived from fileName. To avoid filesystem issues (invalid characters/very long names) and to guarantee uniqueness, prefer generating the temp file name from stable IDs (e.g., fileId) and/or sanitize the original file name before using it on disk.

Suggested change
-  const tempFileName = `${Date.now()}_${fileName}`;
+  const tempFileName = `${fileId}_${Date.now()}`;

const tempFileName = `${Date.now()}_${fileName}`;
const tempFilePath = path.join(tempDir, tempFileName);

fs.writeFileSync(tempFilePath, bodyContents as any);

Copilot AI Feb 18, 2026


This uses synchronous filesystem I/O (writeFileSync) inside an async workflow. When processing many files, this will block the Node.js event loop and can significantly reduce throughput. Prefer await fs.promises.writeFile(...) (and similar async APIs for cleanup) to keep the connector responsive.
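A sketch of the async alternative the comment describes, using Node's `fs.promises` API; the helper names are illustrative, not code from the PR:

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper: write downloaded bytes to a temp file without
// blocking the event loop, returning the path for the text extractor.
async function writeTempFile(name: string, contents: Buffer): Promise<string> {
  const tempFilePath = path.join(os.tmpdir(), `${Date.now()}_${name}`);
  await fs.promises.writeFile(tempFilePath, contents);
  return tempFilePath;
}

// Cleanup is likewise async, with a warning instead of a silent failure.
async function removeTempFile(tempFilePath: string): Promise<void> {
  try {
    await fs.promises.unlink(tempFilePath);
  } catch (e) {
    console.warn(`Failed to remove temp file at ${tempFilePath}:`, e);
  }
}
```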
